# Offline Reinforcement Learning with Differential Privacy

Dan Qiao, Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, danqiao@ucsb.edu
Yu-Xiang Wang, Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, yuxiangw@cs.ucsb.edu

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulations suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.

1 Introduction

Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment1 through a static dataset gathered from some behavior policy $\mu$. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trade or finance [Zhang et al., 2020]), risky (autonomous driving [Sallab et al., 2017]) or illegal / unethical (healthcare [Raghu et al., 2017]).

In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history as an example: for each patient, at each time step, the patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage of the medicine, etc.), and finally there is a treatment outcome (whether the patient feels well, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be considered as $n$ (the number of patients) trajectories sampled from an MDP with horizon $H$ (the number of treatment steps). However, learning agents are known to implicitly memorize details of individual training data points verbatim [Carlini et al., 2019], even if they are irrelevant for learning [Brown et al., 2021], which makes offline RL models vulnerable to various privacy attacks.

Differential privacy (DP) [Dwork et al., 2006] is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe where any individual user is replaced, thereby preventing the aforementioned privacy risks. There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting [Vietri et al., 2020, Garcelon et al., 2021, Liao et al., 2021, Chowdhury and Zhou, 2021, Luyo et al., 2021].

1 The environment is usually characterized by a Markov Decision Process (MDP) in this paper.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Offline RL is arguably more practically relevant than online RL in applications with sensitive data.
For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy-enhancing techniques (e.g., DP) to patients was reported to further improve data access [Kim et al., 2017].

Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold. We design two new pessimism-based algorithms, DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under the linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms. We also perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI [Yin et al., 2022] as well as a popular baseline PEVI [Jin et al., 2021]. The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.

Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy [Balle et al., 2016, Xie et al., 2019]. Balle et al. [2016] privatized the first-visit Monte Carlo-Ridge Regression estimator by an output perturbation mechanism and Xie et al. [2019] used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus. There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy [Vietri et al., 2020, Chowdhury and Zhou, 2021, Ngo et al., 2022, Luyo et al., 2021] or local differential privacy [Garcelon et al., 2021, Liao et al., 2021, Luyo et al., 2021, Chowdhury and Zhou, 2021]. The offline setting introduces new challenges in DP as we cannot algorithmically enforce good exploration, but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B. Among non-private offline RL works, we build directly upon the non-private offline RL methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) [Yin and Wang, 2021b] and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) [Yin et al., 2022], as they give the strongest theoretical guarantees to date. We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.

A remark on technical novelty. Our algorithms involve substantial technical innovation over previous works on online DP-RL with the joint DP guarantee.2 Different from previous works, our DP-APVI (Algorithm 1) operates on a Bernstein-type pessimism, which requires our algorithm to deal with conditional variance using private statistics.
Besides, our DP-VAPVI (Algorithm 2) replaces the LSVI technique with variance-aware LSVI (also known as weighted ridge regression, which first appeared in [Zhou et al., 2021]). Our DP-VAPVI releases the conditional variance privately, and further applies weighted ridge regression privately. Both approaches ensure tighter instance-dependent bounds on the suboptimality of the learned policy.

2 Here we only compare our techniques (for offline RL) with the works for online RL under the joint DP guarantee, as both settings allow access to the raw data.

2 Problem Setup

Markov Decision Process. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H, d_1)$ [Sutton and Barto, 2018], where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. A non-stationary transition kernel $P_h: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$ maps each state-action pair $(s_h, a_h)$ to a probability distribution $P_h(\cdot \mid s_h, a_h)$, and $P_h$ can be different across time. Besides, $r_h: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ is the expected immediate reward satisfying $0 \le r_h \le 1$, $d_1$ is the initial state distribution and $H$ is the horizon.

A policy $\pi = (\pi_1, \ldots, \pi_H)$ assigns each state $s_h \in \mathcal{S}$ a probability distribution over actions according to the map $s_h \mapsto \pi_h(\cdot \mid s_h)$, $\forall h \in [H]$. A random trajectory $s_1, a_1, r_1, \ldots, s_H, a_H, r_H, s_{H+1}$ is generated according to $s_1 \sim d_1$, $a_h \sim \pi_h(\cdot \mid s_h)$, $r_h \sim r_h(s_h, a_h)$, $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$, $\forall h \in [H]$. For a tabular MDP, $\mathcal{S} \times \mathcal{A}$ is the discrete state-action space and $S := |\mathcal{S}|$, $A := |\mathcal{A}|$ are finite. In this work, we assume that $r$ is known.3 In addition, we denote the per-step marginal state-action occupancy $d^\pi_h(s, a)$ as $d^\pi_h(s, a) := \mathbb{P}[s_h = s \mid s_1 \sim d_1, \pi] \cdot \pi_h(a \mid s)$, which is the marginal state-action probability at time $h$.

Value function, Bellman (optimality) equations. The value function $V^\pi_h(\cdot)$ and Q-value function $Q^\pi_h(\cdot, \cdot)$ for any policy $\pi$ are defined as $V^\pi_h(s) = \mathbb{E}_\pi[\sum_{t=h}^{H} r_t \mid s_h = s]$ and $Q^\pi_h(s, a) = \mathbb{E}_\pi[\sum_{t=h}^{H} r_t \mid s_h = s, a_h = a]$, $\forall h, s, a \in [H] \times \mathcal{S} \times \mathcal{A}$. The performance is defined as $v^\pi := \mathbb{E}_{d_1}[V^\pi_1] = \mathbb{E}_{\pi, d_1}[\sum_{t=1}^{H} r_t]$. The Bellman (optimality) equations follow, $\forall h \in [H]$: $Q^\pi_h = r_h + P_h V^\pi_{h+1}$, $V^\pi_h = \mathbb{E}_{a \sim \pi_h}[Q^\pi_h]$, $Q^*_h = r_h + P_h V^*_{h+1}$, $V^*_h = \max_a Q^*_h(\cdot, a)$.

Linear MDP [Jin et al., 2020b]. An episodic MDP $(\mathcal{S}, \mathcal{A}, H, P, r)$ is called a linear MDP with known feature map $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ if there exist $H$ unknown signed measures $\nu_h \in \mathbb{R}^d$ over $\mathcal{S}$ and $H$ unknown reward vectors $\theta_h \in \mathbb{R}^d$ such that $P_h(s' \mid s, a) = \langle \phi(s, a), \nu_h(s') \rangle$ and $r_h(s, a) = \langle \phi(s, a), \theta_h \rangle$ for all $(h, s, a, s') \in [H] \times \mathcal{S} \times \mathcal{A} \times \mathcal{S}$. Without loss of generality, we assume $\|\phi(s, a)\|_2 \le 1$ and $\max(\|\nu_h(\mathcal{S})\|_2, \|\theta_h\|_2) \le \sqrt{d}$ for all $h, s, a \in [H] \times \mathcal{S} \times \mathcal{A}$. An important property of linear MDPs is that the value functions are linear in the feature map, which is summarized in Lemma E.14.

Offline setting and the goal. Offline RL requires the agent to find a policy $\pi$ in order to maximize the performance $v^\pi$, given only the episodic data $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [n]}^{h \in [H]}$ 4 rolled out from some fixed and possibly unknown behavior policy $\mu$, which means we cannot change $\mu$ and in particular we do not assume functional knowledge of $\mu$. In conclusion, based on the batch data $D$ and a target accuracy $\epsilon > 0$, the agent seeks to find a policy $\pi_{\mathrm{alg}}$ such that $v^* - v^{\pi_{\mathrm{alg}}} \le \epsilon$.
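As a concrete (non-private) illustration of the Bellman optimality recursion above, the following is a minimal NumPy sketch, not part of the paper's algorithms, that computes $Q^*_h$ and $V^*_h$ by backward induction for a small tabular MDP whose transition tensor and reward table are assumed to be given as arrays.

```python
import numpy as np

def optimal_values(P, r):
    """Backward Bellman optimality recursion for a finite-horizon tabular MDP.

    P: array of shape (H, S, A, S), where P[h, s, a] = P_h(. | s, a)
    r: array of shape (H, S, A), expected rewards in [0, 1]
    Returns Q* of shape (H, S, A) and V* of shape (H + 1, S) (with V*_{H+1} = 0).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        # Q*_h = r_h + P_h V*_{h+1};  V*_h = max_a Q*_h(., a)
        Q[h] = r[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q, V
```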
2.1 Assumptions in offline RL

In order to show that our privacy-preserving algorithms can generate a near-optimal policy, certain coverage assumptions are needed. In this section, we list the assumptions we use in this paper.

Assumptions for tabular setting.

Assumption 2.1 ([Liu et al., 2019]). There exists one optimal policy $\pi^*$ such that $\pi^*$ is fully covered by $\mu$, i.e. $\forall s_h, a_h \in \mathcal{S} \times \mathcal{A}$, $d^{\pi^*}_h(s_h, a_h) > 0$ only if $d^\mu_h(s_h, a_h) > 0$. Furthermore, we denote the trackable set as $C_h := \{(s_h, a_h): d^\mu_h(s_h, a_h) > 0\}$.

Assumption 2.1 is the weakest assumption needed for accurately learning the optimal value $v^*$: it only requires $\mu$ to trace the state-action space of one optimal policy ($\mu$ can be agnostic at other locations). Similar to [Yin and Wang, 2021b], we will use Assumption 2.1 for the tabular part of this paper, which enables a comparison between our sample complexity and the conclusion in [Yin and Wang, 2021b], whose algorithm serves as a non-private baseline.

Assumptions for linear setting. First, we define the expectation of the covariance matrix under the behavior policy $\mu$ for all time steps $h \in [H]$ as

$$\Sigma^p_h := \mathbb{E}_\mu\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right]. \quad (1)$$

As has been shown in [Wang et al., 2021a, Yin et al., 2022], learning a near-optimal policy from offline data requires coverage assumptions. Here in the linear setting, such coverage is characterized by the minimum eigenvalue of $\Sigma^p_h$. Similar to [Yin et al., 2022], we apply the following assumption for the sake of comparison.

Assumption 2.2 (Feature Coverage, Assumption 2 in [Wang et al., 2021a]). The data distributions $\mu$ satisfy the minimum eigenvalue condition: $\forall h \in [H]$, $\kappa_h := \lambda_{\min}(\Sigma^p_h) > 0$. Furthermore, we denote $\kappa = \min_h \kappa_h$.

3 This is due to the fact that the uncertainty of the reward function is dominated by that of the transition kernel in RL.
4 For clarity we use $n$ for tabular MDPs and $K$ for linear MDPs when referring to the sample complexity.

2.2 Differential Privacy in offline RL

In this work, we aim to design privacy-preserving algorithms for offline RL. We apply differential privacy as the formal notion of privacy. Below we revisit the definition of differential privacy.

Definition 2.3 (Differential Privacy [Dwork et al., 2006]). A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-differential privacy ($(\epsilon, \delta)$-DP) if for all neighboring datasets $U, U'$ that differ by one data point and for all possible events $E$ in the output range, it holds that $\mathbb{P}[M(U) \in E] \le e^{\epsilon} \cdot \mathbb{P}[M(U') \in E] + \delta$. When $\delta = 0$, we say pure DP, while for $\delta > 0$, we say approximate DP.

In the problem of offline RL, the dataset consists of several trajectories, therefore one data point in Definition 2.3 refers to one single trajectory. Hence the definition of differential privacy means that the difference in the distribution of the output policy resulting from replacing one trajectory in the dataset will be small. In other words, an adversary cannot infer much information about any single trajectory in the dataset from the output policy of the algorithm.

Remark 2.4. For a concrete motivating example, please refer to the first paragraph of the Introduction. We remark that our definition of DP is consistent with Joint DP and Local DP defined under the online RL setting, where JDP/LDP also cast each user as one trajectory and provide user-wise privacy protection. For detailed definitions and more discussions about JDP/LDP, please refer to Qiao and Wang [2023a].

Throughout the paper, we will use zCDP (defined below) as a surrogate for DP, since it enables a cleaner analysis for privacy composition and the Gaussian mechanism. The properties of zCDP (e.g., composition, conversion formula to DP) are deferred to Appendix E.3.

Definition 2.5 (zCDP [Dwork and Rothblum, 2016, Bun and Steinke, 2016]). A randomized mechanism $M$ satisfies $\rho$-Zero-Concentrated Differential Privacy ($\rho$-zCDP) if for all neighboring datasets $U, U'$ and all $\alpha \in (1, \infty)$, $D_\alpha(M(U) \,\|\, M(U')) \le \rho\alpha$, where $D_\alpha$ is the Renyi divergence [Van Erven and Harremos, 2014].
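The conversion from zCDP back to approximate DP used when reporting results is deferred to Appendix E.3 (not reproduced here); for intuition, the standard conversion of Bun and Steinke [2016] can be computed as below (a sketch; the function name is ours).

```python
import math

def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Standard conversion: a rho-zCDP mechanism satisfies
    (rho + 2*sqrt(rho*log(1/delta)), delta)-DP for any delta in (0, 1)."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example: rho = 1 zCDP corresponds to roughly (7.8, 1e-5)-DP.
print(zcdp_to_approx_dp(1.0, 1e-5))
```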
Finally, we go over the definition and privacy guarantee of the Gaussian mechanism.

Definition 2.6 (Gaussian Mechanism [Dwork et al., 2014]). Define the $\ell_2$ sensitivity of a function $f: \mathbb{N}^{\mathcal{X}} \mapsto \mathbb{R}^d$ as $\Delta_2(f) = \sup_{\text{neighboring } U, U'} \|f(U) - f(U')\|_2$. The Gaussian mechanism $M$ with noise level $\sigma$ is then given by $M(U) = f(U) + \mathcal{N}(0, \sigma^2 I_d)$.

Lemma 2.7 (Privacy guarantee of Gaussian mechanism [Dwork et al., 2014, Bun and Steinke, 2016]). Let $f: \mathbb{N}^{\mathcal{X}} \mapsto \mathbb{R}^d$ be an arbitrary $d$-dimensional function with $\ell_2$ sensitivity $\Delta_2$. Then for any $\rho > 0$, the Gaussian mechanism with parameter $\sigma^2 = \frac{\Delta_2^2}{2\rho}$ satisfies $\rho$-zCDP. In addition, for all $0 < \delta, \epsilon < 1$, the Gaussian mechanism with parameter $\sigma = \frac{\Delta_2\sqrt{2\log(1.25/\delta)}}{\epsilon}$ satisfies $(\epsilon, \delta)$-DP.

We emphasize that the privacy guarantee covers any input data. It does not require any distributional assumptions on the data. The RL-specific assumptions (e.g., linear MDP and coverage assumptions) are only used for establishing provable utility guarantees.

3 Results under tabular MDP: DP-APVI (Algorithm 1)

For reinforcement learning, the tabular MDP setting is the most well-studied setting and our first result applies to this regime. We begin with the construction of private counts.

Private Model-based Components. Given data $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [n]}^{h \in [H]}$, we denote by $n_{s_h, a_h} := \sum_{\tau=1}^{n} \mathbb{1}[s^\tau_h, a^\tau_h = s_h, a_h]$ the total count of visits to the $(s_h, a_h)$ pair at time $h$, and by $n_{s_h, a_h, s_{h+1}} := \sum_{\tau=1}^{n} \mathbb{1}[s^\tau_h, a^\tau_h, s^\tau_{h+1} = s_h, a_h, s_{h+1}]$ the total count of visits to the $(s_h, a_h, s_{h+1})$ pair at time $h$. Then, given the budget $\rho$ for zCDP, we add independent Gaussian noises to all the counts:

$$n'_{s_h, a_h} = n_{s_h, a_h} + \mathcal{N}(0, \sigma^2), \qquad n'_{s_h, a_h, s_{h+1}} = n_{s_h, a_h, s_{h+1}} + \mathcal{N}(0, \sigma^2), \qquad \sigma^2 = \frac{2H}{\rho}. \quad (2)$$

However, after adding noise, the noisy counts $n'$ may not satisfy $n'_{s_h, a_h} = \sum_{s_{h+1} \in \mathcal{S}} n'_{s_h, a_h, s_{h+1}}$. To address this problem, we choose the private counts of visitation numbers as the solution to the following optimization problem (here $E_\rho = 4\sqrt{H\log\frac{4HS^2A}{\delta}/\rho}$ is chosen as a high-probability uniform bound of the noises we add):

$$\{\widetilde{n}_{s_h, a_h, s'}\}_{s' \in \mathcal{S}} = \operatorname*{argmin}_{\{x_{s'}\}_{s' \in \mathcal{S}}} \max_{s' \in \mathcal{S}} \big|x_{s'} - n'_{s_h, a_h, s'}\big| \quad \text{s.t.} \quad \Big|\sum_{s' \in \mathcal{S}} x_{s'} - n'_{s_h, a_h}\Big| \le \frac{E_\rho}{2} \ \text{ and } \ x_{s'} \ge 0, \ \forall s' \in \mathcal{S}; \qquad \widetilde{n}_{s_h, a_h} = \sum_{s' \in \mathcal{S}} \widetilde{n}_{s_h, a_h, s'}. \quad (3)$$

Remark 3.1 (Some explanations). The optimization problem above serves as a post-processing step which will not affect the DP guarantee of our algorithm. Briefly speaking, (3) finds a set of noisy counts such that $\widetilde{n}_{s_h, a_h} = \sum_{s' \in \mathcal{S}} \widetilde{n}_{s_h, a_h, s'}$ and the estimation error of each $\widetilde{n}_{s_h, a_h}$ and $\widetilde{n}_{s_h, a_h, s'}$ is roughly $E_\rho$.5 In contrast, if we directly take the crude approach that $\widetilde{n}_{s_h, a_h, s_{h+1}} = n'_{s_h, a_h, s_{h+1}}$ and $\widetilde{n}_{s_h, a_h} = \sum_{s_{h+1} \in \mathcal{S}} \widetilde{n}_{s_h, a_h, s_{h+1}}$, we can only derive $|\widetilde{n}_{s_h, a_h} - n_{s_h, a_h}| \le \widetilde{O}(\sqrt{S}E_\rho)$ through concentration on the sum of $S$ i.i.d. Gaussian noises. In conclusion, solving the optimization problem (3) enables a tight analysis for the lower order term (the additional cost of privacy).

Remark 3.2 (Computational efficiency). The optimization problem (3) can be reformulated as:

$$\min \ t, \quad \text{s.t.} \quad |x_{s'} - n'_{s_h, a_h, s'}| \le t \ \text{ and } \ x_{s'} \ge 0, \ \forall s' \in \mathcal{S}, \qquad \Big|\sum_{s' \in \mathcal{S}} x_{s'} - n'_{s_h, a_h}\Big| \le \frac{E_\rho}{2}. \quad (4)$$

Note that (4) is a Linear Programming problem with $S + 1$ variables and $2S + 2$ linear constraints (one constraint on an absolute value is equivalent to two linear constraints), which can be solved efficiently by the simplex method [Ficken, 2015] or other provably efficient algorithms [Nemhauser and Wolsey, 1988]. Therefore, our Algorithm 1 is computationally friendly.

The private estimation of the transition kernel is defined as

$$\widetilde{P}_h(s' \mid s_h, a_h) = \frac{\widetilde{n}_{s_h, a_h, s'}}{\widetilde{n}_{s_h, a_h}} \quad (5)$$

if $\widetilde{n}_{s_h, a_h} > E_\rho$, and $\widetilde{P}_h(s' \mid s_h, a_h) = \frac{1}{S}$ otherwise.
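To make the construction (2)-(4) concrete, here is a minimal sketch (our code, not the authors') that adds the Gaussian noise for one $(h, s_h, a_h)$ pair and then solves the linear program (4) with scipy.optimize.linprog; variable and function names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def noisy_counts(n_sa, n_sas, H, rho, rng=None):
    """Step (2): perturb the visitation counts of one (h, s_h, a_h) pair with
    independent Gaussian noise of variance 2H/rho."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * H / rho)
    return n_sa + rng.normal(0, sigma), n_sas + rng.normal(0, sigma, size=n_sas.shape)

def postprocess_counts(n_sa_noisy, n_sas_noisy, E_rho):
    """Steps (3)-(4): solve  min t  s.t.  |x_s - n'_{s,a,s}| <= t,  x_s >= 0,
    |sum_s x_s - n'_{s,a}| <= E_rho/2, and return (tilde n_{s,a,s'}, tilde n_{s,a})."""
    S = len(n_sas_noisy)
    c = np.zeros(S + 1); c[-1] = 1.0                        # minimise t (last variable)
    A_ub, b_ub = [], []
    for s in range(S):                                      #  x_s - t <= n'_s ; -x_s - t <= -n'_s
        row = np.zeros(S + 1); row[s], row[-1] = 1.0, -1.0
        A_ub.append(row); b_ub.append(n_sas_noisy[s])
        row = np.zeros(S + 1); row[s], row[-1] = -1.0, -1.0
        A_ub.append(row); b_ub.append(-n_sas_noisy[s])
    row = np.zeros(S + 1); row[:S] = 1.0                    #  sum_s x_s <= n'_{s,a} + E_rho/2
    A_ub.append(row); b_ub.append(n_sa_noisy + E_rho / 2)
    row = np.zeros(S + 1); row[:S] = -1.0                   #  sum_s x_s >= n'_{s,a} - E_rho/2
    A_ub.append(row); b_ub.append(-(n_sa_noisy - E_rho / 2))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * S + [(0, None)])
    x = res.x[:S]
    return x, x.sum()
```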
Remark 3.3. Different from the transition kernel estimates in previous works [Vietri et al., 2020, Chowdhury and Zhou, 2021] that may not be a distribution, we have to ensure that ours is a probability distribution, because our Bernstein-type pessimism (line 5 in Algorithm 1) needs to take a variance over this transition kernel estimate. The intuition behind the construction of our private transition kernel is that, for those state-action pairs with $\widetilde{n}_{s_h, a_h} \le E_\rho$, we cannot distinguish whether the non-zero private count comes from noise or from actual visitation. Therefore we only take the empirical estimate at the state-action pairs with sufficiently large $\widetilde{n}_{s_h, a_h}$.

5 This conclusion is summarized in Lemma C.3.

Algorithm 1 Differentially Private Adaptive Pessimistic Value Iteration (DP-APVI)

1: Input: Offline dataset $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau, h=1}^{n, H}$. Reward function $r$. Constants $C_1 = 2$, $C_2 = 16$, $\widetilde{C} > 1$, failure probability $\delta$, budget for zCDP $\rho$.
2: Initialization: Calculate $\widetilde{n}_{s_h, a_h}$, $\widetilde{n}_{s_h, a_h, s_{h+1}}$ as in (3) and $\widetilde{P}_h(s_{h+1} \mid s_h, a_h)$ as in (5). $\widetilde{V}_{H+1}(\cdot) \leftarrow 0$. $E_\rho \leftarrow 4\sqrt{H\log\frac{4HS^2A}{\delta}/\rho}$. $\iota \leftarrow \log(HSA/\delta)$.
3: for $h = H, H-1, \ldots, 1$ do
4:   $\widetilde{Q}_h(\cdot, \cdot) \leftarrow r_h(\cdot, \cdot) + (\widetilde{P}_h \widetilde{V}_{h+1})(\cdot, \cdot)$
5:   $\forall s_h, a_h$, let $\Gamma_h(s_h, a_h) \leftarrow C_1\sqrt{\frac{\mathrm{Var}_{\widetilde{P}_{s_h, a_h}}(\widetilde{V}_{h+1}) \cdot \iota}{\widetilde{n}_{s_h, a_h} - E_\rho}} + \frac{C_2 S H E_\rho \, \iota}{\widetilde{n}_{s_h, a_h}}$ if $\widetilde{n}_{s_h, a_h} > E_\rho$, otherwise $\widetilde{C}H$.
6:   $\widehat{Q}^p_h(\cdot, \cdot) \leftarrow \widetilde{Q}_h(\cdot, \cdot) - \Gamma_h(\cdot, \cdot)$.
7:   $Q_h(\cdot, \cdot) \leftarrow \min\{\widehat{Q}^p_h(\cdot, \cdot), H - h + 1\}^+$.
8:   $\forall s_h$, let $\widehat{\pi}_h(\cdot \mid s_h) \leftarrow \operatorname*{argmax}_{\pi_h}\langle Q_h(s_h, \cdot), \pi_h(\cdot \mid s_h)\rangle$ and $\widetilde{V}_h(s_h) \leftarrow \langle Q_h(s_h, \cdot), \widehat{\pi}_h(\cdot \mid s_h)\rangle$.
9: end for
10: Output: $\{\widehat{\pi}_h\}$.

Algorithmic design. Our algorithmic design originates from the idea of pessimism, which holds a conservative view towards locations with high uncertainty and prefers the locations we have more confidence about. Based on the Bernstein-type pessimism in APVI [Yin and Wang, 2021b], we design a similar pessimistic algorithm with private counts to ensure differential privacy. If we replace $\widetilde{n}$ and $\widetilde{P}$ with $n$ and $\widehat{P}$,6 then our DP-APVI (Algorithm 1) degenerates to APVI. Compared to the pessimism defined in APVI, our pessimistic penalty has an additional term $\widetilde{O}\big(\frac{SHE_\rho}{\widetilde{n}_{s_h, a_h}}\big)$, which accounts for the additional pessimism due to our application of private statistics.

We state our main theorem about DP-APVI below; the proof sketch is deferred to Appendix C.1 and the detailed proof is deferred to Appendix C due to the space limit.

Theorem 3.4. DP-APVI (Algorithm 1) satisfies $\rho$-zCDP. Furthermore, under Assumption 2.1, denote $d_m := \min_{h \in [H]}\{d^\mu_h(s_h, a_h): d^\mu_h(s_h, a_h) > 0\}$. For any $0 < \delta < 1$, there exists a constant $c_1 > 0$ such that when $n > c_1 \max\{H^2, E_\rho\}\,\iota / d_m$ ($\iota = \log(HSA/\delta)$), with probability $1 - \delta$, the output policy $\widehat{\pi}$ of DP-APVI satisfies

$$0 \le v^* - v^{\widehat{\pi}} \le \widetilde{O}\left(\sum_{h=1}^{H}\sum_{(s_h, a_h) \in C_h} d^{\pi^*}_h(s_h, a_h)\sqrt{\frac{\mathrm{Var}_{P_h(\cdot \mid s_h, a_h)}(V^*_{h+1}(\cdot))\,\iota}{n\, d^\mu_h(s_h, a_h)}}\right) + \widetilde{O}\left(\frac{H^3 + SH^2E_\rho}{n\, d_m}\right),$$

where $\widetilde{O}$ hides constants and polylog terms, and $E_\rho = 4\sqrt{H\log\frac{4HS^2A}{\delta}/\rho}$.

Comparison to non-private counterpart APVI [Yin and Wang, 2021b]. According to Theorem 4.1 in [Yin and Wang, 2021b], the sub-optimality bound of APVI is: for large enough $n$, with high probability, the output $\widehat{\pi}$ satisfies

$$0 \le v^* - v^{\widehat{\pi}} \le \widetilde{O}\left(\sum_{h=1}^{H}\sum_{(s_h, a_h) \in C_h} d^{\pi^*}_h(s_h, a_h)\sqrt{\frac{\mathrm{Var}_{P_h(\cdot \mid s_h, a_h)}(V^*_{h+1}(\cdot))}{n\, d^\mu_h(s_h, a_h)}}\right).$$

Compared to our Theorem 3.4, the additional sub-optimality bound due to differential privacy is $\widetilde{O}\big(\frac{SH^{5/2}}{n\, d_m\sqrt{\rho}}\big) = \widetilde{O}\big(\frac{SH^{5/2}}{n\, d_m\epsilon}\big)$.7 In the most popular regime where the privacy budget $\rho$ or $\epsilon$ is a constant, the additional term due to differential privacy appears as a lower order term, hence becomes negligible as the sample complexity $n$ becomes large.

6 The non-private empirical estimate, defined as (15) in Appendix C.
7 Here we apply the second part of Lemma 2.7 to achieve $(\epsilon, \delta)$-DP; the notation $\widetilde{O}$ also absorbs $\log\frac{1}{\delta}$ (only here $\delta$ denotes the privacy budget instead of the failure probability).
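For readers who prefer code, the following is a minimal NumPy sketch (ours, with hypothetical array shapes and a simplified greedy output) of the backward pass of Algorithm 1, assuming the private counts from (3) and the known reward table are given.

```python
import numpy as np

def dp_apvi_backward(n_sa, n_sas, r, E_rho, C1=2.0, C2=16.0, C_tilde=2.0, iota=1.0):
    """Sketch of the backward pass of Algorithm 1 (DP-APVI).

    n_sa: (H, S, A) private counts from (3);  n_sas: (H, S, A, S) private counts;
    r: (H, S, A) known rewards;  iota: log(HSA/delta).  Returns a greedy policy."""
    H, S, A = r.shape
    # Private transition estimate (5): empirical ratio where counts are trustworthy,
    # uniform distribution elsewhere.
    P = np.full((H, S, A, S), 1.0 / S)
    mask = n_sa > E_rho
    P[mask] = n_sas[mask] / n_sa[mask][:, None]
    V = np.zeros(S)                                    # \tilde V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        mean = P[h] @ V                                # (\tilde P_h \tilde V_{h+1})(s, a)
        var = np.maximum(P[h] @ V**2 - mean**2, 0.0)   # Var_{\tilde P}(\tilde V_{h+1})
        Q = r[h] + mean                                # \tilde Q_h
        safe_n = np.maximum(n_sa[h], 1e-12)
        Gamma = np.where(                              # Bernstein-type private pessimism (line 5)
            n_sa[h] > E_rho,
            C1 * np.sqrt(var * iota / np.maximum(n_sa[h] - E_rho, 1e-12))
            + C2 * S * H * E_rho * iota / safe_n,
            C_tilde * H)
        Q_clipped = np.clip(Q - Gamma, 0.0, H - h)     # min{., H - h + 1}^+ (0-indexed h)
        policy[h] = Q_clipped.argmax(axis=1)           # greedy \hat\pi_h
        V = Q_clipped.max(axis=1)                      # \tilde V_h
    return policy
```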
Comparison to Hoeffding type pessimism. We can simply revise our algorithm by using a Hoeffding-type pessimism, which replaces the pessimism in line 5 with $C_1 H\sqrt{\frac{\iota}{\widetilde{n}_{s_h, a_h} - E_\rho}} + \frac{C_2 S H E_\rho\, \iota}{\widetilde{n}_{s_h, a_h}}$. Then, with a similar proof schedule, we can arrive at a sub-optimality bound stating that with high probability,

$$0 \le v^* - v^{\widehat{\pi}} \le \widetilde{O}\left(\sum_{h=1}^{H}\sum_{(s_h, a_h) \in C_h} H \cdot d^{\pi^*}_h(s_h, a_h)\sqrt{\frac{1}{n\, d^\mu_h(s_h, a_h)}}\right) + \widetilde{O}\left(\frac{SH^2E_\rho}{n\, d_m}\right).$$

Compared to our Theorem 3.4, our bound is tighter because we express the dominant term via the system quantities instead of with an explicit dependence on $H$ (and $\mathrm{Var}_{P_h(\cdot \mid s_h, a_h)}(V^*_{h+1}(\cdot)) \le H^2$). In addition, we highlight that according to Theorem G.1 in [Yin and Wang, 2021b], our main term nearly matches the non-private minimax lower bound. For more detailed discussions about our main term and how it subsumes other optimal learning bounds, we refer readers to [Yin and Wang, 2021b].

Apply Laplace Mechanism to achieve pure DP. To achieve pure DP instead of $\rho$-zCDP, we can simply replace the Gaussian mechanism with the Laplace mechanism (defined as Definition E.19). Given a privacy budget $\epsilon$ for pure DP, since the $\ell_1$ sensitivity of $\{n_{s_h, a_h}\} \cup \{n_{s_h, a_h, s_{h+1}}\}$ is $\Delta_1 = 4H$, we can add independent Laplace noises $\mathrm{Lap}(\frac{4H}{\epsilon})$ to each count to achieve $\epsilon$-DP due to Lemma E.20. Then, by using $E_\epsilon = \widetilde{O}(\frac{H}{\epsilon})$ instead of $E_\rho$ and keeping everything else ((3), (5) and Algorithm 1) the same, we can reach a similar result to Theorem 3.4 with the same proof schedule. The only difference is that here the additional learning bound is $\widetilde{O}(\frac{SH^3}{n\, d_m\epsilon})$, which still appears as a lower order term.

4 Results under linear MDP: DP-VAPVI (Algorithm 2)

In large MDPs, the technique of function approximation is widely applied to address computational issues, and the linear MDP is a concrete model for studying linear function approximation. Our second result applies to the linear MDP setting. Generally speaking, function approximation reduces the dimensionality of the private releases compared to tabular MDPs. We begin with private counts.

Private Model-based Components. Given the two datasets $D$ and $D'$ (both from $\mu$) as in Algorithm 2, we can apply variance-aware pessimistic value iteration to learn a near-optimal policy as in VAPVI [Yin et al., 2022]. To ensure differential privacy, we add independent Gaussian noises to the $5H$ statistics as in DP-VAPVI (Algorithm 2) below. Since there are $5H$ statistics, by the adaptive composition of zCDP (Lemma E.17), it suffices to keep each count $\rho_0$-zCDP, where $\rho_0 = \frac{\rho}{5H}$. In DP-VAPVI, we use $\phi_1, \phi_2, \phi_3, K_1, K_2$ 8 to denote the noises we add. For all $\phi_i$, we directly apply the Gaussian mechanism. For the $K_i$, in addition to the noise matrix $\frac{1}{2}(Z + Z^\top)$, we also add $\frac{E}{2}I_d$ to ensure that all $K_i$ are positive definite with high probability (the detailed definitions of $E$, $L$ can be found in Appendix A).

Below we show the algorithmic design of DP-VAPVI (Algorithm 2). For the offline dataset, we divide it into two independent parts of equal length: $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [K]}^{h \in [H]}$ and $D' = \{(\bar{s}^\tau_h, \bar{a}^\tau_h, \bar{r}^\tau_h, \bar{s}^\tau_{h+1})\}_{\tau \in [K]}^{h \in [H]}$, one for estimating variances and the other for calculating Q-values.
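As an illustration of how the $5H$ private releases could be generated, here is our own sketch (function and variable names are ours, and the spectral shift $E$ is passed in as a given quantity rather than computed from the exact constants, which are deferred to Appendix A).

```python
import numpy as np

def sample_dp_vapvi_noise(H, d, rho, E, rng=None):
    """Sketch of the noise DP-VAPVI draws before its backward pass.

    Each of the 5H released statistics gets budget rho0 = rho / (5H).
    phi_1 perturbs regression targets bounded by H^2 (l2 sensitivity 2H^2, hence
    variance (2H^2)^2 / (2 rho0) = 2H^4 / rho0); phi_2 and phi_3 perturb targets
    bounded by H (variance 2H^2 / rho0).  K_1, K_2 are symmetric Gaussian matrices
    shifted by (E/2) I_d so the noisy Gram matrices stay positive definite w.h.p.
    Per footnote 8, one independent draw is taken for every step h."""
    rng = np.random.default_rng() if rng is None else rng
    rho0 = rho / (5 * H)
    phi1 = rng.normal(0.0, np.sqrt(2 * H**4 / rho0), size=(H, d))
    phi2 = rng.normal(0.0, np.sqrt(2 * H**2 / rho0), size=(H, d))
    phi3 = rng.normal(0.0, np.sqrt(2 * H**2 / rho0), size=(H, d))

    def sym_noise():
        Z = rng.normal(0.0, np.sqrt(1.0 / (4 * rho0)), size=(d, d))
        return 0.5 * (Z + Z.T) + 0.5 * E * np.eye(d)

    K1 = np.stack([sym_noise() for _ in range(H)])
    K2 = np.stack([sym_noise() for _ in range(H)])
    return phi1, phi2, phi3, K1, K2
```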
Estimating conditional variance. The first part (lines 4 to 8) aims to estimate the conditional variance of $\widetilde{V}_{h+1}$ via the definition of variance:

$$[\mathrm{Var}_h \widetilde{V}_{h+1}](s, a) = [P_h(\widetilde{V}_{h+1})^2](s, a) - \big([P_h \widetilde{V}_{h+1}](s, a)\big)^2.$$

For the first term, by the definition of linear MDP, it holds that $\big[P_h \widetilde{V}^2_{h+1}\big](s, a) = \phi(s, a)^\top \int_{\mathcal{S}} \widetilde{V}^2_{h+1}(s')\, d\nu_h(s') = \big\langle \phi, \int_{\mathcal{S}} \widetilde{V}^2_{h+1}(s')\, d\nu_h(s')\big\rangle$. We can estimate $\int_{\mathcal{S}} \widetilde{V}^2_{h+1}(s')\, d\nu_h(s')$ by applying ridge regression. Below is the output of ridge regression with the raw statistics without noise:

$$\operatorname*{argmin}_{\beta \in \mathbb{R}^d} \sum_{k=1}^{K}\Big[\big\langle \phi(\bar{s}^k_h, \bar{a}^k_h), \beta\big\rangle - \widetilde{V}^2_{h+1}(\bar{s}^k_{h+1})\Big]^2 + \lambda\|\beta\|_2^2 \;=\; \bar{\Sigma}_h^{-1}\sum_{k=1}^{K}\phi(\bar{s}^k_h, \bar{a}^k_h)\,\widetilde{V}^2_{h+1}(\bar{s}^k_{h+1}),$$

where the definition of $\bar{\Sigma}_h$ can be found in Appendix A. Instead of using the raw statistics, we replace them with private ones with Gaussian noises as in line 5. The second term is estimated similarly in line 6. The final estimator is defined as in line 8: $\widetilde{\sigma}_h(\cdot, \cdot)^2 = \max\{1, \widetilde{\mathrm{Var}}_h\widetilde{V}_{h+1}(\cdot, \cdot)\}$.9

8 We need to add noise to each of the $5H$ counts; therefore for $\phi_1$, we actually sample $H$ i.i.d. samples $\phi_{1,h}$, $h = 1, \ldots, H$, from the distribution of $\phi_1$. Then we add $\phi_{1,h}$ to $\sum_{\tau=1}^{K}\phi(\bar{s}^\tau_h, \bar{a}^\tau_h)\widetilde{V}_{h+1}(\bar{s}^\tau_{h+1})^2$, $\forall h \in [H]$. For simplicity, we use $\phi_1$ to represent all the $\phi_{1,h}$. The procedure applied to the other $4H$ statistics is similar.

Algorithm 2 Differentially Private Variance-Aware Pessimistic Value Iteration (DP-VAPVI)

1: Input: Datasets $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau, h=1}^{K, H}$ and $D' = \{(\bar{s}^\tau_h, \bar{a}^\tau_h, \bar{r}^\tau_h, \bar{s}^\tau_{h+1})\}_{\tau, h=1}^{K, H}$. Budget for zCDP $\rho$. Failure probability $\delta$. Universal constant $C$.
2: Initialization: Set $\rho_0 \leftarrow \frac{\rho}{5H}$, $\widetilde{V}_{H+1}(\cdot) \leftarrow 0$. Sample $\phi_1 \sim \mathcal{N}\big(0, \frac{2H^4}{\rho_0}I_d\big)$, $\phi_2, \phi_3 \sim \mathcal{N}\big(0, \frac{2H^2}{\rho_0}I_d\big)$, and $K_1, K_2 \leftarrow \frac{1}{2}(Z + Z^\top) + \frac{E}{2}I_d$, where $Z_{i,j} \sim \mathcal{N}\big(0, \frac{1}{4\rho_0}\big)$ (i.i.d.) and $E$ is defined in Appendix A ($E = \widetilde{O}(\sqrt{Hd/\rho})$).
3: for $h = H, H-1, \ldots, 1$ do
4:   Set $\widetilde{\Sigma}_h \leftarrow \sum_{\tau=1}^{K}\phi(\bar{s}^\tau_h, \bar{a}^\tau_h)\phi(\bar{s}^\tau_h, \bar{a}^\tau_h)^\top + \lambda I + K_1$
5:   Set $\widetilde{\beta}_h \leftarrow \widetilde{\Sigma}_h^{-1}\big[\sum_{\tau=1}^{K}\phi(\bar{s}^\tau_h, \bar{a}^\tau_h)\widetilde{V}_{h+1}(\bar{s}^\tau_{h+1})^2 + \phi_1\big]$
6:   Set $\widetilde{\theta}_h \leftarrow \widetilde{\Sigma}_h^{-1}\big[\sum_{\tau=1}^{K}\phi(\bar{s}^\tau_h, \bar{a}^\tau_h)\widetilde{V}_{h+1}(\bar{s}^\tau_{h+1}) + \phi_2\big]$
7:   Set $\widetilde{\mathrm{Var}}_h\widetilde{V}_{h+1}(\cdot, \cdot) \leftarrow \big\langle\phi(\cdot, \cdot), \widetilde{\beta}_h\big\rangle_{[0, (H-h+1)^2]} - \big[\big\langle\phi(\cdot, \cdot), \widetilde{\theta}_h\big\rangle_{[0, H-h+1]}\big]^2$
8:   Set $\widetilde{\sigma}_h(\cdot, \cdot)^2 \leftarrow \max\{1, \widetilde{\mathrm{Var}}_h\widetilde{V}_{h+1}(\cdot, \cdot)\}$
9:   Set $\widetilde{\Lambda}_h \leftarrow \sum_{\tau=1}^{K}\phi(s^\tau_h, a^\tau_h)\phi(s^\tau_h, a^\tau_h)^\top/\widetilde{\sigma}^2_h(s^\tau_h, a^\tau_h) + \lambda I + K_2$
10:  Set $\widetilde{w}_h \leftarrow \widetilde{\Lambda}_h^{-1}\big[\sum_{\tau=1}^{K}\phi(s^\tau_h, a^\tau_h)\big(r^\tau_h + \widetilde{V}_{h+1}(s^\tau_{h+1})\big)/\widetilde{\sigma}^2_h(s^\tau_h, a^\tau_h) + \phi_3\big]$
11:  Set $\Gamma_h(\cdot, \cdot) \leftarrow C\sqrt{d}\,\big(\phi(\cdot, \cdot)^\top\widetilde{\Lambda}_h^{-1}\phi(\cdot, \cdot)\big)^{1/2} + \frac{D}{K}$
12:  Set $\bar{Q}_h(\cdot, \cdot) \leftarrow \phi(\cdot, \cdot)^\top\widetilde{w}_h - \Gamma_h(\cdot, \cdot)$
13:  Set $\widehat{Q}_h(\cdot, \cdot) \leftarrow \min\{\bar{Q}_h(\cdot, \cdot), H - h + 1\}^+$
14:  Set $\widehat{\pi}_h(\cdot \mid \cdot) \leftarrow \operatorname*{argmax}_{\pi_h}\langle\widehat{Q}_h(\cdot, \cdot), \pi_h(\cdot \mid \cdot)\rangle_{\mathcal{A}}$, $\widetilde{V}_h(\cdot) \leftarrow \max_{\pi_h}\langle\widehat{Q}_h(\cdot, \cdot), \pi_h(\cdot \mid \cdot)\rangle_{\mathcal{A}}$
15: end for
16: Output: $\{\widehat{\pi}_h\}_{h=1}^{H}$.

Variance-weighted LSVI. Instead of directly applying LSVI [Jin et al., 2021], we can solve the variance-weighted LSVI (line 10). The result of variance-weighted LSVI with non-private statistics is shown below:

$$\operatorname*{argmin}_{w \in \mathbb{R}^d}\; \lambda\|w\|_2^2 + \sum_{k=1}^{K}\frac{\Big[\big\langle\phi(s^k_h, a^k_h), w\big\rangle - r^k_h - \widetilde{V}_{h+1}(s^k_{h+1})\Big]^2}{\widetilde{\sigma}^2_h(s^k_h, a^k_h)} \;=\; \widehat{\Lambda}_h^{-1}\sum_{k=1}^{K}\frac{\phi(s^k_h, a^k_h)\big[r^k_h + \widetilde{V}_{h+1}(s^k_{h+1})\big]}{\widetilde{\sigma}^2_h(s^k_h, a^k_h)},$$

where the definition of $\widehat{\Lambda}_h$ can be found in Appendix A. For the sake of differential privacy, we use private statistics instead and derive $\widetilde{w}_h$ as in line 10.

Our private pessimism. Notice that if we remove all the Gaussian noises we add, our DP-VAPVI (Algorithm 2) degenerates to VAPVI [Yin et al., 2022]. We design a similar pessimistic penalty using private statistics (line 11), with an additional $\frac{D}{K}$ accounting for the extra pessimism due to DP.
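To make lines 9-11 concrete, the following is a small sketch (our code; names are hypothetical, and the noise terms and private variance estimates are simply passed in) of the privatized variance-weighted ridge regression together with the resulting elliptical bonus.

```python
import numpy as np

def private_weighted_lsvi(Phi, y, sigma2, K2_noise, phi3_noise, D_over_K, C=1.0, lam=1.0):
    """Sketch of lines 9-11 of Algorithm 2 for one step h.

    Phi: (K, d) features phi(s_h^k, a_h^k);  y: (K,) targets r_h^k + V_{h+1}(s_{h+1}^k);
    sigma2: (K,) private variance estimates from line 8;  K2_noise: (d, d) symmetric
    Gaussian noise matrix already shifted by (E/2) I_d;  phi3_noise: (d,) Gaussian noise.
    Returns the private weights w_h and the pessimism Gamma_h at the data points."""
    K, d = Phi.shape
    W = Phi / sigma2[:, None]                           # rows reweighted by 1 / sigma^2
    Lambda = W.T @ Phi + lam * np.eye(d) + K2_noise      # private \tilde Lambda_h (line 9)
    w = np.linalg.solve(Lambda, W.T @ y + phi3_noise)    # private \tilde w_h (line 10)
    Lambda_inv = np.linalg.inv(Lambda)
    bonus = np.sqrt(np.einsum('kd,de,ke->k', Phi, Lambda_inv, Phi))
    Gamma = C * np.sqrt(d) * bonus + D_over_K            # line 11
    return w, Gamma
```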
Main theorem. We state our main theorem about DP-VAPVI below; the proof sketch is deferred to Appendix D.1 and the detailed proof is deferred to Appendix D due to the space limit. Note that the quantities $M_i$, $L$, $E$ can be found in Appendix A; briefly, $L = \widetilde{O}(\sqrt{H^3d/\rho})$ and $E = \widetilde{O}(\sqrt{Hd/\rho})$. For the sample complexity lower bound, within the practical regime where the privacy budget is not very small, $\max\{M_i\}$ is dominated by $\max\{\widetilde{O}(H^{12}d^3/\kappa^5), \widetilde{O}(H^{14}d/\kappa^5)\}$, which also appears in the sample complexity lower bound of VAPVI [Yin et al., 2022]. The $\sigma^2_V(s, a)$ in Theorem 4.1 is defined as $\max\{1, \mathrm{Var}_{P_h}(V)(s, a)\}$ for any $V$.

Theorem 4.1. DP-VAPVI (Algorithm 2) satisfies $\rho$-zCDP. Furthermore, let $K$ be the number of episodes. Under the condition that $K > \max\{M_1, M_2, M_3, M_4\}$ and $\sqrt{d} > \xi$, where $\xi := \sup_{V \in [0, H],\, s' \sim P_h(s, a),\, h \in [H]}\big|r_h + V(s') - (\mathcal{T}_h V)(s, a)\big|$, for any $0 < \lambda < \kappa$, with probability $1 - \delta$, for all policies $\pi$ simultaneously, the output $\widehat{\pi}$ of DP-VAPVI satisfies

$$v^\pi - v^{\widehat{\pi}} \le \widetilde{O}\left(\sqrt{d}\sum_{h=1}^{H}\mathbb{E}_\pi\Big[\big(\phi(\cdot, \cdot)^\top\Lambda_h^{-1}\phi(\cdot, \cdot)\big)^{1/2}\Big]\right) + \frac{2DH}{K}, \quad (9)$$

where $\Lambda_h = \sum_{k=1}^{K}\frac{\phi(s^k_h, a^k_h)\phi(s^k_h, a^k_h)^\top}{\sigma^2_{\widetilde{V}_{h+1}}(s^k_h, a^k_h)} + \lambda I_d$, $D = \widetilde{O}\big(H^2L(\sqrt{d}/\kappa^{3/2} + H^3\sqrt{d})\big)$, and $\widetilde{O}$ hides constants and polylog terms. In particular, defining $\Lambda^*_h = \sum_{k=1}^{K}\frac{\phi(s^k_h, a^k_h)\phi(s^k_h, a^k_h)^\top}{\sigma^2_{V^*_{h+1}}(s^k_h, a^k_h)} + \lambda I_d$, we have with probability $1 - \delta$,

$$0 \le v^* - v^{\widehat{\pi}} \le \widetilde{O}\left(\sqrt{d}\sum_{h=1}^{H}\mathbb{E}_{\pi^*}\Big[\big(\phi(\cdot, \cdot)^\top(\Lambda^*_h)^{-1}\phi(\cdot, \cdot)\big)^{1/2}\Big]\right) + \frac{2DH}{K}. \quad (10)$$

9 The $\max\{1, \cdot\}$ operator here is for a technical reason only: we want a lower bound for each variance estimate.

Comparison to non-private counterpart VAPVI [Yin et al., 2022]. Plugging in the definitions of $L$, $E$ (Appendix A), under the meaningful case that the privacy budget is not very large, $DH$ is dominated by $\widetilde{O}\big(H^{11/2}\sqrt{d}/\kappa^{3/2}\sqrt{\rho}\big)$. According to Theorem 3.2 in [Yin et al., 2022], the sub-optimality bound of VAPVI is: for sufficiently large $K$, with high probability, the output $\widehat{\pi}$ satisfies

$$0 \le v^* - v^{\widehat{\pi}} \le \widetilde{O}\left(\sqrt{d}\sum_{h=1}^{H}\mathbb{E}_{\pi^*}\Big[\big(\phi(\cdot, \cdot)^\top(\Lambda^*_h)^{-1}\phi(\cdot, \cdot)\big)^{1/2}\Big]\right).$$

Compared to our Theorem 4.1, the additional sub-optimality bound due to differential privacy is $\frac{2DH}{K} = \widetilde{O}\big(\frac{H^{11/2}\sqrt{d}}{\kappa^{3/2}\sqrt{\rho}\,K}\big) = \widetilde{O}\big(\frac{H^{11/2}\sqrt{d}}{\kappa^{3/2}\epsilon K}\big)$.10 In the most popular regime where the privacy budget $\rho$ or $\epsilon$ is a constant, the additional term due to differential privacy also appears as a lower order term.

Instance-dependent sub-optimality bound. Similar to DP-APVI (Algorithm 1), our DP-VAPVI (Algorithm 2) also enjoys an instance-dependent sub-optimality bound. First, the main term in (10) improves on PEVI [Jin et al., 2021] by $O(\sqrt{d})$ in feature dependence. Also, our main term admits no explicit dependence on $H$, thus improving the sub-optimality bound of PEVI in horizon dependence. For more detailed discussions about our main term, we refer readers to [Yin et al., 2022].

Private and safe policy improvement. In addition to the sub-optimality bound (10), we have the so-called oracle inequality (9). Therefore, the performance $v^{\widehat{\pi}}$ can be lower bounded by $v^\pi - \widetilde{O}\big(\sqrt{d}\sum_{h=1}^{H}\mathbb{E}_\pi\big[(\phi(\cdot, \cdot)^\top\Lambda_h^{-1}\phi(\cdot, \cdot))^{1/2}\big]\big) - \frac{2DH}{K}$. When choosing $\pi$ to be the optimal policy in the neighborhood of the behavior policy $\mu$, our DP-VAPVI (Algorithm 2) sheds light on safe policy improvement with a differential privacy guarantee.

5 Tightness of our results

We believe our bounds for offline RL with DP are tight. To the best of our knowledge, APVI and VAPVI provide the tightest bounds under tabular MDPs and linear MDPs, respectively. The sub-optimality bounds of our algorithms match these two in the main term, with some lower order additional terms. The leading terms are known to match multiple information-theoretic lower bounds for offline RL simultaneously (this was illustrated in Yin and Wang [2021b], Yin et al. [2022]); for this reason our bound cannot be improved in general. For the lower order terms, the dependence on the sample complexity $n$ and the privacy budget $\epsilon$, namely $\widetilde{O}(\frac{1}{n\epsilon})$, is optimal since policy learning is a special case of ERM problems and such dependence is optimal in DP-ERM. In addition, we believe the dependence on the other parameters ($H$, $S$, $A$, $d$) in the lower order term is tight due to our special tricks such as (3) and Lemma D.6.

10 Here we apply the second part of Lemma 2.7 to achieve $(\epsilon, \delta)$-DP; the notation $\widetilde{O}$ also absorbs $\log\frac{1}{\delta}$ (only here $\delta$ denotes the privacy budget instead of the failure probability).
6 Simulations

In this section, we carry out simulations to evaluate the performance of our DP-VAPVI (Algorithm 2), and compare it with its non-private counterpart VAPVI [Yin et al., 2022] and another pessimism-based algorithm, PEVI [Jin et al., 2021], which does not have a privacy guarantee.

Experimental setting. We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in [Min et al., 2021, Yin et al., 2022] but with some modifications.11 For details of the linear MDP setting, please refer to Appendix F. The two MDP instances we use both have horizon $H = 20$. We compare the different algorithms in Figure 1(a), while in Figure 1(b) we compare our DP-VAPVI with different privacy budgets. When doing the empirical evaluation, we do not split the data for DP-VAPVI or VAPVI, and for DP-VAPVI we run the simulation 5 times and take the average performance.

[Figure 1: two panels plotting the suboptimality gap against the number of episodes. (a) Comparison of different algorithms (PEVI, VAPVI, and DP-VAPVI with privacy budgets 1 and 25), $H = 20$. (b) DP-VAPVI with different privacy budgets (0.1, 1, 5, 25) versus VAPVI, $H = 20$.]

Figure 1: Comparison between the performance of PEVI, VAPVI and DP-VAPVI (with different privacy budgets) under the linear MDP example described above. In each figure, the y-axis represents the suboptimality gap $v^* - v^{\widehat{\pi}}$ while the x-axis denotes the number of episodes $K$. The horizons are fixed to be $H = 20$. The number of episodes takes values from 5 to 1000.

Results and discussions. From Figure 1, we can observe that DP-VAPVI (Algorithm 2) performs slightly worse than its non-private version VAPVI [Yin et al., 2022]. This is due to the fact that we add Gaussian noise to each count. However, as the size of the dataset grows, the performance of DP-VAPVI converges to that of VAPVI, which supports our theoretical conclusion that the cost of privacy only appears as lower order terms. For DP-VAPVI with a larger privacy budget, the scale of the noise is smaller, and thus the performance is closer to that of VAPVI, as shown in Figure 1(b). Furthermore, in most cases, DP-VAPVI still outperforms PEVI, which does not have a privacy guarantee. This arises from our privatization of variance-aware LSVI instead of LSVI.

7 Conclusion and future works

In this work, we take the first steps towards the well-motivated task of designing private offline RL algorithms. We propose algorithms for both tabular MDPs and linear MDPs, and show that they enjoy instance-dependent sub-optimality bounds while guaranteeing differential privacy (either zCDP or pure DP). Our results highlight that the cost of privacy only appears as lower order terms, which thus become negligible as the number of samples grows large. Future extensions are numerous. We believe the techniques in our algorithms (privatization of the Bernstein-type pessimism and of variance-aware LSVI) and the corresponding analysis can be used in online settings too, to obtain tighter regret bounds for private algorithms. For offline RL problems, we plan to consider more general function approximation and differentially private (deep) offline RL, which will bridge the gap between theory and practice in offline RL applications. Many techniques we developed could be adapted to these more general settings.

11 We keep the state space $\mathcal{S} = \{1, 2\}$, action space $\mathcal{A} = \{1, \ldots, 100\}$ and feature map of state-action pairs, while we choose stochastic transitions (instead of the original deterministic transitions) and a more complex reward.
Acknowledgments The research is partially supported by NSF Awards #2007117 and #2048091. The authors would like to thank Ming Yin for helpful discussions. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312 2320, 2011. Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, pages 32 40. PMLR, 2017. Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463 474. PMLR, 2020. Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263 272. JMLR. org, 2017. Borja Balle, Maziar Gomrokchi, and Doina Precup. Differentially private policy evaluation. In International Conference on Machine Learning, pages 2130 2138. PMLR, 2016. Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? ar Xiv preprint ar Xiv:1905.12298, 2019. Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123 132, 2021. Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635 658. Springer, 2016. Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283 1294. PMLR, 2020. Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267 284, 2019. Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, and Liwei Wang. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, pages 1757 1767. PMLR, 2020. Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493 507, 1952. Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. ar Xiv preprint ar Xiv:2112.10599, 2021. Sayak Ray Chowdhury, Xingyu Zhou, and Ness Shroff. Adaptive control of differentially private linear quadratic systems. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 485 490. IEEE, 2021. Chris Cundy and Stefano Ermon. Privacy-constrained policies via mutual information regularized policy gradients. ar Xiv preprint ar Xiv:2012.15019, 2020. Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713 5723, 2017. Cynthia Dwork and Guy N Rothblum. Concentrated differential privacy. ar Xiv preprint ar Xiv:1603.01887, 2016. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265 284. Springer, 2006. 
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014. Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015. Pratik Gajane, Tanguy Urvoy, and Emilie Kaufmann. Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, pages 387 412. PMLR, 2018. Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. ar Xiv preprint ar Xiv:2101.00494, 2021. Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021. Abhradeep Guha Thakurta and Adam Smith. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013. Bingshan Hu, Zhiming Huang, and Nishant A Mehta. Optimal algorithms for private online learning in a stochastic environment. ar Xiv preprint ar Xiv:2102.07929, 2021. Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870 4879. PMLR, 2020a. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143. PMLR, 2020b. Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084 5096. PMLR, 2021. Hyeoneui Kim, Elizabeth Bell, Jihoon Kim, Amy Sitapati, Joe Ramsdell, Claudiu Farcas, Dexter Friedman, Stephanie Feudjio Feupe, and Lucila Ohno-Machado. iconcur: informed consent for clinical data and bio-sample use for research. Journal of the American Medical Informatics Association, 24(2):380 387, 2017. Jonathan Lebensold, William Hamilton, Borja Balle, and Doina Precup. Actor critic with differentially private critic. ar Xiv preprint ar Xiv:1910.05876, 2019. Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes. ar Xiv preprint ar Xiv:2110.10133, 2021. Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. In Uncertainty in Artificial Intelligence, 2019. Paul Luyo, Evrard Garcelon, Alessandro Lazaric, and Matteo Pirotta. Differentially private exploration in reinforcement learning with linear representation. ar Xiv preprint ar Xiv:2112.01585, 2021. Sunil Madhow, Dan Qiao, Ming Yin, and Yu-Xiang Wang. Offline policy evaluation for reinforcement learning with adaptively collected data. ar Xiv preprint ar Xiv:2306.14063, 2023. Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. Conference on Learning Theory, 2009. Yifei Min, Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Variance-aware off-policy evaluation with linear function approximation. Advances in neural information processing systems, 34, 2021. George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146 181, 1988. Dung Daniel Ngo, Giuseppe Vietri, and Zhiwei Steven Wu. Improved regret for differentially private exploration in linear mdp. 
ar Xiv preprint ar Xiv:2202.01292, 2022. Hajime Ono and Tsubasa Takahashi. Locally private distributed reinforcement learning. ar Xiv preprint ar Xiv:2001.11718, 2020. Dan Qiao and Yu-Xiang Wang. Near-optimal differentially private reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 9914 9940. PMLR, 2023a. Dan Qiao and Yu-Xiang Wang. Near-optimal deployment efficiency in reward-free reinforcement learning with linear function approximation. In The Eleventh International Conference on Learning Representations, 2023b. Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In Proceedings of the 39th International Conference on Machine Learning, pages 18031 18061. PMLR, 2022. Dan Qiao, Ming Yin, and Yu-Xiang Wang. Logarithmic switching cost in reinforcement learning beyond linear mdps. ar Xiv preprint ar Xiv:2302.12456, 2023. Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147 163, 2017. Rachel Redberg and Yu-Xiang Wang. Privately publishable per-instance privacy. Advances in Neural Information Processing Systems, 34, 2021. Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70 76, 2017. Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018. Karthik Sridharan. A gentle introduction to concentration inequalities. Dept. Comput. Sci., Cornell Univ., Tech. Rep, 2002. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797 3820, 2014. Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. In International Conference on Machine Learning, pages 9754 9764. PMLR, 2020. Baoxiang Wang and Nidhi Hegde. Privacy-preserving q-learning with functional noise in continuous spaces. Advances in Neural Information Processing Systems, 32, 2019. Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? International Conference on Learning Representations, 2021a. Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524 13536, 2021b. Tengyang Xie, Philip S Thomas, and Gerome Miklau. Privacy preserving off-policy evaluation. ar Xiv preprint ar Xiv:1902.00174, 2019. Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 2021a. Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 2021b. 
Jianyu Xu, Dan Qiao, and Yu-Xiang Wang. Doubly fair dynamic pricing. In International Conference on Artificial Intelligence and Statistics, pages 9941 9975. PMLR, 2023. Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948 3958. PMLR, 2020. Ming Yin and Yu-Xiang Wang. Optimal uniform ope and model-based offline reinforcement learning in time-homogeneous, reward-free and task-agnostic settings. Advances in neural information processing systems, 2021a. Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021b. Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1567 1575. PMLR, 2021. Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. ar Xiv preprint ar Xiv:2203.05804, 2022. Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch rl can be exponentially harder than online rl. In International Conference on Machine Learning, pages 12287 12297. PMLR, 2021. Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 2021. Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2):25 40, 2020. Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. Differentially private linear sketches: Efficient implementations and applications. Advances in Neural Information Processing Systems, 35:12691 12704, 2022. Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300 12310, 2020. Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532 4576. PMLR, 2021. Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:2201.07052, 2022. 
A Notation List A.1 Notations for tabular MDP H log 4HS2A n The original counts of visitation n The noisy counts, as defined in (2) en Final choice of private counts, as defined in (3) e P Private estimate of transition kernel, as defined in (5) b P Non-private estimate of transition kernel, as defined in (15) δ ρ Budget for z CDP δ Failure probability A.2 Notations for linear MDP 5Hd log( 10Hd 2 + log(5c1H/δ) d κ3/2 + H3 bΛh PK k=1 ϕ(sk h, ak h)ϕ(sk h, ak h) /eσ2 h(sk h, ak h) + λId eΛh PK k=1 ϕ(sk h, ak h)ϕ(sk h, ak h) /eσ2 h(sk h, ak h) + λId + K2 eΛp h Eµ,h[eσ 2 h (s, a)ϕ(s, a)ϕ(s, a) ] Λh PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /σ2 e Vh+1(sτ h, aτ h) + λI Λp h Eµ,h[σ 2 e Vh+1(s, a)ϕ(s, a)ϕ(s, a) ] Λ h PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /σ2 V h+1(sτ h, aτ h) + λI Σh PK τ=1 ϕ( sτ h, aτ h)ϕ( sτ h, aτ h) + λId eΣh PK τ=1 ϕ( sτ h, aτ h)ϕ( sτ h, aτ h) + λId + K1 Σp h Eµ,h ϕ(s, a)ϕ(s, a) κ minh λmin(Σp h) σ2 V (s, a) max{1, Var Ph(V )(s, a)} for any V σ 2 h (s, a) max 1, Var Ph V h+1(s, a) eσ2 h(s, a) max{1, g Varh e Vh+1(s, a)} M1 max{2λ, 128 log(2d H/δ), 128H4 log(2d H/δ) M2 max{ e O(H12d3/κ5), e O(H14d/κ5)} M3 max 512H4 log( 2d H δ ) κ2 , 4λH2 M4 max{ H2L2 ρ Budget for z CDP δ Failure probability (not the δ of (ϵ, δ)-DP) ξ sup V [0,H], s Ph(s,a), h [H] rh+V (s ) (Th V )(s,a) B Extended related work Online reinforcement learning under JDP or LDP. For online RL, some recent works analyze this setting under Joint Differential Privacy (JDP), which requires the RL agent to minimize regret while handling user s raw data privately. Under tabular MDP, Vietri et al. [2020] design PUCB by revising UBEV [Dann et al., 2017]. Private-UCB-VI [Chowdhury and Zhou, 2021] results from UCBVI (with bonus-1) [Azar et al., 2017]. However, both works privatize Hoeffding type bonus, which lead to sub-optimal regret bound. Under linear MDP, Private LSVI-UCB [Ngo et al., 2022] and Privacy-Preserving LSVI-UCB [Luyo et al., 2021] are private versions of LSVI-UCB [Jin et al., 2020b], while Lin Opt-VI-Reg [Zhou, 2022] and Privacy-Preserving UCRL-VTR [Luyo et al., 2021] generalize UCRL-VTR [Ayoub et al., 2020]. However, these works are usually based on the LSVI technique [Jin et al., 2020b] (unweighted ridge regression), which does not ensure optimal regret bound. In addition to JDP, another common privacy guarantee for online RL is Local Differential Privacy (LDP), LDP is a stronger definition of DP since it requires that the user s data is protected before the RL agent has access to it. Under LDP, Garcelon et al. [2021] reach a regret lower bound and design LDP-OBI which has matching regret upper bound. The result is generalized by Liao et al. [2021] to linear mixture setting. Later, Luyo et al. [2021] provide an unified framework for analyzing JDP and LDP under linear setting. Some other differentially private learning algorithms. There are some other works about differentially private online learning [Guha Thakurta and Smith, 2013, Agarwal and Singh, 2017, Hu et al., 2021] and various settings of bandit [Shariff and Sheffet, 2018, Gajane et al., 2018, Basu et al., 2019, Zheng et al., 2020, Chen et al., 2020, Tossou and Dimitrakakis, 2017]. For the reinforcement learning setting, Wang and Hegde [2019] propose privacy-preserving Q-learning to protect the reward information. Ono and Takahashi [2020] study the problem of distributed reinforcement learning under LDP. Lebensold et al. [2019] present an actor critic algorithm with differentially private critic. 
Cundy and Ermon [2020] tackle DP-RL under the policy gradient framework. Chowdhury et al. [2021] consider the adaptive control of differentially private linear quadratic (LQ) systems. Zhao et al. [2022] designed differentially private linear sketch algorirthms. Offline reinforcement learning under tabular MDP. Under tabular MDP, there are several works achieving optimal sub-optimality/sample complexity bounds under different coverage assumptions. For the problem of off-policy evaluation (OPE), Yin and Wang [2020] uses Tabular-MIS estimator to achieve asymptotic efficiency. In addition, the idea of uniform OPE is used to achieve the optimal sample complexity O(H3/dmϵ2) [Yin et al., 2021] for non-stationary MDP and the optimal sample complexity O(H2/dmϵ2) [Yin and Wang, 2021a] for stationary MDP, where dm is the lower bound for state-action occupancy. Such uniform convergence idea also supports some works regarding online exploration [Jin et al., 2020a, Qiao et al., 2022, Xu et al., 2023]. For offline RL with single concentrability assumption, Xie et al. [2021b] arrive at the optimal sample complexity O(H3SC /ϵ2). Recently, Yin and Wang [2021b] propose APVI which can lead to instance-dependent sub-optimality bound, which subsumes previous optimal results under several assumptions. Madhow et al. [2023] consider offline policy evaluation for adaptively collected datasets. Offline reinforcement learning under linear MDP. Recently, many works focus on offline RL under linear representation. Jin et al. [2021] present PEVI which applies the idea of pessimistic value iteration (the idea originates from [Jin et al., 2020b]), and PEVI is provably efficient for offline RL under linear MDP. Yin et al. [2022] improve the sub-optimality bound in [Jin et al., 2021] by replacing LSVI by variance-weighted LSVI. Xie et al. [2021a] consider Bellman consistent pessimism for general function approximation, and their result improves the sample complexity in [Jin et al., 2021] by order O(d) (shown in Theorem 3.2). However, there is no improvement on horizon dependence. Zanette et al. [2021] propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle. Besides, Wang et al. [2021a], Zanette [2021] study the statistical hardness of offline RL with linear representations by presenting exponential lower bounds. When relaxing the offline setting to low adaptive RL, Gao et al. [2021], Wang et al. [2021b], Qiao and Wang [2023b], Qiao et al. [2023] designed algorithms with low adaptivity. C Proof of Theorem 3.4 C.1 Proof sketch Since the whole proof for privacy guarantee is not very complex, we present it in Section C.2 below and only sketch the proof for suboptimality bound. First of all, we bound the scale of noises we add to show that the en derived from (3) are close to real visitation numbers. Therefore, denoting the non-private empirical transition kernel by b P (detailed definition in (15)), we can show that e P b P 1 and | p Var e P (V ) p Var b P (V )| are small. Next, resulting from the conditional independence of e Vh+1 and e Ph, we apply Empirical Bernstein s inequality to get |( e Ph Ph)e Vh+1| q Var e P (e Vh+1)/ensh,ah + SHEρ/ensh,ah. Together with our definition of private pessimism and the key lemma: extended value difference (Lemma E.7 and E.8), we can bound the suboptimality of our output policy bπ by: (sh,ah) Ch dπ h (sh, ah) v u u t Var e Ph( |sh,ah)(e Vh+1( )) ensh,ah + SHEρ/ensh,ah. 
(12) Finally, we further bound the above suboptimality via replacing private statistics by non-private ones. Specifically, we replace en by n, e P by P and e V by V . Due to (12), we have e V V q Together with the upper bounds of e P b P 1 and | p Var e P (V ) p Var b P (V )|, we have v u u t Var e Ph( |sh,ah)(e Vh+1( )) Var e Ph( |sh,ah)(V h+1( )) ensh,ah + 1 n dm Var b Ph( |sh,ah)(V h+1( )) ensh,ah + 1 n dm Var Ph( |sh,ah)(V h+1( )) ensh,ah + 1 n dm Var Ph( |sh,ah)(V h+1( )) ndµ h(sh, ah) + 1 n dm . The final bound using non-private statistics results from (12) and (13). C.2 Proof of the privacy guarantee The privacy guarantee of DP-APVI (Algorithm 1) is summarized by Lemma C.1 below. Lemma C.1 (Privacy analysis of DP-APVI (Algorithm 1)). DP-APVI (Algorithm 1) satisfies ρ-z CDP. Proof of Lemma C.1. The ℓ2 sensitivity of {nsh,ah} is 2H. According to Lemma 2.7, the Gaussian Mechanism used on {nsh,ah} with σ2 = 2H ρ satisfies ρ 2-z CDP. Similarly, the Gaussian Mechanism used on {nsh,ah,sh+1} with σ2 = 2H ρ also satisfies ρ 2-z CDP. Combining these two results, due to the composition of z CDP (Lemma E.16), the construction of {n } satisfies ρ-z CDP. Finally, DP-APVI satisfies ρ-z CDP because the output bπ is post processing of {n }. C.3 Proof of the sub-optimality bound C.3.1 Utility analysis First of all, the following Lemma C.2 gives a high probability bound for |n n|. Lemma C.2. Let Eρ = 2 H log 4HS2A δ ρ , then with probability 1 δ, for all sh, ah, sh+1, it holds that |n sh,ah nsh,ah| Eρ 2 , |n sh,ah,sh+1 nsh,ah,sh+1| Eρ Proof of Lemma C.2. The inequalities directly result from the concentration inequality of Gaussian distribution and a union bound. According to the utility analysis above, we have the following Lemma C.3 giving a high probability bound for |en n|. Lemma C.3. Under the high probability event in Lemma C.2, for all sh, ah, sh+1, it holds that |ensh,ah nsh,ah| Eρ, |ensh,ah,sh+1 nsh,ah,sh+1| Eρ. Proof of Lemma C.3. When the event in Lemma C.2 holds, the original counts {nsh,ah,s }s S is a feasible solution to the optimization problem, which means that max s |ensh,ah,s n sh,ah,s | max s |nsh,ah,s n sh,ah,s | Eρ Due to the second part of (14), it holds that for any sh, ah, sh+1, |ensh,ah,sh+1 nsh,ah,sh+1| |ensh,ah,sh+1 n sh,ah,sh+1| + |n sh,ah,sh+1 nsh,ah,sh+1| Eρ. For the second part, because of the constraints in the optimization problem, it holds that |ensh,ah n sh,ah| Eρ Due to the first part of (14), it holds that for any sh, ah, |ensh,ah nsh,ah| |ensh,ah n sh,ah| + |n sh,ah nsh,ah| Eρ. Let the non-private empirical estimate be: b Ph(s |sh, ah) = nsh,ah,s nsh,ah , (15) if nsh,ah > 0 and b Ph(s |sh, ah) = 1 S otherwise. We will show that the private transition kernel e P is close to b P by the Lemma C.4 and Lemma C.5 below. Lemma C.4. Under the high probability event of Lemma C.3, for sh, ah, if ensh,ah 3Eρ, it holds that e Ph( |sh, ah) b Ph( |sh, ah) 1 5SEρ ensh,ah . (16) Proof of Lemma C.4. If ensh,ah 3Eρ and the conclusion in Lemma C.3 hold, we have e Ph( |sh, ah) b Ph( |sh, ah) 1 X e Ph(s |sh, ah) b Ph(s |sh, ah) ensh,ah,s + Eρ ensh,ah Eρ ensh,ah,s 1 ensh,ah + 2Eρ en2sh,ah (ensh,ah,s + Eρ) ensh,ah,s ensh,ah + 2Eρ ensh,ah + 2SE2 ρ en2sh,ah The second inequality is because ensh,ah,s Eρ ensh,ah+Eρ nsh,ah,s nsh,ah ensh,ah,s +Eρ ensh,ah Eρ and ensh,ah,s +Eρ ensh,ah ensh,ah,s ensh,ah ensh,ah,s Eρ ensh,ah+Eρ . The third inequality is because of Lemma E.6. The last inequality is because ensh,ah 3Eρ. Lemma C.5. 
Let V RS be any function with V H, under the high probability event of Lemma C.3, for sh, ah, if ensh,ah 3Eρ, it holds that Var b Ph( |sh,ah)(V ) q Var e Ph( |sh,ah)(V ) 4H SEρ ensh,ah . (18) Proof of Lemma C.5. For sh, ah such that ensh,ah 3Eρ, we use e P( ) and b P( ) instead of e Ph( |sh, ah) and b Ph( |sh, ah) for simplicity. Because of Lemma C.4, we have e P( ) b P( ) 1 5SEρ Therefore, it holds that q Var b P ( )(V ) q Var e P ( )(V ) q |Var b P ( )(V ) Var e P ( )(V )| b P(s ) e P(s ) V (s )2 + h b P(s ) + e P(s ) i V (s ) b P(s ) e P(s ) V (s ) H2 e P( ) b P( ) 1 + 2H2 e P( ) b P( ) 1 SEρ ensh,ah . (19) The second inequality is due to the definition of variance. C.3.2 Validity of our pessimistic penalty Now we are ready to present the key lemma (Lemma C.6) below to justify our use of Γ as the pessimistic penalty. Lemma C.6. Under the high probability event of Lemma C.3, with probability 1 δ, for any sh, ah, if ensh,ah 3Eρ (which implies nsh,ah > 0), it holds that ( e Ph Ph) e Vh+1(sh, ah) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 16SHEρ ι ensh,ah , (20) where e V is the private version of estimated V function, which appears in Algorithm 1 and ι = log(HSA/δ). Proof of Lemma C.6. ( e Ph Ph) e Vh+1(sh, ah) ( e Ph b Ph) e Vh+1(sh, ah) + ( b Ph Ph) e Vh+1(sh, ah) H e Ph( |sh, ah) b Ph( |sh, ah) 1 + ( b Ph Ph) e Vh+1(sh, ah) ensh,ah + ( b Ph Ph) e Vh+1(sh, ah) , where the third inequality is due to Lemma C.4. Next, recall bπh+1 in Algorithm 1 is computed backwardly therefore only depends on sample tuple from time h + 1 to H. As a result, e Vh+1 = Qh+1, bπh+1 also only depends on the sample tuple from time h + 1 to H and some Gaussian noise that is independent to the offline dataset. On the other side, by the definition, b Ph only depends on the sample tuples from time h to h + 1. Therefore e Vh+1 and b Ph are Conditionally independent (This trick is also used in [Yin et al., 2021] and [Yin and Wang, 2021b]), by Empirical Bernstein s inequality (Lemma E.4) and a union bound, with probability 1 δ, for all sh, ah such that ensh,ah 3Eρ, ( b Ph Ph) e Vh+1(sh, ah) v u u t2Var b Ph( |sh,ah)(e Vh+1( )) ι nsh,ah + 7H ι 3nsh,ah . (22) Therefore, we have ( e Ph Ph) e Vh+1(sh, ah) v u u t2Var b Ph( |sh,ah)(e Vh+1( )) ι nsh,ah + 7H ι 3nsh,ah + 5SHEρ v u u t2Var b Ph( |sh,ah)(e Vh+1( )) ι nsh,ah + 9SHEρ ι v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι SEρ ι ensh,ah nsh,ah v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι nsh,ah + 16SHEρ ι v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 16SHEρ ι The second and forth inequality is because when ensh,ah 3Eρ, nsh,ah 2ensh,ah 3 . Specifically, these two inequalities are also because usually we only care about the case when SEρ 1, which is equivalent to ρ being not very large. The third inequality is due to Lemma C.5. The last inequality is due to Lemma C.3. Note that the previous Lemmas rely on the condition that en is not very small (ensh,ah 3Eρ). Below we state the Multiplicative Chernoff bound (Lemma C.7 and Remark C.8) to show that under our condition in Theorem 3.4, for (sh, ah) Ch, ensh,ah will be larger than 3Eρ with high probability. Lemma C.7 (Lemma B.1 in [Yin and Wang, 2021b]). For any 0 < δ < 1, there exists an absolute constant c1 such that when total episode n > c1 1/ dm log(HSA/δ), then with probability 1 δ, h [H] nsh,ah n dµ h(sh, ah)/2, (sh, ah) Ch. Furthermore, we denote E := {nsh,ah n dµ h(sh, ah)/2, (sh, ah) Ch, h [H].} (24) then equivalently P(E) > 1 δ. 
In addition, we denote E := {nsh,ah 3 2n dµ h(sh, ah), (sh, ah) Ch, h [H].} (25) then similarly P(E ) > 1 δ. Remark C.8. According to Lemma C.7, for any failure probability δ, there exists some constant c1 > 0 such that when n c1Eρ ι dm , with probability 1 δ, for all (sh, ah) Ch, nsh,ah 4Eρ. Therefore, under the condition of Theorem 3.4 and the high probability events in Lemma C.3 and Lemma C.7, it holds that for all (sh, ah) Ch, ensh,ah 3Eρ while for all (sh, ah) / Ch, ensh,ah Eρ. Lemma C.9. Define (Th V )( , ) := rh( , ) + (Ph V )( , ) for any V RS. Note bπ, Qh, e Vh are defined in Algorithm 1 and denote ξh(s, a) = (Th e Vh+1)(s, a) Qh(s, a). Then it holds that V π 1 (s) V bπ 1 (s) h=1 Eπ [ξh(sh, ah) | s1 = s] h=1 Ebπ [ξh(sh, ah) | s1 = s] . (26) Furthermore, (26) holds for all V π h (s) V bπ h (s). Proof of Lemma C.9. Lemma C.9 is a direct corollary of Lemma E.8 with π = π , b Qh = Qh, b Vh = e Vh and bπ = bπ in Algorithm 1, we can obtain this result since by the definition of bπ in Algorithm 1, Qh (sh, ) , πh ( |sh) bπh ( |sh) 0. The proof for V π h (s) V bπ h (s) is identical. Next we prove the asymmetric bound for ξh, which is the key to the proof. Lemma C.10 (Private version of Lemma D.6 in [Yin and Wang, 2021b]). Denote ξh(s, a) = (Th e Vh+1)(s, a) Qh(s, a), where e Vh+1 and Qh are the quantities in Algorithm 1 and Th(V ) := rh + Ph V for any V RS. Then under the high probability events in Lemma C.3 and Lemma C.6, for any h, sh, ah such that ensh,ah > 3Eρ, we have 0 ξh(sh, ah) = (Th e Vh+1)(sh, ah) Qh(sh, ah) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 32SHEρ ι where ι = log(HSA/δ). Proof of Lemma C.10. The first inequality: We first prove ξh(sh, ah) 0 for all (sh, ah), such that ensh,ah 3Eρ. Indeed, if b Qp h(sh, ah) < 0, then Qh(sh, ah) = 0. In this case, ξh(sh, ah) = (Th e Vh+1)(sh, ah) 0 (note e Vh 0 by the definition). If b Qp h(sh, ah) 0, then by definition Qh(sh, ah) = min{ b Qp h(sh, ah), H h + 1}+ b Qp h(sh, ah) and this implies ξh(sh, ah) (Th e Vh+1)(sh, ah) b Qp h(sh, ah) =(Ph e Ph) e Vh+1(sh, ah) + Γh(sh, ah) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ 16SHEρ ι ensh,ah + Γh(sh, ah) = 0, where the second inequality uses Lemma C.6, and the last equation uses Line 5 of Algorithm 1. The second inequality: Then we prove ξh(sh, ah) 2 2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 32SHEρ ι ensh,ah for all (sh, ah) such that ensh,ah 3Eρ. First, since by construction e Vh H h + 1 for all h [H], this implies b Qp h = e Qh Γh e Qh = rh + ( e Ph e Vh+1) 1 + (H h) = H h + 1 which is because rh 1 and e Ph is a probability distribution. Therefore, we have the equivalent definition Qh := min{ b Qp h, H h + 1}+ = max{ b Qp h, 0} b Qp h. Then it holds that ξh(sh, ah) = (Th e Vh+1)(sh, ah) Qh(sh, ah) (Th e Vh+1)(sh, ah) b Qp h(sh, ah) =(Th e Vh+1)(sh, ah) e Qh(sh, ah) + Γh(sh, ah) =(Ph e Ph) e Vh+1(sh, ah) + Γh(sh, ah) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 16SHEρ ι ensh,ah + Γh(sh, ah) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 32SHEρ ι The proof is complete by combining the two parts. C.3.3 Reduction to augmented absorbing MDP Before we prove the theorem, we need to construct an augmented absorbing MDP to bridge e V and V . According to Line 5 in Algorithm 1, the locations with ensh,ah Eρ is heavily penalized with penalty of order e O(H). Therefore we can prove that under the high probability event in Remark C.8, dbπ h(sh, ah) > 0 only if dµ h(sh, ah) > 0 by induction, where bπ is the output of Algorithm 1. The conclusion holds for h = 1. 
Assume it holds for some h > 1 that dbπ h(sh, ah) > 0 only if dµ h(sh, ah) > 0, then for any sh+1 S such that dbπ h+1(sh+1) > 0, it holds that dµ h+1(sh+1) > 0, which leads to the conclusion that dbπ h+1(sh+1, ah+1) > 0 only if dµ h+1(sh+1, ah+1) > 0. To summarize, we have dπ0 h (sh, ah) > 0 only if dµ h(sh, ah) > 0, π0 {π , bπ}. (27) Let us define M by adding one absorbing state s h for all h {2, . . . , H}, therefore the augmented state space S = S {s h} and the transition and reward is defined as follows: (recall Ch := {(sh, ah) : dµ h(sh, ah) > 0}) P h( | sh, ah) = ( Ph( | sh, ah) sh, ah Ch, δs h+1 sh = s h or sh, ah / Ch, r h(sh, ah) = rh(sh, ah) sh, ah Ch 0 sh = s h or sh, ah / Ch and we further define for any π, V π h (s) = E π , v π = E π h [H], (28) where E means taking expectation under the absorbing MDP M . Note that because π and bπ are fully covered by µ (27), it holds that v π = vπ , v bπ = vbπ. (29) Define (T h V )( , ) := r h( , ) + (P h V )( , ) for any V RS+1. Note bπ, Qh, e Vh are defined in Algorithm 1 (we extend the definition by letting e Vh(s h) = 0 and Qh(s h, ) = 0) and denote ξ h(s, a) = (T h e Vh+1)(s, a) Qh(s, a). Using identical proof to Lemma C.9, we have V π 1 (s) V bπ 1 (s) h=1 E π h ξ h(sh, ah) | s1 = s i h=1 E bπ h ξ h(sh, ah) | s1 = s i , (30) where V π 1 is defined in (28). Furthermore, (30) holds for all V π h (s) V bπ h (s). C.3.4 Finalize our result with non-private statistics For those (sh, ah) Ch, ξ h(sh, ah) = rh(sh, ah) + Ph e Vh+1(sh, ah) Qh(sh, ah) = ξh(sh, ah). For those (sh, ah) / Ch or sh = s h, we have ξ h(sh, ah) = 0. Therefore, by (30) and Lemma C.10, under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, we have for all t [H], s S (S does not include the absorbing state s t), V π t (s) V bπ t (s) h=t E π h ξ h(sh, ah) | st = s i h=t E bπ h ξ h(sh, ah) | st = s i h=t E π h ξ h(sh, ah) | st = s i 0 v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ensh,ah Eρ + 32SHEρ ι ensh,ah | st = s 1 ((sh, ah) Ch) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι nsh,ah 2Eρ + 32SHEρ ι nsh,ah Eρ | st = s 1 ((sh, ah) Ch) v u u t Var e Ph( |sh,ah)(e Vh+1( )) ι nsh,ah + 128SHEρ ι 3nsh,ah | st = s 1 ((sh, ah) Ch) v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ndµ h(sh, ah) + 256SHEρ ι 3ndµ h(sh, ah) | st = s 1 ((sh, ah) Ch) The second and third inequality are because of Lemma C.10, Remark C.8 and the the fact that either ξ = 0 or ξ = ξ while (sh, ah) Ch. The forth inequality is due to Lemma C.3. The fifth inequality is because of Remark C.8. The last inequality is by Lemma C.7. Below we present a crude bound of V π t (s) e Vt(s) , which can be further used to bound the main term in the main result. Lemma C.11 (Self-bounding, private version of Lemma D.7 in [Yin and Wang, 2021b]). Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, it holds that for all t [H] and s S, V π t (s) e Vt(s) 4 n dm + 256SH2Eρ ι where dm is defined in Theorem 3.4. Proof of Lemma C.11. According to (31), since Var e Ph( |sh,ah)(e Vh+1( )) H2, we have for all t [H], V π t (s) V bπ t (s) 4 n dm + 256SH2Eρ ι Next, apply Lemma E.7 by setting π = bπ, π = π , b Q = Q, b V = e V under M , then we have V π t (s) e Vt(s) = h=t E π h ξ h(sh, ah) | st = s i + h=t E π Qh (sh, ) , π h ( |sh) bπh ( |sh) | st = s h=t E π h ξ h(sh, ah) | st = s i n dm + 256SH2Eρ ι Also, apply Lemma E.7 by setting π = π = bπ, b Q = Q, b V = e V under M , then we have e Vt(s) V bπ t (s) = h=t E bπ h ξ h(sh, ah) | st = s i 0. (34) The proof is complete by combing (32), (33) and (34). 
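To make the private quantities used throughout this section concrete, the following is a minimal NumPy sketch of how the noisy counts, a private transition estimate, and a Bernstein-style penalty of the shape appearing in Lemma C.10 could be assembled. This is a sketch under our own simplifications: the clip-and-rescale projection stands in for the consistency optimization in (3), and the constants in E_ρ and c are illustrative rather than taken verbatim from Algorithm 1.

```python
import numpy as np

def private_counts(n_sa, n_sas, rho, H, delta, rng=np.random.default_rng(0)):
    """Gaussian-mechanism release of visitation counts (sigma^2 = 2H/rho, cf. Lemma C.1).

    The consistency step of Algorithm 1 (the optimization in (3)) is replaced here by a
    simple clip-and-rescale projection; constants are illustrative only.
    """
    S, A = n_sa.shape
    sigma = np.sqrt(2.0 * H / rho)
    noisy_sa = n_sa + rng.normal(0.0, sigma, size=n_sa.shape)
    noisy_sas = n_sas + rng.normal(0.0, sigma, size=n_sas.shape)
    # High-probability bound on the added noise (up to constants), cf. Lemma C.2.
    E_rho = 2.0 * sigma * np.sqrt(2.0 * np.log(4 * H * S * S * A / delta))
    # Crude consistency projection: nonnegative counts rescaled toward the released marginal.
    tilde_sas = np.clip(noisy_sas, 0.0, None)
    tilde_sa = np.clip(noisy_sa, 1e-12, None)
    tilde_sas *= (tilde_sa / np.maximum(tilde_sas.sum(axis=-1), 1e-12))[..., None]
    return tilde_sa, tilde_sas, E_rho

def bernstein_penalty(P_tilde, V_next, n_tilde, E_rho, S, H, iota, c=16.0):
    """Bernstein-type private pessimism, shaped like the bound in Lemma C.10."""
    var = (P_tilde * V_next**2).sum(-1) - ((P_tilde * V_next).sum(-1))**2
    return (np.sqrt(2.0 * var * iota / np.maximum(n_tilde - E_rho, 1.0))
            + c * S * H * E_rho * iota / np.maximum(n_tilde, 1.0))
```

A private transition estimate for the penalty can be formed as tilde_sas / tilde_sa[..., None]; Algorithm 1 then subtracts the penalty from the Bellman backup, and the sketch only reproduces the shapes of the bounds derived above.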
Now we are ready to bound q Var e Ph( |sh,ah)(e Vh+1( )) by q Var Ph( |sh,ah)(V h+1( )). Under the high probability events in Lemma C.3, Lemma C.6 and Lemma C.7, with probability 1 δ, it holds that for all (sh, ah) Ch, q Var e Ph( |sh,ah)(e Vh+1( )) q Var e Ph( |sh,ah)(V h+1( )) + e Vh+1 V π Var e Ph( |sh,ah)(V h+1( )) + 4 n dm + 256SH2Eρ ι Var b Ph( |sh,ah)(V h+1( )) + 4 n dm + 256SH2Eρ ι SEρ ensh,ah Var b Ph( |sh,ah)(V h+1( )) + 4 n dm + 256SH2Eρ ι Var Ph( |sh,ah)(V h+1( )) + 4 n dm + 256SH2Eρ ι SEρ n dm + 3H r ι n dm Var Ph( |sh,ah)(V h+1( )) + 9 ιH2 n dm + 256SH2Eρ ι The second inequality is because of Lemma C.11. The third inequality is due to Lemma C.5. The forth inequality comes from Lemma C.3 and Remark C.8. The fifth inequality holds with probability 1 δ because of Lemma E.5 and a union bound. Finally, by plugging (35) into (31) and averaging over s1, we finally have with probability 1 4δ, vπ vbπ = v π v bπ v u u t2Var e Ph( |sh,ah)(e Vh+1( )) ι ndµ h(sh, ah) + 256SHEρ ι 3ndµ h(sh, ah) Var Ph( |sh,ah)(V h+1( )) ι ndµ h(sh, ah) + e O H3 + SH2Eρ (sh,ah) Ch dπ h (sh, ah) Var Ph( |sh,ah)(V h+1( )) ι ndµ h(sh, ah) + e O H3 + SH2Eρ (sh,ah) Ch dπ h (sh, ah) Var Ph( |sh,ah)(V h+1( )) ι ndµ h(sh, ah) + e O H3 + SH2Eρ where e O absorbs constants and Polylog terms. The first equation is due to (29). The first inequality is because of (31). The second inequality comes from (35) and our assumption that n dm c1H2. The second equation uses the fact that dπ h (sh, ah) = d π h (sh, ah), for all (sh, ah). The last equation is because for any (sh, ah, sh+1) such that dπ h (sh, ah) > 0 and Ph(sh+1|sh, ah) > 0, V h+1(sh+1) = V h+1(sh+1). C.4 Put everything together Combining Lemma C.1 and (36), the proof of Theorem 3.4 is complete. D Proof of Theorem 4.1 D.1 Proof sketch Since the whole proof for privacy guarantee is not very complex, we present it in Section D.2 below and only sketch the proof for suboptimality bound. First of all, by extended value difference (Lemma E.7 and E.8), we can convert bounding the suboptimality gap of v vbπ to bounding PH h=1 Eπ [Γh(sh, ah)], given that |(Th e Vh+1 e Th e Vh+1)(s, a)| Γh(s, a) for all s, a, h. To bound (Th e Vh+1 e Th e Vh+1)(s, a), according to our analysis about the upper bound of the noises we add, we can decompose (Th e Vh+1 e Th e Vh+1)(s, a) to lower order terms ( e O( 1 K )) and the following key quantity: ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) For the term above, we prove an upper bound of σ2 e Vh+1 eσ2 h , so we can convert eσ2 h to σ2 e Vh+1. Next, since Var h rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) | sτ h, aτ h i σ2 e Vh+1, we can apply Bernstein s inequality for self-normalized martingale (Lemma E.10) as in Yin et al. [2022] for deriving tighter bound. Finally, we replace the private statistics by non-private ones. More specifically, we convert σ2 e Vh+1 to σ 2 h (Λ 1 h to Λ 1 h ) by combining the crude upper bound of e V V and matrix concentrations. D.2 Proof of the privacy guarantee The privacy guarantee of DP-VAPVI (Algorithm 2) is summarized by Lemma D.1 below. Lemma D.1 (Privacy analysis of DP-VAPVI (Algorithm 2)). DP-VAPVI (Algorithm 2) satisfies ρ-z CDP. Proof of Lemma D.1. For PK τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2, the ℓ2 sensitivity is 2H2. For PK τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1) and PK τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h), the ℓ2 sensitivity is 2H. 
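A small numeric sketch of the corresponding zCDP accounting is given below. It is illustrative only: the values of H, ρ and δ are ours, the matrix noises K1, K2 are omitted, and the calibration σ = Δ₂/√(2ρ₀) is the standard Gaussian-mechanism rule referenced by Lemma 2.7.

```python
import math

def gaussian_sigma(l2_sensitivity, rho0):
    """Noise std so that the Gaussian mechanism satisfies rho0-zCDP: rho0 = sens^2 / (2 sigma^2)."""
    return l2_sensitivity / math.sqrt(2.0 * rho0)

def zcdp_to_dp(rho, delta):
    """Lemma E.15: rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

# Example accounting for DP-VAPVI with overall budget rho and horizon H:
H, rho, delta = 20, 0.1, 1e-5                  # illustrative values, not from the paper
rho0 = rho / (5 * H)                           # five released statistics per step, composed over H steps
sigma_phiV2 = gaussian_sigma(2 * H**2, rho0)   # counters with l2 sensitivity 2H^2
sigma_phiV = gaussian_sigma(2 * H, rho0)       # counters with l2 sensitivity 2H
print(zcdp_to_dp(rho, delta))                  # approximate (epsilon, delta)-DP guarantee
```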
Therefore according to Lemma 2.7, the use of Gaussian Mechanism (the additional noises ϕ1, ϕ2, ϕ3) ensures ρ0-z CDP for each counter. For PK τ=1 ϕ( sτ h, aτ h)ϕ( sτ h, aτ h) + λI and PK τ=1 ϕ (sτ h, aτ h) ϕ (sτ h, aτ h) /eσ2 h(sτ h, aτ h) + λI, according to Appendix D in [Redberg and Wang, 2021], the per-instance ℓ2 sensitivity is 2 sup ϕ: ϕ 2 1 2 sup ϕ: ϕ 2 1 i,j ϕ2 i ϕ2 j = 1 Therefore the use of Gaussian Mechanism (the additional noises K1, K2) also ensures ρ0-z CDP for each counter.12 Combining these results, according to Lemma E.17, the whole algorithm satisfies 5Hρ0 = ρ-z CDP. D.3 Proof of the sub-optimality bound D.3.1 Utility analysis and some preparation We begin with the following high probability bound of the noises we add. 12For more detailed explanation, we refer the readers to Appendix D of [Redberg and Wang, 2021]. Lemma D.2 (Utility analysis). Let L = 2H q d ρ0 log( 10Hd 5Hd log( 10Hd 2 + log(5c1H/δ) 2 + log(5c1H/δ) 3 for some universal constants c1, c2. Then with probability 1 δ, the following inequalities hold simultaneously: For all h [H], ϕ1 2 HL, ϕ2 2 L, ϕ3 2 L. For all h [H], K1, K2 are symmetric and positive definite and Ki 2 E, i {1, 2}. (38) Proof of Lemma D.2. The second line of (38) results from Lemma 19 in [Redberg and Wang, 2021] and Weyl s Inequality. The first line of (38) directly results from the concentration inequality for Guassian distribution and a union bound. Define the Bellman update error ζh(s, a) := (Th e Vh+1)(s, a) b Qh(s, a) and recall bπh(s) = argmaxπh b Qh(s, ), πh( | s) A, then because of Lemma E.8, V π 1 (s) V bπ 1 (s) h=1 Eπ [ζh(sh, ah) | s1 = s] h=1 Ebπ [ζh(sh, ah) | s1 = s] . (39) Define e Th e Vh+1( , ) = ϕ( , ) ewh. Then similar to Lemma C.10, we have the following lemma showing that in order to bound the sub-optimality, it is sufficient to bound the pessimistic penalty. Lemma D.3 (Lemma C.1 in [Yin et al., 2022]). Suppose with probability 1 δ, it holds for all s, a, h S A [H] that |(Th e Vh+1 e Th e Vh+1)(s, a)| Γh(s, a), then it implies s, a, h S A [H], 0 ζh(s, a) 2Γh(s, a). Furthermore, with probability 1 δ, it holds for any policy π simultaneously, V π 1 (s) V bπ 1 (s) h=1 2 Eπ [Γh(sh, ah) | s1 = s] . Proof of Lemma D.3. We first show given |(Th e Vh+1 e Th e Vh+1)(s, a)| Γh(s, a), then 0 ζh(s, a) 2Γh(s, a), s, a, h S A [H]. Step 1: The first step is to show 0 ζh(s, a), s, a, h S A [H]. Indeed, if Qh(s, a) 0, then by definition b Qh(s, a) = 0 and therefore ζh(s, a) := (Th e Vh+1)(s, a) b Qh(s, a) = (Th e Vh+1)(s, a) 0. If Qh(s, a) > 0, then b Qh(s, a) Qh(s, a) and ζh(s, a) :=(Th e Vh+1)(s, a) b Qh(s, a) (Th e Vh+1)(s, a) Qh(s, a) =(Th e Vh+1)(s, a) (e Th e Vh+1)(s, a) + Γh(s, a) 0. Step 2: The second step is to show ζh(s, a) 2Γh(s, a), s, a, h S A [H]. Under the assumption that |(Th e Vh+1 e Th e Vh+1)(s, a)| Γh(s, a), we have Qh(s, a) = (e Th e Vh+1)(s, a) Γh(s, a) (Th e Vh+1)(s, a) H h + 1, which implies that b Qh(s, a) = max( Qh(s, a), 0). Therefore, it holds that ζh(s, a) :=(Th e Vh+1)(s, a) b Qh(s, a) (Th e Vh+1)(s, a) Qh(s, a) =(Th e Vh+1)(s, a) (e Th e Vh+1)(s, a) + Γh(s, a) 2 Γh(s, a). For the last statement, denote F := {0 ζh(s, a) 2Γh(s, a), s, a, h S A [H]}. Note conditional on F, then by (39), V π 1 (s) V bπ 1 (s) PH h=1 2 Eπ[Γh(sh, ah) | s1 = s] holds for any policy π almost surely. Therefore, π, V π 1 (s) V bπ 1 (s) h=1 2 Eπ[Γh(sh, ah) | s1 = s]. 
π, V π 1 (s) V bπ 1 (s) h=1 2 Eπ[Γh(sh, ah) | s1 = s] π, V π 1 (s) V bπ 1 (s) h=1 2 Eπ[Γh(sh, ah) | s1 = s] π, V π 1 (s) V bπ 1 (s) h=1 2 Eπ[Γh(sh, ah) | s1 = s] P[F] = 1 P[F] 1 δ, which finishes the proof. D.3.2 Bound the pessimistic penalty By Lemma D.3, it remains to bound |(Th e Vh+1)(s, a) (e Th e Vh+1)(s, a)|. Suppose wh is the coefficient corresponding to the Th e Vh+1 (such wh exists by Lemma E.14), i.e. Th e Vh+1 = ϕ wh, and recall (e Th e Vh+1)(s, a) = ϕ(s, a) ewh, then: Th e Vh+1 (s, a) e Th e Vh+1 (s, a) = ϕ(s, a) (wh ewh) =ϕ(s, a) wh ϕ(s, a) eΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) + ϕ3 = ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) ϕ(s, a) bΛ 1 h ϕ3 | {z } (ii) + ϕ(s, a) (bΛ 1 h eΛ 1 h ) τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) + ϕ3 | {z } (iii) (40) where bΛh = eΛh K2 = PK τ=1 ϕ (sτ h, aτ h) ϕ (sτ h, aτ h) /eσ2 h(sτ h, aτ h) + λI. Term (ii) can be handled by the following Lemma D.4 Lemma D.4. Recall κ in Assumption 2.2. Under the high probability event in Lemma D.2, suppose K max 512H4 log( 2Hd δ ) κ2 , 4λH2 , then with probability 1 δ, for all s, a, h S A [H], it holds that ϕ(s, a) bΛ 1 h ϕ3 4H2L/κ Proof of Lemma D.4. Define eΛp h = Eµ,h[eσ 2 h (s, a)ϕ(s, a)ϕ(s, a) ]. Then because of Assumption 2.2 and eσh H, it holds that λmin(eΛp h) κ H2 . Therefore, due to Lemma E.13, we have with probability 1 δ, ϕ(s, a) bΛ 1 h ϕ3 ϕ(s, a) bΛ 1 h ϕ3 bΛ 1 h K ϕ(s, a) (eΛp h) 1 ϕ3 (eΛp h) 1 K (eΛp h) 1 The first inequality is because of Cauchy-Schwarz inequality. The second inequality holds with probability 1 δ due to Lemma E.13 and a union bound. The third inequality holds because a 2 A 2 a 2 = a 2 p A 2. The last inequality arises from (eΛp h) 1 = λmax((eΛp h) 1) = λ 1 min(eΛp h) H2 The difference between eΛ 1 h and bΛ 1 h can be bounded by the following Lemma D.5 Lemma D.5. Under the high probability event in Lemma D.2, suppose K 128H4 log 2d H δ κ2 , then with probability 1 δ, for all h [H], it holds that bΛ 1 h eΛ 1 h 4H4E/κ2 Proof of Lemma D.5. First of all, we have bΛ 1 h eΛ 1 h = bΛ 1 h (bΛh eΛh) eΛ 1 h bΛ 1 h bΛh eΛh eΛ 1 h λ 1 min(bΛh) λ 1 min(eΛh) E. The first inequality is because A B A B . The second inequality is due to Lemma D.2. Let bΛ h = 1 K bΛh, then because of Lemma E.12, with probability 1 δ, it holds that for all h [H], bΛ h Eµ,h[ϕ(s, a)ϕ(s, a) /eσ2 h(s, a)] λ which implies that when K 128H4 log 2d H δ κ2 , it holds that (according to Weyl s Inequality) λmin(bΛ h) λmin(Eµ,h[ϕ(s, a)ϕ(s, a) /eσ2 h(s, a)]) + λ K κ 2H2 κ 2H2 . Under this high probability event, we have λmin(bΛh) Kκ 2H2 and therefore λmin(eΛh) λmin(bΛh) Kκ 2H2 . Plugging these two results into (41), we have bΛ 1 h eΛ 1 h 4H4E/κ2 Then we can bound term (iii) by the following Lemma D.6 Lemma D.6. Suppose K max{ 128H4 log 2d H dκ }, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s, a, h S A [H], ϕ(s, a) (bΛ 1 h eΛ 1 h ) τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) + ϕ3 Proof of Lemma D.6. First of all, the left hand side is bounded by (bΛ 1 h eΛ 1 h ) τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) ! 2 + 4H4EL/κ2 due to Lemma D.5. 
Then the left hand side can be further bounded by (bΛ 1 h eΛ 1 h )ϕ (sτ h, aτ h) /eσh(sτ h, aτ h) 2 + 4H4EL/κ2 (bΛ 1 h eΛ 1 h ) ϕ (sτ h, aτ h) ϕ (sτ h, aτ h) eσ2 h(sτ h, aτ h) (bΛ 1 h eΛ 1 h ) K Tr (bΛ 1 h eΛ 1 h ) bΛh (bΛ 1 h eΛ 1 h ) + 4H4EL/κ2 Kd λmax (bΛ 1 h eΛ 1 h ) bΛh (bΛ 1 h eΛ 1 h ) + 4H4EL/κ2 Kd (bΛ 1 h eΛ 1 h ) bΛh (bΛ 1 h eΛ 1 h ) 2 + 4H4EL/κ2 Kd eΛ 1 h 2 eΛh bΛh 2 bΛ 1 h eΛ 1 h 2 + 4H4EL/κ2 K + 4H4EL/κ2 The first inequality is because a 2 = Tr(aa ). The second inequality is due to Cauchy-Schwarz inequality. The third inequality is because for positive definite matrix A, it holds that Tr(A) = Pd i=1 λi(A) dλmax(A). The equation is because for symmetric, positive definite matrix A, A 2 = λmax(A). The forth inequality is due to A B A B . The fifth inequality is because of Lemma D.2, Lemma D.5 and the statement in the proof of Lemma D.5 that λmin(eΛh) Kκ 2H2 . The last inequality uses the assumption that K Now the remaining part is term (i), we have ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 /eσ2 h(sτ h, aτ h) = ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) | {z } (iv) ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) We are able to bound term (iv) by the following Lemma D.7. Lemma D.7. Recall κ in Assumption 2.2. Under the high probability event in Lemma D.2, suppose K max 512H4 log( 2Hd δ ) κ2 , 4λH2 , then with probability 1 δ, for all s, a, h S A [H], ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) Proof of Lemma D.7. Recall Th e Vh+1 = ϕ wh and apply Lemma E.13, we obtain with probability 1 δ, for all s, a, h S A [H], ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) ϕ(s, a) wh ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) ϕ(sτ h, aτ h) wh/eσ2 h(sτ h, aτ h) = ϕ(s, a) wh ϕ(s, a) bΛ 1 h bΛh λI wh = λ ϕ(s, a) bΛ 1 h wh λ ϕ(s, a) bΛ 1 h wh bΛ 1 h K ϕ(s, a) (eΛp h) 1 wh (eΛp h) 1 d (eΛp h) 1 where eΛp h := Eµ,h eσh(s, a) 2ϕ(s, a)ϕ(s, a) . The first inequality applies Cauchy-Schwarz inequality. The second inequality holds with probability 1 δ due to Lemma E.13 and a union bound. The third inequality uses a 2 A 2 a 2 = a 2 p A 2 and wh 2H nally, as λmin(eΛp h) κ maxh,s,a eσh(s,a)2 κ H2 implies (eΛp h) 1 H2 κ , the last inequality holds. For term (v), denote: xτ = ϕ(sτ h,aτ h) eσh(sτ h,aτ h), ητ = rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) /eσh(sτ h, aτ h), then by Cauchy-Schwarz inequality, it holds that for all h, s, a [H] S A, ϕ(s, a) bΛ 1 h τ=1 ϕ (sτ h, aτ h) rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) /eσ2 h(sτ h, aτ h) ϕ(s, a) bΛ 1 h ϕ(s, a) ϕ(s, a) bΛ 1 h ϕ(s, a) by q ϕ(s, a) eΛ 1 h ϕ(s, a) using the following Lemma D.8. Lemma D.8. Suppose K max{ 128H4 log 2d H dκ }, under the high probability events in Lemma D.2 and Lemma D.5, it holds that for all s, a, h S A [H], q ϕ(s, a) bΛ 1 h ϕ(s, a) q ϕ(s, a) eΛ 1 h ϕ(s, a) + 2H2 Proof of Lemma D.8. ϕ(s, a) bΛ 1 h ϕ(s, a) = q ϕ(s, a) eΛ 1 h ϕ(s, a) + ϕ(s, a) (bΛ 1 h eΛ 1 h )ϕ(s, a) ϕ(s, a) eΛ 1 h ϕ(s, a) + bΛ 1 h eΛ 1 h 2 ϕ(s, a) eΛ 1 h ϕ(s, a) + r bΛ 1 h eΛ 1 h 2 ϕ(s, a) eΛ 1 h ϕ(s, a) + 2H2 The first inequality uses |a Aa| a 2 2 A . The second inequality is because for a, b 0, a + a + b. The last inequality uses Lemma D.5. Remark D.9. 
Similarly, under the same assumption in Lemma D.8, we also have for all s, a, h S A [H], q ϕ(s, a) eΛ 1 h ϕ(s, a) q ϕ(s, a) bΛ 1 h ϕ(s, a) + 2H2 D.3.3 An intermediate result: bounding the variance Before we handle PK τ=1 xτητ bΛ 1 h , we first bound suph eσ2 h σ2 e Vh+1 by the following Lemma D.10. Lemma D.10 (Private version of Lemma C.7 in [Yin et al., 2022]). Recall the definition of eσh( , )2 = max{1, g Varh e Vh+1( , )} in Algorithm 2 where g Varh e Vh+1 ( , ) = ϕ( , ), eβh [0,(H h+1)2] ϕ( , ), eθh [0,H h+1] 2 (eβh and eθh are defined in Algorithm 2) and σe Vh+1( , )2 := max{1, Var Ph e Vh+1( , )}. Suppose K max 512 log( 2Hd δ ) κ2 , 4λ κ , 128 log 2d H max{ 4L2 H2d3κ, 32E2 d2κ2 , 16λ2 d2κ }, under the high probability event in Lemma D.2, it holds that with probability 1 6δ, sup h ||eσ2 h σ2 e Vh+1|| 36 κK log (λ + K)2Kd H2 Proof of Lemma D.10. Step 1: The first step is to show for all h, s, a [H] S A, with probability 1 3δ, ϕ(s, a), eβh [0,(H h+1)2] Ph(e Vh+1)2(s, a) 12 κK log (λ + K)2Kd H2 Proof of Step 1. We can bound the left hand side by the following decomposition: ϕ(s, a), eβh [0,(H h+1)2] Ph(e Vh+1)2(s, a) ϕ(s, a), eβh Ph(e Vh+1)2(s, a) ϕ(s, a) eΣ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 + ϕ1 Ph(e Vh+1)2(s, a) ϕ(s, a) Σ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 ! Ph(e Vh+1)2(s, a) + ϕ(s, a) Σ 1 h ϕ1 | {z } (2) ϕ(s, a) (eΣ 1 h Σ 1 h ) τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 + ϕ1 ! | {z } (3) where Σh = eΣh K1 = PK τ=1 ϕ( sτ h, aτ h)ϕ( sτ h, aτ h) + λI. Similar to the proof in Lemma D.5, when K max{ 128 log 2d H dκ}, it holds that with probability 1 δ, for all h [H], λmin( Σh) Kκ 2 , λmin(eΣh) Kκ 2 , eΣ 1 h Σ 1 h 2 4E/κ2 (The only difference to Lemma D.5 is here Eµ,h[ϕ(s, a)ϕ(s, a) ] κ.) Under this high probability event, for term (2), it holds that for all h, s, a [H] S A, ϕ(s, a) Σ 1 h ϕ1 ϕ(s, a) Σ 1 h ϕ1 λ 1 min( Σh) HL 2HL/κ For term (3), similar to Lemma D.6, we have for all h, s, a [H] S A, ϕ(s, a) (eΣ 1 h Σ 1 h ) τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 + ϕ1 (The only difference to Lemma D.6 is that here e Vh+1(s)2 H2, ϕ1 2 HL, eΣ 1 h 2 2 Kκ and eΣ 1 h Σ 1 h 2 4E/κ2 We further decompose term (1) as below. ϕ(s, a) Σ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 ! Ph(e Vh+1)2(s, a) ϕ(s, a) Σ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 ϕ(s, a) Σ 1 h ( τ=1 ϕ( sτ h, aτ h)ϕ( sτ h, aτ h) + λI) Z S (e Vh+1)2(s )dνh(s ) ϕ(s, a) Σ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 Ph(e Vh+1)2( sτ h, aτ h) | {z } (4) + λ ϕ(s, a) Σ 1 h S (e Vh+1)2(s )dνh(s ) | {z } (5) For term (5), because K max 512 log( 2Hd δ ) κ2 , 4λ , by Lemma E.13 and a union bound, with probability 1 δ, for all h, s, a [H] S A, λ ϕ(s, a) Σ 1 h S (e Vh+1)2(s )dνh(s ) λ ϕ(s, a) Σ 1 h S (e Vh+1)2(s )dνh(s ) Σ 1 h K ϕ(s, a) (Σp h) 1 2 S (e Vh+1)2(s )dνh(s ) (Σp h) 1 4λ (Σp h) 1 H2 (48) where Σp h = Eµ,h[ϕ(s, a)ϕ(s, a) ] and λmin(Σp h) κ. For term (4), it can be bounded by the following inequality (because of Cauchy-Schwarz inequality). (4) ϕ(s, a) Σ 1 h τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 Ph(e Vh+1)2( sτ h, aτ h) Σ 1 h Bounding using covering. Note for any fix Vh+1, we can define xτ = ϕ( sτ h, aτ h) ( ϕ 2 1) and ητ = Vh+1( sτ h+1)2 Ph(Vh+1)2( sτ h, aτ h) is H2-subgaussian, by Lemma E.9 (where t = K and L = 1), it holds that with probability 1 δ, τ=1 ϕ( sτ h, aτ h) Vh+1( sτ h+1)2 Ph(Vh+1)2( sτ h, aτ h) Σ 1 h 2 log λ + K Let Nh(ϵ) be the minimal ϵ-cover (with respect to the supremum norm) of Vh := Vh : Vh( ) = maxa A{min{ϕ(s, a) θ C1 q d ϕ( , ) eΛ 1 h ϕ( , ) C2, H h + 1}+} . 
That is, for any V Vh, there exists a value function V Nh(ϵ) such that sups S |V (s) V (s)| < ϵ. Now by a union bound, we obtain with probability 1 δ, sup Vh+1 Nh+1(ϵ) τ=1 ϕ( sτ h, aτ h) Vh+1( sτ h+1)2 Ph(Vh+1)2( sτ h, aτ h) Σ 1 h 2 log λ + K λδ |Nh+1(ϵ)| which implies τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 Ph(e Vh+1)2( sτ h, aτ h) Σ 1 h 2 log λ + K λδ |Nh+1(ϵ)| + 4H2p choosing ϵ = d λ/K, applying Lemma B.3 of [Jin et al., 2021]13 to the covering number Nh+1(ϵ) w.r.t. Vh+1, we can further bound above by 2 log λ + K λδ 2d HK + 4H2 H4 d3 log λ + K Apply a union bound for h [H], we have with probability 1 δ, for all h [H], τ=1 ϕ( sτ h, aτ h) e Vh+1( sτ h+1)2 Ph(e Vh+1)2( sτ h, aτ h) Σ 1 h H4d3 log (λ + K)2Kd H2 (50) and similar to term (2), it holds that for all h, s, a [H] S A, ϕ(s, a) Σ 1 h q Σ 1 h 2 κK . (51) Combining (45), (46), (47), (48), (49), (50), (51) and the assumption that K max{ 4L2 H2d3κ, 32E2 d2κ2 , 16λ2 d2κ }, we obtain with probability 1 3δ for all h, s, a [H] S A, ϕ(s, a), eβh [0,(H h+1)2] Ph(e Vh+1)2(s, a) 12 κK log (λ + K)2Kd H2 Step 2: The second step is to show for all h, s, a [H] S A, with probability 1 3δ, ϕ(s, a), eθh [0,H h+1] Ph(e Vh+1)(s, a) 12 κK log (λ + K)2Kd H2 The proof of Step 2 is nearly identical to Step 1 except e V 2 h is replaced by e Vh. Step 3: The last step is to prove suph||eσ2 h σ2 e Vh+1|| 36 r κK log (λ+K)2Kd H2 λδ with high probability. Proof of Step 3. By (52), ϕ( , ), eθh [0,H h+1] 2 Ph(e Vh+1)(s, a) 2 = ϕ(s, a), eθh [0,H h+1] + Ph(e Vh+1)(s, a) ϕ(s, a), eθh [0,H h+1] Ph(e Vh+1)(s, a) 2H ϕ(s, a), eθh [0,H h+1] Ph(e Vh+1)(s, a) 24 κK log (λ + K)2Kd H2 Combining this with Step 1, we have with probability 1 6δ, h, s, a [H] S A, g Varh e Vh+1(s, a) Var Ph e Vh+1(s, a) 36 κK log (λ + K)2Kd H2 Finally, by the non-expansiveness of operator max{1, }, the proof is complete. 13Note that the conclusion in [Jin et al., 2021] hold here even though we have an extra constant C2. D.3.4 Validity of our pessimism Recall the definition bΛh = PK τ=1 ϕ (sτ h, aτ h) ϕ (sτ h, aτ h) /eσ2 h(sτ h, aτ h) + λ I and Λh = PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /σ2 e Vh+1(sτ h, aτ h)+λI. Then we have the following lemma to bound ϕ(s, a) bΛ 1 h ϕ(s, a) by q ϕ(s, a) Λ 1 h ϕ(s, a). Lemma D.11 (Private version of lemma C.3 in [Yin et al., 2022]). Denote the quantities C1 = max{2λ, 128 log(2d H/δ), 128H4 log(2d H/δ) κ2 } and C2 = e O(H12d3/κ5). Suppose the number of episode K satisfies K > max{C1, C2} and the condition in Lemma D.10, under the high probability events in Lemma D.2 and Lemma D.10, it holds that with probability 1 2δ, for all h, s, a [H] S A, q ϕ(s, a) bΛ 1 h ϕ(s, a) 2 q ϕ(s, a) Λ 1 h ϕ(s, a). Proof of Lemma D.11. By definition q ϕ(s, a) bΛ 1 h ϕ(s, a) = ϕ(s, a) bΛ 1 h . Then denote K bΛh, Λ h = 1 where Λh = PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /σ2 e Vh+1(sτ h, aτ h) + λI. 
Under the assumption of K, by the conclusion in Lemma D.10, we have bΛ h Λ h sup s,a ϕ(s, a)ϕ(s, a) eσ2 h(s, a) ϕ(s, a)ϕ(s, a) σ2 e Vh+1(s, a) eσ2 h(s, a) σ2 e Vh+1(s, a) eσ2 h(s, a) σ2 e Vh+1(s, a) eσ2 h(s, a) σ2 e Vh+1(s, a) κK log (λ + K)2Kd H2 Next by Lemma E.12 (with ϕ to be ϕ/σe Vh+1 and therefore C = 1) and a union bound, it holds with probability 1 δ, for all h [H], Λ h Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 e Vh+1(s, a)] + λ Therefore by Weyl s inequality and the assumption that K satisfies that K > max{2λ, 128 log(2d H/δ), 128H4 log(2d H/δ) κ2 }, the above inequality leads to Λ h =λmax(Λ h) λmax Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 e Vh+1(s, a)] + λ = Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 e Vh+1(s, a)] 2 + λ ϕ(s, a) 2 + λ λmin(Λ h) λmin Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 e Vh+1(s, a)] + λ λmin Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 e Vh+1(s, a)] 4 1/2 κ 2H2 . Hence with probability 1 δ, Λ h 2 and Λ 1 h = λ 1 min(Λ h) 2H2 κ . Similarly, one can show bΛ 1 h 2H2 κ with probability 1 δ using identical proof. Now apply Lemma E.11 and a union bound to bΛ h and Λ h, we obtain with probability 1 δ, for all h, s, a [H] S A, ϕ(s, a) bΛ 1 h r Λ 1 h Λ h bΛ 1 h bΛ h Λ h ϕ(s, a) Λ 1 h ϕ(s, a) Λ 1 h v u u t288H4 κK log (λ + K)2Kd H2 ϕ(s, a) Λ 1 h 2 ϕ(s, a) Λ 1 h where the third inequality uses (53) and the last inequality uses K > e O(H12d3/κ5). Note the conclusion can be derived directly by the above inequality multiplying 1/ K on both sides. In order to bound PK τ=1 xτητ bΛ 1 h , we apply the following Lemma D.12. Lemma D.12 (Lemma C.4 in [Yin et al., 2022]). Recall xτ = ϕ(sτ h,aτ h) eσh(sτ h,aτ h) and ητ = rτ h + e Vh+1 sτ h+1 Th e Vh+1 (sτ h, aτ h) /eσh(sτ h, aτ h). Denote ξ := sup V [0,H], s Ph(s,a), h [H] rh + V (s ) (Th V ) (s, a) Suppose K e O(H12d3/κ5)14, then with probability 1 δ, where e O absorbs constants and Polylog terms. Now we are ready to prove the following key lemma, which gives a high probability bound for (Th e Vh+1 e Th e Vh+1)(s, a) . Lemma D.13. Assume K > max{M1, M2, M3, M4}, for any 0 < λ < κ, suppose where ξ := sup V [0,H], s Ph(s,a), h [H] rh+V (s ) (Th V )(s,a) . Then with probability 1 δ, for all h, s, a [H] S A, (Th e Vh+1 e Th e Vh+1)(s, a) e O ϕ(s, a) eΛ 1 h ϕ(s, a) + D where eΛh = PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /eσ2 h(sτ h, aτ h) + λI + K2, d κ3/2 + H3 d κ3/2 + H3 and e O absorbs constants and Polylog terms. 14Note that here the assumption is stronger than the assumption in [Yin et al., 2022], therefore the conclusion of Lemma C.4 holds. Proof of Lemma D.13. The proof is by combining (40), (42), Lemma D.4, Lemma D.6, Lemma D.7, Lemma D.8, Lemma D.12 and a union bound. Remark D.14. Under the same assumption of Lemma D.13, because of Remark D.9 and Lemma D.11, we have with probability 1 δ, for all h, s, a [H] S A, (Th e Vh+1 e Th e Vh+1)(s, a) e O ϕ(s, a) eΛ 1 h ϕ(s, a) + D ϕ(s, a) bΛ 1 h ϕ(s, a) + 2D ϕ(s, a) Λ 1 h ϕ(s, a) + 2D Because D = e O H2L d κ3/2 + H3 d and e O absorbs constant, we will write as below for simplicity: (Th e Vh+1 e Th e Vh+1)(s, a) e O ϕ(s, a) Λ 1 h ϕ(s, a) + D D.3.5 Finalize the proof of the first part We are ready to prove the first part of Theorem 4.1. Theorem D.15 (First part of Theorem 4.1). Let K be the number of episodes. Suppose d > ξ, where ξ := sup V [0,H], s Ph(s,a), h [H] rh+V (s ) (Th V )(s,a) and K > max{M1, M2, M3, M4}. Then for any 0 < λ < κ, with probability 1 δ, for all policy π simultaneously, the output bπ of Algorithm 2 satisfies h=1 Eπ h ϕ( , ) Λ 1 h ϕ( , ) 1/2i! 
where Λh = PK τ=1 ϕ(sτ h,aτ h) ϕ(sτ h,aτ h) σ2 e Vh+1(sτ h,aτ h) + λId, D = e O H2L d κ3/2 + H3 d and e O absorbs constants and Polylog terms. Proof of Theorem D.15. Combining Lemma D.3 and Remark D.14, we have with probability 1 δ, for all policy π simultaneously, V π 1 (s) V bπ 1 (s) e O h=1 Eπ h ϕ( , ) Λ 1 h ϕ( , ) 1/2 s1 = s i! now the proof is complete by taking the initial distribution d1 on both sides. D.3.6 Finalize the proof of the second part To prove the second part of Theorem 4.1, we begin with a crude bound on suph V h e Vh . Lemma D.16 (Private version of Lemma C.8 in [Yin et al., 2022]). Suppose K max{M1, M2, M3, M4}, under the high probability event in Lemma D.13, with probability at least 1 δ, V h e Vh e O Proof of Lemma D.16. Step 1: The first step is to show with probability at least 1 δ, suph V h V bπ h e O H2 Indeed, combine Lemma D.3 and Lemma D.13, similar to the proof of Theorem D.15, we directly have with probability 1 δ, for all policy π simultaneously, and for all s S, h [H], V π h (s) V bπ h (s) e O t=h Eπ h ϕ( , ) Λ 1 t ϕ( , ) 1/2 sh = s i! Next, since K max 512 log( 2Hd δ ) κ2 , 4λ , by Lemma E.13 and a union bound over h [H], with probability 1 δ, sup s,a ϕ(s, a) Λ 1 h 2 K sup s,a ϕ(s, a) (Λp h) 1 2 λ 1 min(Λp h) 2H κK , h [H], where Λp h = Eµ,h[σ 2 e Vh+1(s, a)ϕ(s, a)ϕ(s, a) ] and λmin(Λp h) κ H2 . Lastly, taking π = π in (57) to obtain 0 V π h (s) V bπ h (s) e O t=h Eπ h ϕ( , ) Λ 1 t ϕ( , ) 1/2 sh = s i! This implies by using the condition K > max{ H2L2 κ2 , H4κ}, we finish the proof of Step 1. Step 2: The second step is to show with probability 1 δ, suph e Vh V bπ h e O H2 Indeed, applying Lemma E.7 with π = π = bπ, then with probability 1 δ, for all s, h e Vh(s) V bπ h (s) = t=h Ebπ h b Qh(sh, ah) Th e Vh+1 (sh, ah) sh = s i (e Th e Vh+1 Th e Vh+1)(s, a) + H Γh(s, a) ϕ(s, a) Λ 1 h ϕ(s, a) where the second inequality uses Lemma D.13, Remark D.14 and the last inequality holds due to the same reason as Step 1. Step 3: The proof of the lemma is complete by combining Step 1, Step 2, triangular inequality and a union bound. Then we can give a high probability bound of suph||σ2 e Vh+1 σ 2 h || . Lemma D.17 (Private version of Lemma C.10 in [Yin et al., 2022]). Recall σ2 e Vh+1 = max n 1, Var Ph e Vh+1 o and σ 2 h = max 1, Var Ph V h+1 . Suppose K max{M1, M2, M3, M4}, then with probability 1 δ, sup h ||σ2 e Vh+1 σ 2 h || e O Proof of Lemma D.17. By definition and the non-expansiveness of max{1, }, we have σ2 e Vh+1 σ 2 h Vare Vh+1 Var V h+1 Ph e V 2 h+1 V 2 h+1 + (Ph e Vh+1)2 (Ph V h+1)2 e V 2 h+1 V 2 h+1 + (Ph e Vh+1 + Ph V h+1)(Ph e Vh+1 Ph V h+1) 2H e Vh+1 V h+1 + 2H Ph e Vh+1 Ph V h+1 The second inequality is because of the definition of variance. The last inequality comes from Lemma D.16. We transfer q ϕ(s, a) Λ 1 h ϕ(s, a) to q ϕ(s, a) Λ 1 h ϕ(s, a) by the following Lemma D.18. Lemma D.18 (Private version of Lemma C.11 in [Yin et al., 2022]). Suppose K max{M1, M2, M3, M4}, then with probability 1 δ, ϕ(s, a) Λ 1 h ϕ(s, a) 2 q ϕ(s, a) Λ 1 h ϕ(s, a), h, s, a [H] S A, Proof of Lemma D.18. By definition q ϕ(s, a) Λ 1 h ϕ(s, a) = ϕ(s, a) Λ 1 h . Then denote K Λh, Λ h = 1 where Λ h = PK τ=1 ϕ(sτ h, aτ h)ϕ(sτ h, aτ h) /σ2 V h+1(sτ h, aτ h) + λI. 
Under the condition of K, by Lemma D.17, with probability 1 δ, for all h [H], Λ h Λ h sup s,a ϕ(s, a)ϕ(s, a) σ 2 h (s, a) ϕ(s, a)ϕ(s, a) σ2 e Vh+1(s, a) σ 2 h (s, a) σ2 e Vh+1(s, a) σ 2 h (s, a) σ2 e Vh+1(s, a) σ 2 h (s, a) σ2 e Vh+1(s, a) Next by Lemma E.12 (with ϕ to be ϕ/σV h+1 and C = 1), it holds with probability 1 δ, Λ h Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 V h+1(s, a)] + λ Therefore by Weyl s inequality and the condition K > max{2λ, 128 log 2d H δ , 128H4 log(2d H/δ) κ2 }, the above inequality implies Λ h =λmax(Λ h ) λmax Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 V h+1(s, a)] + λ Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 V h+1(s, a)] + λ ϕ(s, a) 2 + λ λmin(Λ h ) λmin Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 V h+1(s, a)] + λ λmin Eµ,h[ϕ(s, a)ϕ(s, a) /σ2 V h+1(s, a)] 4 1/2 κ 2H2 . Hence with probability 1 δ, Λ h 2 and Λ 1 h = λ 1 min(Λ h ) 2H2 κ . Similarly, we can show that Λ 1 h 2H2 κ holds with probability 1 δ by using identical proof. Now apply Lemma E.11 and a union bound to Λ h and Λ h, we obtain with probability 1 δ, for all h, s, a [H] S A, ϕ(s, a) Λ 1 h 1 + q Λ 1 h Λ h Λ 1 h Λ h Λ h ϕ(s, a) Λ 1 h κ Λ h Λ h # ϕ(s, a) Λ 1 h ϕ(s, a) Λ 1 h 2 ϕ(s, a) Λ 1 h where the third inequality uses (59) and the last inequality uses K e O(H14d/κ5). The conclusion can be derived directly by the above inequality multiplying 1/ K on both sides. Finally, the second part of Theorem 4.1 can be proven by combining Theorem D.15 (with π = π ) and Lemma D.18. D.4 Put everything toghther Combining Lemma D.1, Theorem D.15, and the discussion above, the proof of Theorem 4.1 is complete. E Assisting technical lemmas Lemma E.1 (Multiplicative Chernoff bound [Chernoff et al., 1952]). Let X be a Binomial random variable with parameter p, n. For any 1 θ > 0, we have that P[X < (1 θ)pn] < e θ2pn 2 , and P[X (1 + θ)pn] < e θ2pn Lemma E.2 (Hoeffding s Inequality [Sridharan, 2002]). Let x1, ..., xn be independent bounded random variables such that E[xi] = 0 and |xi| ξi with probability 1. Then for any ϵ > 0 we have e 2n2ϵ2 Pn i=1 ξ2 i . Lemma E.3 (Bernstein s Inequality). Let x1, ..., xn be independent bounded random variables such that E[xi] = 0 and |xi| ξ with probability 1. Let σ2 = 1 n Pn i=1 Var[xi], then with probability 1 δ we have 1 n 2σ2 log(1/δ) 3n log(1/δ). Lemma E.4 (Empirical Bernstein s Inequality [Maurer and Pontil, 2009]). Let x1, ..., xn be i.i.d random variables such that |xi| ξ with probability 1. Let x = 1 n Pn i=1 xi and b Vn = 1 n Pn i=1(xi x)2, then with probability 1 δ we have 1 n i=1 xi E[x] 2b Vn log(2/δ) 3n log(2/δ). Lemma E.5 (Lemma I.8 in [Yin and Wang, 2021b]). Let n 2 and V RS be any function with ||V || H, P be any S-dimensional distribution and b P be its empirical version using n samples. Then with probability 1 δ, Var b P (V ) n Var P (V ) Lemma E.6 (Claim 2 in [Vietri et al., 2020]). Let y R be any positive real number. Then for all x R with x 2y, it holds that 1 x y 1 E.1 Extended Value Difference Lemma E.7 (Extended Value Difference (Section B.1 in [Cai et al., 2020])). Let π = {πh}H h=1 and π = {π h}H h=1 be two arbitrary policies and let { b Qh}H h=1 be any given Q-functions. Then define b Vh(s) := b Qh(s, ), πh( | s) for all s S. Then for all s S, b V1(s) V π 1 (s) = h=1 Eπ h b Qh (sh, ) , πh ( | sh) π h ( | sh) | s1 = s i h=1 Eπ h b Qh (sh, ah) Th b Vh+1 (sh, ah) | s1 = s i (60) where (Th V )( , ) := rh( , ) + (Ph V )( , ) for any V RS. Lemma E.8 (Lemma I.10 in [Yin and Wang, 2021b]). 
Let bπ = {bπh}H h=1 and b Qh( , ) be the arbitrary policy and Q-function and also b Vh(s) = b Qh(s, ), bπh( |s) s S, and ξh(s, a) = (Th b Vh+1)(s, a) b Qh(s, a) element-wisely. Then for any arbitrary π, we have V π 1 (s) V bπ 1 (s) = h=1 Eπ [ξh(sh, ah) | s1 = s] h=1 Ebπ [ξh(sh, ah) | s1 = s] h=1 Eπ h b Qh (sh, ) , πh ( |sh) bπh ( |sh) | s1 = s i where the expectation are taken over sh, ah. E.2 Assisting lemmas for linear MDP setting Lemma E.9 (Hoeffding inequality for self-normalized martingales [Abbasi-Yadkori et al., 2011]). Let {ηt} t=1 be a real-valued stochastic process. Let {Ft} t=0 be a filtration, such that ηt is Ftmeasurable. Assume ηt also satisfies ηt given Ft 1 is zero-mean and R-subgaussian, i.e. λ R, E eληt | Ft 1 eλ2R2/2. Let {xt} t=1 be an Rd-valued stochastic process where xt is Ft 1 measurable and xt L. Let Λt = λId + Pt s=1 xsx s . Then for any δ > 0, with probability 1 δ, for all t > 0, 2 log λ + t L Lemma E.10 (Bernstein inequality for self-normalized martingales [Zhou et al., 2021]). Let {ηt} t=1 be a real-valued stochastic process. Let {Ft} t=0 be a filtration, such that ηt is Ft-measurable. Assume ηt also satisfies |ηt| R, E [ηt | Ft 1] = 0, E η2 t | Ft 1 σ2. Let {xt} t=1 be an Rd-valued stochastic process where xt is Ft 1 measurable and xt L. Let Λt = λId + Pt s=1 xsx s . Then for any δ > 0, with probability 1 δ, for all t > 0, d log 1 + t L2 + 4R log 4t2 Lemma E.11 (Lemma H.4 in [Yin et al., 2022]). Let Λ1 and Λ2 Rd d be two positive semi-definite matrices. Then: Λ 1 1 Λ 1 2 + Λ 1 1 Λ 1 2 Λ1 Λ2 ϕ Λ 1 1 1 + q Λ 1 2 Λ2 Λ 1 1 Λ1 Λ2 ϕ Λ 1 2 . for all ϕ Rd. Lemma E.12 (Lemma H.4 in [Min et al., 2021]). Let ϕ : S A Rd satisfies ϕ(s, a) C for all s, a S A. For any K > 0, λ > 0, define GK = PK k=1 ϕ(sk, ak)ϕ(sk, ak) + λId where (sk, ak) s are i.i.d samples from some distribution ν. Then with probability 1 δ, GK Lemma E.13 (Lemma H.5 in [Min et al., 2021]). Let ϕ : S A Rd be a bounded function s.t. ϕ 2 C. Define GK = PK k=1 ϕ(sk, ak)ϕ(sk, ak) + λId where (sk, ak) s are i.i.d samples from some distribution ν. Let G = Eν[ϕ(s, a)ϕ(s, a) ]. Then for any δ (0, 1), if K satisfies K max 512C4 G 1 2 log 2d Then with probability at least 1 δ, it holds simultaneously for all u Rd that Lemma E.14 (Lemma H.9 in [Yin et al., 2022]). For a linear MDP, for any 0 V ( ) H, there exists a wh Rd s.t. Th V = ϕ, wh and wh 2 2H d for all h [H]. Here Th(V )(s, a) = rh(x, a) + (Ph V )(s, a). Similarly, for any π, there exists wπ h Rd, such that Qπ h = ϕ, wπ h with wπ h 2 2(H h + 1) E.3 Assisting lemmas for differential privacy Lemma E.15 (Converting z CDP to DP [Bun and Steinke, 2016]). If M satisfies ρ-z CDP then M satisfies (ρ + 2 p ρ log(1/δ), δ)-DP. Lemma E.16 (z CDP Composition [Bun and Steinke, 2016]). Let M : Un Y and M : Un Z be randomized mechanisms. Suppose that M satisfies ρ-z CDP and M satisfies ρ -z CDP. Define M : Un Y Z by M (U) = (M(U), M (U)). Then M satisfies (ρ + ρ )-z CDP. Lemma E.17 (Adaptive composition and Post processing of z CDP [Bun and Steinke, 2016]). Let M : X n Y and M : X n Y Z. Suppose M satisfies ρ-z CDP and M satisfies ρ -z CDP (as a function of its first argument). Define M : X n Z by M (x) = M (x, M(x)). Then M satisfies (ρ + ρ )-z CDP. Definition E.18 (ℓ1 sensitivity). Define the ℓ1 sensitivity of a function f : NX 7 Rd as 1(f) = sup neighboring U,U f(U) f(U ) 1. Definition E.19 (Laplace Mechanism [Dwork et al., 2014]). Given any function f : NX 7 Rd, the Laplace mechanism is defined as: ML(x, f, ϵ) = f(x) + (Y1, , Yd), where Yi are i.i.d. 
random variables drawn from $\mathrm{Lap}(\Delta_1(f)/\epsilon)$.

Lemma E.20 (Privacy guarantee of the Laplace Mechanism [Dwork et al., 2014]). The Laplace mechanism preserves $(\epsilon, 0)$-differential privacy. For simplicity, we say $\epsilon$-DP.

F Details for the Evaluation part

In the Evaluation part, we use a synthetic linear MDP instance similar to [Min et al., 2021, Yin et al., 2022], with some modifications for our evaluation task. The linear MDP example consists of $|\mathcal{S}| = 2$ states and $|\mathcal{A}| = 100$ actions, with feature dimension $d = 10$. We denote $\mathcal{S} = \{0, 1\}$ and $\mathcal{A} = \{0, 1, \ldots, 99\}$. For each action $a \in \{0, 1, \ldots, 99\}$, we obtain a vector $\mathbf{a} \in \mathbb{R}^8$ via binary encoding; more specifically, each coordinate of $\mathbf{a}$ is either 0 or 1. First, we define the indicator function
$$\delta(s, a) = \begin{cases} 1 & \text{if } \mathbb{1}\{s = 0\} = \mathbb{1}\{a = 0\}, \\ 0 & \text{otherwise,} \end{cases}$$
then our non-stationary linear MDP example is characterized by the following parameters. The feature map $\phi$ is
$$\phi(s, a) = \big(\mathbf{a}^\top, \delta(s, a), 1 - \delta(s, a)\big)^\top \in \mathbb{R}^{10}.$$
The unknown measure $\nu_h$ is
$$\nu_h(0) = (0, \ldots, 0, \alpha_{h,1}, \alpha_{h,2})^\top, \qquad \nu_h(1) = (0, \ldots, 0, 1 - \alpha_{h,1}, 1 - \alpha_{h,2})^\top,$$
where $\{\alpha_{h,1}, \alpha_{h,2}\}_{h \in [H]}$ is a sequence of random values sampled uniformly from $[0, 1]$. The unknown vector $\theta_h$ is
$$\theta_h = \big(r_h/8,\, 0,\, r_h/8,\, 1/2 - r_h/2,\, r_h/8,\, 0,\, r_h/8,\, 0,\, r_h/2,\, 1/2 - r_h/2\big)^\top \in \mathbb{R}^{10},$$
where $r_h$ is also sampled uniformly from $[0, 1]$. Therefore, the transition kernel follows $P_h(s' \mid s, a) = \langle \phi(s, a), \nu_h(s') \rangle$ and the expected reward function is $r_h(s, a) = \langle \phi(s, a), \theta_h \rangle$. Finally, the behavior policy always chooses action $a = 0$ with probability $p$ and each other action uniformly with probability $(1 - p)/99$; here we choose $p = 0.6$. The initial distribution is uniform over $\mathcal{S} = \{0, 1\}$.
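As a concrete companion to this description, below is a minimal NumPy sketch of the synthetic environment and behavior policy. It is a sketch only: the horizon H = 20, the random seed, and all function names are our own choices, and the signs in θ_h follow the reconstruction above rather than the authors' code.

```python
import numpy as np

S, A, H, d = 2, 100, 20, 10        # H = 20 is an assumed horizon, not specified here
rng = np.random.default_rng(0)

def encode_action(a):
    """8-bit binary encoding of an action a in {0, ..., 99}."""
    return np.array([(a >> i) & 1 for i in range(8)], dtype=float)

def delta(s, a):
    return 1.0 if (s == 0) == (a == 0) else 0.0

def phi(s, a):
    return np.concatenate([encode_action(a), [delta(s, a), 1.0 - delta(s, a)]])

# Unknown MDP parameters: alpha_{h,1}, alpha_{h,2}, r_h ~ Uniform[0, 1].
alpha = rng.uniform(size=(H, 2))
r = rng.uniform(size=H)
nu = np.zeros((H, S, d))
nu[:, 0, 8:] = alpha                # nu_h(0) = (0, ..., 0, alpha_{h,1}, alpha_{h,2})
nu[:, 1, 8:] = 1.0 - alpha          # nu_h(1) = (0, ..., 0, 1-alpha_{h,1}, 1-alpha_{h,2})
theta = np.stack([[rh/8, 0, rh/8, 0.5 - rh/2, rh/8, 0, rh/8, 0, rh/2, 0.5 - rh/2]
                  for rh in r])

def step(h, s, a):
    """Sample reward and next state from the linear MDP at step h."""
    f = phi(s, a)
    reward = f @ theta[h]
    p_next = np.array([f @ nu[h, s_next] for s_next in range(S)])
    s_next = rng.choice(S, p=p_next / p_next.sum())
    return reward, s_next

def behavior_action(p=0.6):
    """Choose a = 0 w.p. p, otherwise uniform over the remaining 99 actions."""
    return 0 if rng.random() < p else int(rng.integers(1, A))

def collect_trajectory():
    s = int(rng.integers(S))        # uniform initial distribution over {0, 1}
    traj = []
    for h in range(H):
        a = behavior_action()
        reward, s_next = step(h, s, a)
        traj.append((s, a, reward, s_next))
        s = s_next
    return traj
```

Note that $\phi(s,a)^\top \nu_h(0) + \phi(s,a)^\top \nu_h(1) = \delta(s,a) + (1 - \delta(s,a)) = 1$, so the induced transition probabilities are valid by construction; the explicit renormalization in `step` only guards against floating-point drift.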