Journal of Machine Learning Research 25 (2024) 1-77. Submitted 4/23; Revised 8/24; Published 10/24.

Sample Complexity of Variance-Reduced Distributionally Robust Q-Learning

Shengbo Wang (shengbo.wang@stanford.edu), Department of Management Science and Engineering, Stanford University
Nian Si (niansi@ust.hk), Department of Industrial Engineering and Decision Analytics, The Hong Kong University of Science and Technology
Jose Blanchet (jose.blanchet@stanford.edu), Department of Management Science and Engineering, Stanford University
Zhengyuan Zhou (zz26@stern.nyu.edu), Stern School of Business, New York University

Editor: Zhihua Zhang

Abstract

Dynamic decision-making under distributional shifts is of fundamental interest in theory and applications of reinforcement learning: the distribution of the environment in which the data is collected can differ from that of the environment in which the model is deployed. This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts. These algorithms are designed to efficiently approximate the q-function of an infinite-horizon $\gamma$-discounted robust Markov decision process with a Kullback-Leibler ambiguity set to an entry-wise $\epsilon$-degree of precision. Further, the variance-reduced distributionally robust Q-learning combines synchronous Q-learning with variance-reduction techniques to enhance its performance. Consequently, we establish that it attains a minimax sample complexity upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$, where $S$ and $A$ denote the state and action spaces. This is the first complexity result that is independent of the ambiguity size $\delta$, thereby providing new complexity-theoretic insights. Additionally, a series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.

Keywords: sample complexity, reinforcement learning, distributional robustness, Q-learning, stochastic approximation.

(c) 2024 Shengbo Wang, Nian Si, Jose Blanchet, Zhengyuan Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0526.html.

1. Introduction

Reinforcement learning (RL) (Sutton and Barto, 2018) focuses on how agents can learn to make optimal decisions in uncertain and dynamic environments. It is based on the principle of trial-and-error learning, where the agent interacts with the environment, receives rewards or penalties for its actions, and adjusts its behavior to maximize the expected long-term reward.

A significant obstacle in RL is the limited interaction between the agent and the environment, often due to factors such as data-collection costs or safety constraints. To overcome this, practitioners often rely on historical datasets or simulation environments to train the agent. However, this approach can suffer from distributional shifts (Quinonero-Candela et al., 2008) between the real-world environment and the data-collection/simulation environment, potentially leading to suboptimal learned policies when deployed in the actual environment. It has also been observed in RL environments that an agent trained this way can be vulnerable to adversarial attacks (Lin et al., 2017; Pan et al., 2019).
To tackle these challenges, distributionally robust reinforcement learning (DR-RL) (Zhou et al., 2021; Yang et al., 2021; Liu et al., 2022; Shi and Chi, 2022; Wang et al., 2023b) has emerged as a promising approach. DR-RL seeks to learn policies that are robust to distributional shifts in the environment by explicitly considering a family of possible distributions that the agent may encounter during deployment. This approach allows the agent to learn a policy that performs well across a range of environments, rather than just the one it was trained on. These benefits of distributionally robust policies motivate the exploration of a critical question:

Can we construct efficient reinforcement learning algorithms that achieve the desired robustness properties while also providing provable guarantees on their sample complexity?

A growing body of literature aims to understand the sample complexities of distributionally robust reinforcement learning. Specifically, we are interested in a robust tabular Markov decision process (MDP) with state space S and action space A, in the discounted infinite-horizon setting with discount factor $\gamma$. To account for uncertainty, we use an ambiguity set based on the Kullback-Leibler (KL) divergence with ambiguity size $\delta$, which is arguably the most natural and challenging divergence in the distributionally robust literature. Previous research has mainly focused on the model-based approach, where a specific model of the environment is estimated and value iteration (VI) is run on the estimated model. Table 1 shows the worst-case sample complexity of model-based distributionally robust RL, with Shi and Chi (2022) proposing a method with state-of-the-art sample complexity in terms of $|S|, |A|, (1-\gamma)^{-1}, \epsilon$.

Algorithm  | Sample Complexity | Origin
DRVI       | $O(|S|^2|A|\, e^{O((1-\gamma)^{-1})} (1-\gamma)^{-4}\epsilon^{-2}\delta^{-2})$ | Zhou et al. (2021)
REVI/DRVI  | $O(|S|^2|A|\, e^{O((1-\gamma)^{-1})} (1-\gamma)^{-4}\epsilon^{-2}\delta^{-2})$ | Panaganti and Kalathil (2021)
DRVI       | $O(|S|^2|A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-2}\delta^{-2})$ | Yang et al. (2021)
DRVI-LCB   | $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-1}\delta^{-2})$ | Shi and Chi (2022)

Table 1: Summary of sample complexity upper bounds for finding an $\epsilon$-optimal robust policy in model-based distributionally robust RL ($p_\wedge$ is the minimal support probability of the nominal MDP; see Definition 6).

1.1 Our Motivation

The emerging line of work mentioned above reflects the growing interest and fruitful results in the pursuit of sample-efficient distributionally robust reinforcement learning. At the same time, a closer scrutiny of the results suggests that two fundamental aspects of the problem are inadequately addressed. For one thing, the complexity bounds of existing results exhibit an $O(\delta^{-2})$ dependence as $\delta \to 0$. This increase in the complexity bounds appears to reflect an increased need for learning the training environment as the training and adversarial environments become more alike. At the surface level, this makes sense: in the extreme case where $\delta \to \infty$, (assuming known support of the distributions) no samples are needed to find an optimal distributionally robust policy. Nevertheless, such bounds fail to align with the continuity property of the robust MDP: the robust value function should converge to the non-robust optimal cumulative reward as $\delta \to 0$. Therefore, for all sufficiently small $\delta$ that may depend on the training environment and $\epsilon$, the robust value function can be approximated by the output of a classical RL algorithm. Specifically, we expect an algorithm and analysis with an $O(1)$ dependence as $\delta \to 0$.
This is presently absent in the literature. Additionally, with the exception of Wang et al. (2023b) (discussed in more detail in the next subsection), all existing distributionally robust policy learning algorithms that have finite-sample guarantees (such as the ones mentioned above (Zhou et al., 2021; Panaganti and Kalathil, 2021; Yang et al., 2021; Shi and Chi, 2022)) are model-based, i.e., they estimate the underlying MDP first before extracting a policy from it. Although model-based methods are often more sample-efficient and easier to analyze, their drawbacks are also well understood (Sutton and Barto, 2018; François-Lavet et al., 2018): they are computationally intensive, they require more memory to store MDP models, and they often do not generalize well to non-tabular RL settings. These issues limit the practical applicability of model-based algorithms, in contrast to model-free algorithms that learn to select actions without first learning an MDP model. Such methods are often more computationally efficient, have less storage overhead, and generalize better to RL with function approximation. In particular, Q-learning (Watkins and Dayan, 1992), as the prototypical model-free learning algorithm, has been widely studied theoretically and deployed in practical applications. However, Q-learning is not robust (as demonstrated in our simulations), and the policy learned by Q-learning in one environment can perform poorly in another under a worst-case shift (with bounded magnitude). As such, the above discussion naturally motivates the following research question:

Can we design a variant of Q-learning that is distributionally robust, where the sample complexity has the right scaling with $\delta$?

1.2 Our Contributions

We answer the above question affirmatively and contribute to the existing literature on the worst-case sample complexity theory of model-free distributionally robust RL. We propose two distributionally robust variants of the Q-learning algorithm (Watkins and Dayan, 1992), namely DR Q-learning (Algorithm 1) and variance-reduced DR Q-learning (Algorithm 2), which effectively solve the DR-RL problem under the KL ambiguity set. The proposed algorithms operate efficiently under the assumption of limited power of the adversary (as per Assumption 1), which is realistic in many real-world applications. We prove that both algorithms have near-optimal worst-case sample complexity guarantees in this regime. Additionally, the variance-reduced version exhibits superior complexity dependence on the effective horizon $(1-\gamma)^{-1}$, as shown in Table 2. To the best of our knowledge, both algorithms and their worst-case sample complexity upper bounds represent state-of-the-art results in model-free distributionally robust RL. Moreover, our sample complexity upper bound for variance-reduced DR Q-learning matches the best-known upper bound for this DR-RL problem in Shi and Chi (2022) in terms of the $\epsilon^{-2}$ and $(1-\gamma)^{-4}$ dependence.

Algorithm                       | Sample Complexity | Origin
MLMC DR Q-learning              | $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2} p_\wedge^{-6}\delta^{-4})$ | Wang et al. (2023b)
DR Q-learning                   | $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2} p_\wedge^{-3})$ | Theorem 9
Variance-reduced DR Q-learning  | $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2} p_\wedge^{-3})$ | Theorem 13

Table 2: Summary of sample complexity upper bounds for finding an $\epsilon$-optimal robust policy in model-free distributionally robust RL ($p_\wedge$ is the minimal support probability of the nominal MDP; see Definition 6).

The DR Q-learning Algorithm 1 is a direct extension of mini-batch Q-learning.
Compared to the MLMC DR Q-learning method proposed by Wang et al. (2023b), Algorithm 1 is easier to implement in real-world applications. Additionally, this approach allows for the design of a more sophisticated variant, the variance-reduced DR Q-learning, which provides a provable enhancement of the worst-case sample complexity guarantee of DR Q-learning. To achieve this improvement, we leverage Wainwright's variance reduction technique and algorithm structure (Wainwright, 2019a), adapting it to the DR-RL context and redesigning the variance reduction scheme accordingly.

Both DR Q-learning and its variance-reduced version use a stochastic approximation (SA) step to iteratively update the estimator of the optimal DR q-function towards the fixed point of the population DR Bellman operator. However, both algorithms involve a bias that must be controlled at the algorithmic and iterative-update levels. Our contribution to the literature lies in the near-optimal analysis of the biased SA resulting from DR Q-learning and its variance-reduced version. This analysis also generalizes to settings where the biased stochastic version of the contraction mapping is a monotonic contraction.

We highlight that these are the first algorithmic complexity results showing that the worst-case complexity dependence on the uncertainty set size $\delta$ is $O(1)$ as $\delta \to 0$ for the DR-RL problem with a KL ambiguity set. This resolves the issue of worst-case complexity bounds blowing up as $\delta$ approaches 0, a problem present in all previous works, including both model-based and model-free approaches (Yang et al., 2021; Panaganti and Kalathil, 2021; Shi and Chi, 2022; Wang et al., 2023b). The significance of this characteristic lies in its theoretical illustration that as the adversary's power $\delta$ approaches 0, not only does the solution to the DR-RL problem converge to that of the non-robust version, but so does the sample complexity required to solve it. This sheds light on the connection between robust and non-robust RL problems, indicating that in more general settings and real-world applications, DR-RL problems with function approximation may be efficiently addressed by utilizing variants of the corresponding approaches for non-robust RL problems.

1.3 Literature Review

This section is dedicated to reviewing the literature that is relevant to our work. The literature on RL and MDPs is extensive. One major line of research focuses on developing algorithms that can efficiently learn policies to maximize cumulative discounted rewards. When discussing RL and MDP problems, we will concentrate on this infinite-horizon discounted reward formulation.

Minimax Sample Complexity of Tabular RL: Recent years have seen significant developments in the worst-case sample complexity theory of tabular RL. Two principles, namely model-based and model-free, have motivated distinct algorithmic designs. In the model-based approach, the controller aims to gather a dataset so as to construct an empirical model of the underlying MDP and solve it using variations of the dynamic programming principle. A line of research (Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2020; Li et al., 2022) has proposed model-based algorithms and proven optimal upper bounds for achieving an $\epsilon$-optimal policy, with a matching lower bound $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ proven in Azar et al. (2013). In contrast, the model-free approach involves maintaining only lower-dimensional statistics of the transition data, which are iteratively updated.
As one of the most well-known model-free algorithms, Q-learning has had its sample complexity extensively studied (Even-Dar et al., 2003; Wainwright, 2019b; Chen et al., 2020; Li et al., 2021). However, Li et al. (2021) have shown that Q-learning has a minimax sample complexity of $\Theta(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$, which does not match the lower bound $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$. Nevertheless, variance-reduced variants of Q-learning, such as the one proposed in Wainwright (2019a), achieve the aforementioned sample complexity lower bound. Other algorithmic techniques such as Polyak-Ruppert averaging (Li et al., 2023) have also been shown to result in optimal sample complexity.

Finite-Time Analysis of SA: The classical theory of asymptotic convergence for SA has been extensively studied, as seen in Kushner and Yin (2013). Recent progress in the minimax and instance-dependent sample complexity theory of Q-learning and its variants has been aided by advances in the finite-time analysis of SA. Traditional RL research focuses on settings where the random operator is unbiased. Wainwright (2019b) demonstrated a sample-path bound for the SA recursion, which enables the use of variance reduction techniques to achieve optimal learning rates. In contrast, Chen et al. (2020, 2022) provided finite-sample guarantees for SA under only a second-moment bound on the martingale difference noise sequence. Additionally, research has been conducted on the non-asymptotic analysis of SA procedures in the presence of bias, as documented in Karimi et al. (2019) and Wang (2022).

Robust MDP and RL: Our work draws upon the theoretical framework of classical max-min control and robust MDPs, as established in previous works (González-Trejo et al., 2002; Iyengar, 2005; Nilim and El Ghaoui, 2005; Wiesemann et al., 2013; Xu and Mannor, 2010; Shapiro, 2022; Wang et al., 2024). These works have established the concept of distributional robustness in dynamic decision making. In particular, González-Trejo et al. (2002), Iyengar (2005), and Nilim and El Ghaoui (2005) established the distributionally robust dynamic programming principles for SA-rectangular adversaries under symmetric information structures, while Wiesemann et al. (2013) and Wang et al. (2024) study asymmetric settings, leading to the same DR Bellman equation. Recent research has shown great interest in learning DR policies from data (Si et al., 2020; Zhou et al., 2021; Yang et al., 2021; Liu et al., 2022; Shi and Chi, 2022; Wang et al., 2023b; Yang et al., 2023). For instance, Si et al. (2020) studied the contextual bandit setting, while Zhou et al. (2021), Panaganti and Kalathil (2021), Yang et al. (2021), and Shi and Chi (2022) focused on the model-based tabular RL setting. On the other hand, Liu et al. (2022), Wang et al. (2023b), and Yang et al. (2023) tackled the DR-RL problem using a model-free approach.¹ Before our work, the best worst-case sample complexity upper bound for DR-RL under the KL ambiguity set was established for the model-based DRVI-LCB algorithm, as proposed and analyzed by Shi and Chi (2022). Their analysis showed that the worst-case sample complexity has an upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2}\delta^{-2}p_\wedge^{-1})$.

2. Distributionally Robust Reinforcement Learning

2.1 Classical Tabular Reinforcement Learning

Let $M_0 = (S, A, R, P_0, N_0, \gamma)$ be a Markov decision process (MDP), where $S$, $A$, and $R \subseteq \mathbb{R}_+$ are finite state, action, and reward spaces,² respectively. Let $\mathcal{P}(U)$, where $U = S, A, R$, denote the set of probability measures on the power set $2^U$.
Then $P_0 = \{p_{s,a} \in \mathcal{P}(S) : s \in S, a \in A\}$ and $N_0 = \{\nu_{s,a} \in \mathcal{P}(R) : s \in S, a \in A\}$ are the sets of transition and reward distributions, respectively. $\gamma \in (0, 1)$ is the discount factor. Define $r_{\max} = \max\{r \in R\}$ as the maximum reward. At each time $t$, given that the state process is at $S_t$ and the decision maker takes action $A_t$, the subsequent state is determined by the conditional distribution $S_{t+1} \sim p_{S_t, A_t}$. Then, a randomized reward $R_t \sim \nu_{S_t, A_t}$ is collected, independent of the history. Let $\Pi$ be the history-dependent policy class (see Wang et al. (2024) for a rigorous construction). For $\pi \in \Pi$, the value function $v^\pi(s)$ is defined as

$$v^\pi(s) := E^\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s\Big].$$

The optimal value function is $v^*(s) := \max_{\pi \in \Pi} v^\pi(s)$, $s \in S$. It is well known that the optimal value function is the unique solution of the following Bellman equation:

$$v^*(s) = \max_{a \in A}\big\{ E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}[v^*(S)] \big\},$$

where the expectations are taken over $R \sim \nu_{s,a}$ and $S \sim p_{s,a}$, respectively.

1. Liu et al. (2022)'s algorithm is infeasible: it requires an infinite number of samples in expectation for each iteration, and only asymptotic convergence is established with an infinite number of iterations.
2. We assume a finite reward space for simplicity. However, our results can be extended to continuous reward spaces by imposing a minimum density assumption, as described in Si et al. (2020).

An important implication of the Bellman equation is that it suffices to optimize within the stationary Markovian deterministic policy class. We define the optimal q-function as $q^*(s, a) := E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}[v^*(S)]$. It is well known that $q^*$ satisfies its own Bellman equation

$$q^*(s, a) = E_{\nu_{s,a}}[R] + \gamma E_{p_{s,a}}\Big[\max_{b \in A} q^*(S, b)\Big].$$

An optimal policy can be constructed as $\pi^*(s) = \arg\max_{a \in A} q^*(s, a)$. Therefore, policy learning in RL environments can be achieved if we can learn a good estimate of $q^*$.

2.2 Kullback-Leibler Divergence Constrained DR-RL

We consider a DR-RL setting where the adversary is constrained to perturb both transition probabilities and rewards within a KL divergence ball of radius $\delta$. Specifically, for a probability measure $Q$ that is absolutely continuous w.r.t. $P$ on some measurable space $(\Omega, \mathcal{F})$, denoted by $Q \ll P$, define

$$D_{\mathrm{KL}}(Q\|P) := \int_\Omega \frac{dQ}{dP}(\omega)\,\log\frac{dQ}{dP}(\omega)\, P(d\omega), \qquad (2.1)$$

where $dQ/dP$ is the Radon-Nikodym derivative. For each $(s, a) \in S \times A$ and $\delta > 0$, we define the KL ambiguity sets centered at $p_{s,a} \in P_0$ and $\nu_{s,a} \in N_0$ of radius $\delta$ by

$$\mathcal{P}_{s,a}(\delta) := \{p : D_{\mathrm{KL}}(p\|p_{s,a}) \le \delta\}, \qquad \mathcal{N}_{s,a}(\delta) := \{\nu : D_{\mathrm{KL}}(\nu\|\nu_{s,a}) \le \delta\}. \qquad (2.2)$$

These ambiguity sets represent the possible distributional shifts from the reference model $P_0, N_0$. In particular, the parameter $\delta > 0$ controls the size of the ambiguity sets, quantifying the power of the adversary. With these definitions in mind, we define the DR optimal value function as the solution to a fixed-point equation, a.k.a. the DR Bellman equation, which serves as the learning objective of this paper.

Definition 1 The DR Bellman operator $\mathcal{B}_\delta$ for the value function is defined as the mapping

$$\mathcal{B}_\delta(v)(s) := \max_{a \in A} \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v(S)] \big). \qquad (2.3)$$

Define the DR optimal value function $v^*_\delta$ as the solution of the DR Bellman equation:

$$v^*_\delta = \mathcal{B}_\delta(v^*_\delta). \qquad (2.4)$$

Moving forward, we will suppress the explicit dependence on $\delta$. The DR Bellman equation has a unique solution as the fixed point of $\mathcal{B}$, which is a consequence of $\mathcal{B}$ being a contraction operator. Furthermore, the solution is equal to the max-min control optimal value of an SA-rectangular distributionally robust MDP (DRMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005; Wiesemann et al., 2013).
Specifically, this max-min optimal value is given by

$$u^*(s) := \sup_{\pi \in \Pi}\, \inf_{\kappa \in K}\, E^{\pi,\kappa}\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, S_0 = s\Big], \qquad (2.5)$$

where $\Pi$ is the history-dependent policy class, and the adversary chooses a policy $\kappa$ from an adversarial ambiguity set $K$ that is induced by the KL ambiguity sets in (2.2). Intuitively, this value represents the optimal reward in the following adversarial environment: when the controller selects a policy $\pi$, an adversary observes this policy and then chooses a counter-policy that determines the sequence of reward and transition distributions. The adversary's choice is constrained such that the reward and transition distributions induced by the counter-policy lie within the ambiguity sets (2.2) of radius $\delta$. The decisions made by both the controller and the adversary uniquely specify the law of the state-action-reward process, thereby determining the value of the policy pair $(\pi, \kappa)$.

The equivalence of the max-min control optimal value (2.5) and the solution to the DR Bellman equation (2.4) shows the optimality of stationary deterministic Markov control policies and stationary Markovian adversarial distribution choices. This equivalence, known as the dynamic programming principle (DPP), is explored in detail in Wang et al. (2024), where the adversary and controller can have asymmetric information structures; we refer the interested reader to that paper. We note that Wang et al. (2024) considers a setting where the reward is not randomized, i.e., $\mathcal{N}_{s,a} = \{\delta_{r(s,a)}\}$ for some reward function $r : S \times A \to [0, 1]$. However, it is straightforward to generalize the DPP to include randomized rewards in the SA-rectangular setting.

2.3 Dual and q-Function Formulations

The right-hand side of (2.3) can be challenging to work with because the measure underlying the expectations is not directly accessible. To address this, we use strong duality to reveal the dependence of the value on the reference transition and reward distributions, $P_0$ and $N_0$. Specifically, we consider the dual representation:

Lemma 2 (Hu and Hong (2013), Theorem 1) Let $X$ be a random variable and $\mu_0$ be a probability measure on $(\Omega, \mathcal{F})$ s.t. $X$ has a finite moment generating function in a neighborhood of zero. Then for any $\delta > 0$,

$$\inf_{\mu : D_{\mathrm{KL}}(\mu\|\mu_0) \le \delta} E_\mu[X] = \sup_{\alpha \ge 0}\Big\{ -\alpha \log E_{\mu_0}\big[e^{-X/\alpha}\big] - \alpha\delta \Big\}.$$

Since the rewards and values are bounded, directly applying Lemma 2 to the r.h.s. of (2.4), the DR value function $v^*$ in fact satisfies the following dual form of the DR Bellman equation:

$$v^*(s) = \max_{a \in A}\Big( \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a}}\big[e^{-v^*(S)/\beta}\big] - \beta\delta \big\} \Big).$$

Similar to the traditional RL policy learning approach, we utilize the optimal DR state-action value function, also known as the q-function, to address the DR-RL problem. The q-function assigns real numbers to pairs of states and actions, and can be represented as a matrix $q \in \mathbb{R}^{S \times A}$. From now on, we will assume this representation. To simplify notation, let us define

$$v(q)(s) := \max_{b \in A} q(s, b), \qquad (2.6)$$

which is the value function induced by the q-function $q(\cdot, \cdot)$. We proceed to rigorously define the optimal q-function and its Bellman equation.

Definition 3 The optimal DR q-function is defined as

$$q^*(s, a) := \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v^*(S)] \big), \qquad (2.7)$$

where $v^*$ is the DR optimal value function in Definition 1.
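To make the dual representation in Lemma 2 concrete, the following minimal Python sketch (ours, not from the paper) evaluates the one-dimensional dual objective $\sup_{\alpha \ge 0}\{-\alpha\log E_{\mu_0}[e^{-X/\alpha}] - \alpha\delta\}$ for a discrete reference measure by a plain grid search; the function names, the grid-search solver, and the grid bounds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def _logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def kl_dual_inf_mean(values, probs, delta, num_alpha=2000, alpha_max=100.0):
    """Dual form in Lemma 2:
        inf_{D_KL(mu||mu_0) <= delta} E_mu[X]
          = sup_{alpha >= 0} { -alpha * log E_{mu_0}[exp(-X/alpha)] - alpha*delta }.
    `values`/`probs` describe the discrete reference measure mu_0 of X.
    A crude grid search over alpha stands in for a proper 1-d solver."""
    values = np.asarray(values, dtype=float)
    logp = np.log(np.asarray(probs, dtype=float))
    best = float(values.min())  # the alpha -> 0+ limit of the dual objective
    for alpha in np.linspace(1e-6, alpha_max, num_alpha):
        obj = -alpha * _logsumexp(logp - values / alpha) - alpha * delta
        best = max(best, obj)
    return best

if __name__ == "__main__":
    vals, probs = np.array([0.0, 1.0]), np.array([0.5, 0.5])
    for delta in [0.0, 0.01, 0.1, 1.0]:
        # delta = 0 approximately recovers the nominal mean 0.5; as delta grows,
        # the robust value moves toward the essential infimum 0.0.
        print(delta, kl_dual_inf_mean(vals, probs, delta))
```

As $\alpha \to 0^+$ the objective tends to the essential infimum of $X$, and as $\alpha \to \infty$ it behaves like $E_{\mu_0}[X] - \alpha\delta$, so for $\delta > 0$ the supremum is attained at a finite $\alpha$; this is the scalar optimization that reappears in the empirical Bellman operators below.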
Similar to the Bellman operator, we can define the DR Bellman operator for the q-function as follows:

Definition 4 Given $\delta > 0$ and $q \in \mathbb{R}^{S \times A}$, the primal form of the DR Bellman operator $\mathcal{T} : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$ is defined as

$$\mathcal{T}(q)(s, a) := \inf_{p \in \mathcal{P}_{s,a}(\delta),\, \nu \in \mathcal{N}_{s,a}(\delta)} \big( E_\nu[R] + \gamma E_p[v(q)(S)] \big). \qquad (2.8)$$

The dual form of the DR Bellman operator is

$$\mathcal{T}(q)(s, a) = \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a}}\big[e^{-v(q)(S)/\beta}\big] - \beta\delta \big\}. \qquad (2.9)$$

The equivalence of the primal and dual forms follows from Lemma 2. We remark that the dual form is usually easier to work with, as the outer supremum is a 1-d optimization problem and the dependence on the reference measures $\nu_{s,a}$ and $p_{s,a}$ is explicit.

Note that by definition (2.7) and the Bellman equation (2.4), we have $v(q^*) = v^*$. So, our definition implies that $q^*$ is a fixed point of $\mathcal{T}$ and the following Bellman equation for the $q^*$-function holds:

$$q^* = \mathcal{T}(q^*). \qquad (2.10)$$

The uniqueness of the fixed point $q^*$ of $\mathcal{T}$ follows from the contraction property of the operator $\mathcal{T}$; c.f. Lemma 19. The optimal DR policy can be extracted from the optimal q-function by $\pi^*(s) = \arg\max_{a \in A} q^*(s, a)$. Hence the goal of the DR-RL paradigm is to learn the DR q-function and extract the corresponding robust policy.

2.4 Synchronous Q-Learning and Stochastic Approximation

Q-learning estimates the optimal q-function by iteratively updating the estimators $\{q_k : k \ge 0\}$ using samples generated by the reference measures. The classical synchronous Q-learning proceeds as follows. At iteration $k \in \mathbb{Z}_{\ge 0}$ and for each $(s, a) \in S \times A$, we draw samples $R_{k+1} \sim \nu_{s,a}$ and $S_{k+1} \sim p_{s,a}$. Then we perform the Q-learning update

$$q_{k+1}(s, a) = (1 - \lambda_k) q_k(s, a) + \lambda_k \big( R_{k+1} + \gamma v(q_k)(S_{k+1}) \big) \qquad (2.11)$$

for some chosen step-size sequence $\{\lambda_k\}$. The synchronous Q-learning can be analyzed as a stochastic approximation (SA) algorithm. SA for the fixed point of a contraction operator $L$ refers to the class of algorithms using the update

$$X_{k+1} = (1 - \lambda_k) X_k + \lambda_k \big( L(X_k) + W_{k+1} \big), \qquad (2.12)$$

where $\{W_k\}$ is a sequence satisfying $E[W_k \,|\, W_{k-1}, \dots, W_1] = 0$ and some higher-order moment conditions, hence known as the martingale difference noise. The asymptotics of the above recursion are well understood in the literature, as discussed in Kushner and Yin (2013). Recent developments on the finite-time/finite-sample behavior of SA are discussed in the literature review. The Q-learning recursion in (2.11) can be represented as an SA update if we notice that, given any q-function, $R + \gamma v(q)(S)$ is an unbiased estimator of the population Bellman operator applied to $q$. However, the DR Q-learning and its variance-reduced version cannot be formulated in the same way as (2.12) with martingale difference noise, as there is bias present in these algorithms. Consequently, to achieve nearly optimal sample complexity bounds, we must conduct a tight analysis of these algorithms as biased SA, as we will explain in the subsequent sections.
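As a point of reference before the robust algorithms, the synchronous update (2.11) can be written in a few lines. The sketch below is ours and assumes a generative-model interface `sample(s, a, rng)` returning one reward drawn from $\nu_{s,a}$ and one next state drawn from $p_{s,a}$; the rescaled linear stepsize matches the one used later by Algorithms 1 and 2.

```python
import numpy as np

def synchronous_q_learning(sample, n_states, n_actions, gamma, num_iters, rng=None):
    """Classical synchronous Q-learning, update (2.11), with stepsize
    lambda_k = 1 / (1 + (1 - gamma) * k).

    sample(s, a, rng) -> (reward, next_state): assumed generative-model
    interface drawing one sample from nu_{s,a} and one from p_{s,a}."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.zeros((n_states, n_actions))
    for k in range(1, num_iters + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)
        target = np.empty_like(q)
        for s in range(n_states):
            for a in range(n_actions):
                r, s_next = sample(s, a, rng)
                # R + gamma * v(q)(S'), an unbiased draw of the Bellman operator
                target[s, a] = r + gamma * np.max(q[s_next])
        q = (1.0 - lam) * q + lam * target
    return q
```

Here `target` is an unbiased draw of the Bellman operator applied to $q_k$; the robust algorithms below replace it with the empirical DR Bellman operator of Definition 5, which is biased.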
3. The DR Q-Learning and Variance Reduction

This section introduces two model-free algorithms, the DR Q-learning (Section 3.1) and its variance-reduced version (Section 3.2), for learning the optimal q-function of a robust MDP. We also present the upper bounds on their worst-case sample complexity. In addition, we outline the fundamental ideas behind the proof of the sample complexity results in Section 3.3.

Prior to presenting the algorithms, we introduce several notations. Let $\nu_{s,a,n}$ and $p_{s,a,n}$ denote the empirical measures of $\nu_{s,a}$ and $p_{s,a}$ formed by $n$ i.i.d. samples, respectively; i.e., for $f : U \to \mathbb{R}$, where $U$ could be $S$ or $R$,

$$E_{\mu_{s,a,n}} f(U) := \frac{1}{n}\sum_{i=1}^{n} f(U_i) \qquad (3.1)$$

for $\mu = \nu, p$ and $U_i = R_i, S_i$ i.i.d. across $i$. Assuming access to a simulator, we are able to draw samples and construct an empirical version of the DR Bellman operator.

Definition 5 Define the empirical DR Bellman operator on $n$ i.i.d. samples by

$$\widehat{\mathcal{T}}(q)(s, a) := \sup_{\alpha \ge 0}\big\{ -\alpha \log E_{\nu_{s,a,n}}\big[e^{-R/\alpha}\big] - \alpha\delta \big\} + \gamma \sup_{\beta \ge 0}\big\{ -\beta \log E_{p_{s,a,n}}\big[e^{-v(q)(S)/\beta}\big] - \beta\delta \big\}. \qquad (3.2)$$

Note that $\widehat{\mathcal{T}}$ is a random operator whose randomness comes from the samples used to construct $\{\nu_{s,a,n}, p_{s,a,n} : (s, a) \in S \times A\}$. Definition 5 presents the empirical DR Bellman operator in its dual form. Lemma 2 establishes that this definition is equivalent to the DR Bellman operator $\mathcal{T}$ in (2.8) with the sets $\mathcal{P}_{s,a}(\delta)$ and $\mathcal{N}_{s,a}(\delta)$ replaced by their empirical counterparts $\{p : D_{\mathrm{KL}}(p\|p_{s,a,n}) \le \delta\}$ and $\{\nu : D_{\mathrm{KL}}(\nu\|\nu_{s,a,n}) \le \delta\}$.

The dual formulation of the empirical DR Bellman operator implies that it is generally a biased estimator of the population DR Bellman operator $\mathcal{T}$, in the sense that $E[\widehat{\mathcal{T}}(q)] \neq \mathcal{T}(q)$ for a generic $q \in \mathbb{R}^{S \times A}$. This bias poses a significant challenge in the design of model-free algorithms and the analysis of sample complexities. The previous works Liu et al. (2022) and Wang et al. (2023b) eliminate this bias by using a randomized multilevel Monte Carlo estimator. However, the randomization procedure requires a random (and heavy-tailed) sample size. Therefore, the complexity bound is stated in terms of the expected number of samples. Also, this complex algorithmic design limits its generalizability. In contrast, this paper takes a different approach by directly analyzing the DR Q-learning and its variance-reduced version as biased SA. To achieve near-optimal sample complexity guarantees, the bias of the empirical DR Bellman operator and the propagation of the systematic error it causes are tightly controlled, and samples are optimally allocated so that the stochasticity is in balance with the cumulative bias. A detailed discussion of this approach is provided in Section 3.3.

To state the key assumption that constrains the operating regime of our algorithms, we introduce the following complexity metric parameter:

Definition 6 Define the minimum support probability as

$$p_\wedge := \min_{(s,a) \in S \times A} \min\Big\{ \min_{r \in R :\, \nu_{s,a}(r) > 0} \nu_{s,a}(r),\ \min_{s' \in S :\, p_{s,a}(s') > 0} p_{s,a}(s') \Big\}. \qquad (3.3)$$

The intuition behind the dependence of the MDP complexity on the minimal support probability is that, in order to estimate the DR Bellman operator with high accuracy in the worst case, it is necessary to know the entire support of the transition and reward distributions. As a result, at least $1/p_\wedge$ samples are required, as discussed in Wang et al. (2023b). We are now prepared to present the main assumption that defines the operating regime for which our algorithms are optimized.

Assumption 1 (Limited Adversarial Power) Suppose the adversary's power $\delta$ satisfies $\delta < p_\wedge / 24$.

It should be noted that the constant 1/24 is only for mathematical convenience and can potentially be improved. Under this assumption, the adversary cannot collapse the support of the transition or reward distributions to a singleton, preventing it from completely ruling out transition events that are possible under $P_0$. This assumption regime is of practical significance because overly conservative policies can be produced if $\delta$ is large. Furthermore, the support of the reward and transition measures often encodes physical constraints intrinsic to the real environment, which the adversary should not be allowed to violate.
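The sketch below illustrates how one batch of the operator in Definition 5 can be evaluated from $n$ samples per state-action pair. It is our own illustration, not the authors' code: the dual suprema over $\alpha$ and $\beta$ are handled by a crude grid search (any 1-d solver would do), and the data layout and function names are assumptions.

```python
import numpy as np

def _kl_dual(samples, delta, grid):
    """sup_{alpha >= 0} { -alpha * log( mean_i exp(-x_i/alpha) ) - alpha*delta }
    over a finite grid of alpha values (stand-in for a 1-d scalar solver)."""
    x = np.asarray(samples, dtype=float)
    best = float(x.min())  # alpha -> 0+ limit
    for alpha in grid:
        z = -x / alpha
        m = z.max()
        lse = m + np.log(np.mean(np.exp(z - m)))   # log of the empirical MGF term
        best = max(best, -alpha * lse - alpha * delta)
    return best

def empirical_dr_bellman(q, reward_samples, next_state_samples, gamma, delta,
                         grid=np.linspace(1e-3, 20.0, 400)):
    """n-sample empirical DR Bellman operator, dual form (3.2).

    reward_samples[s][a]     : array of n i.i.d. rewards drawn from nu_{s,a}
    next_state_samples[s][a] : int array of n i.i.d. next states drawn from p_{s,a}
    """
    n_states, n_actions = q.shape
    v = q.max(axis=1)                       # v(q)(s) = max_b q(s, b), eq. (2.6)
    tq = np.empty_like(q)
    for s in range(n_states):
        for a in range(n_actions):
            r_term = _kl_dual(reward_samples[s][a], delta, grid)
            v_term = _kl_dual(v[next_state_samples[s][a]], delta, grid)
            tq[s, a] = r_term + gamma * v_term
    return tq
```

Because each $(s, a)$ entry only touches its own batch of samples, the batch can be discarded right after the entry is updated, which is the memory advantage over model-based methods noted for the update rule (3.4) below.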
We also make the following simplifying assumption.

Assumption 2 (Reward Bound) The reward $R \in [0, 1]$.

This assumption is straightforward to remove, given that the results for the empirical Bellman operator hold for $R \subseteq \mathbb{R}_{\ge 0}$. We assume it so as to clarify our presentation.

3.1 The Distributionally Robust Q-learning

First, we propose the DR Q-learning Algorithm 1, a robust version of the classical Q-learning that is based on iteratively updating the q-function by applying the n-sample empirical Bellman operator.

Algorithm 1 Distributionally Robust Q-Learning
  Input: the total number of iterations $k_0$ and a batch size $n_0$.
  Initialization: $q_1 \leftarrow 0$; $k \leftarrow 1$.
  for $1 \le k \le k_0$ do
    Sample $\widehat{\mathcal{T}}_{k+1}$, the $n_0$-sample empirical DR Bellman operator as in Definition 5.
    Compute the Q-learning update
      $$q_{k+1} = (1 - \lambda_k) q_k + \lambda_k \widehat{\mathcal{T}}_{k+1}(q_k) \qquad (3.4)$$
    with stepsize $\lambda_k = 1/(1 + (1 - \gamma)k)$.
  end for
  return $q_{k_0+1}$.

Algorithm 1 can be viewed as a biased SA: we can rewrite the update (3.4) as

$$q_{k+1} = (1 - \lambda_k) q_k + \lambda_k \mathcal{T}(q_k) + \lambda_k \big( \widehat{\mathcal{T}}_{k+1}(q_k) - \mathcal{T}(q_k) \big).$$

This is in the form of (2.12). However, notice that $E[\widehat{\mathcal{T}}_{k+1}(q_k) - \mathcal{T}(q_k) \,|\, q_k] \neq 0$.

Moreover, we note that the update (3.4) involves computing $\widehat{\mathcal{T}}_{k+1}(q_k)(s, a)$ for all $(s, a) \in S \times A$. Unlike a model-based algorithm, which requires storing the entire empirical kernel and reward distributions $\{p_{s,a,n}, \nu_{s,a,n} : (s, a) \in S \times A\}$, the update rule (3.4) can be implemented separately for each state-action pair. This allows $p_{s,a,n}$ and $\nu_{s,a,n}$ to be discarded immediately after the update, significantly reducing the memory requirements for running Algorithm 1 when the state space is large.

It turns out that, by leveraging the fact that the empirical Bellman operators are monotone contractions w.p.1 (as proven in Lemma 19), we can perform a stronger pathwise analysis of Algorithm 1 instead of treating it as a variant of the SA update in (2.12). As a result, we will prove in Section B.2 that the DR Q-learning algorithm satisfies the error bound in Proposition 7 below.

To simplify notation, we define the dimensionality parameter $d := |S||A|(|S| \vee |R|)$. It will only show up inside the $\log(\cdot)$ terms in our complexity bounds because of the use of union bound techniques.

Proposition 7 Suppose that Assumptions 1 and 2 are satisfied. The output $q_{k_0+1}$ of the distributionally robust Q-learning satisfies

$$\|q_{k_0+1} - q^*\|_\infty \le c\left( \frac{1}{(1-\gamma)^3 k_0} + \frac{1}{p_\wedge^3 (1-\gamma)^2 n_0} + \frac{1}{p_\wedge (1-\gamma)^{5/2} \sqrt{n_0 k_0}} \right) \big(\log(3dk_0/\eta)\big)^2$$

with probability at least $1 - \eta$, where $c$ is an absolute constant.

By "absolute constant", we mean a constant that does not depend on the complexity metric parameters $\epsilon, p_\wedge, (1-\gamma)^{-1}, \eta, d$. Although the logarithmic term in the above proposition can be further improved, we will not focus on optimizing the logarithmic dependence in this paper. For clarity, we adjust the constant in the logarithmic factor using the inequality $\log(C_1 C_2) = \log(C_1) + \log(C_2) \le C_1 \log(C_2)$ for $C_1 \ge 1$, $C_2 \ge e$, and incorporate $C_1$ into $c$. These adjustments are applied to all subsequent convergence results. The proof of this proposition, which is outlined in Section 3.3, is postponed to Section B.2.

Proposition 7 provides an upper bound on the terminal error of the estimator after $k_0$ iterations of Algorithm 1.
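For concreteness, here is a short sketch of Algorithm 1. It is our own illustration under stated assumptions: `bellman_op` is any routine computing the $n_0$-sample empirical DR Bellman operator (e.g., the sketch given after Definition 5), and `sample_batch` is an assumed generative-model interface; neither name comes from the paper.

```python
import numpy as np

def dr_q_learning(bellman_op, sample_batch, n_states, n_actions, gamma, k0, n0, rng=None):
    """Algorithm 1 (DR Q-learning), a minimal sketch.

    bellman_op(q, rewards, next_states) -> (n_states, n_actions) array:
        empirical DR Bellman operator evaluated on one batch (Definition 5).
    sample_batch(n, rng) -> (rewards, next_states):
        assumed simulator giving, for every (s, a), n i.i.d. rewards from
        nu_{s,a} and n i.i.d. next states from p_{s,a}."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.zeros((n_states, n_actions))
    for k in range(1, k0 + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)        # lambda_k = 1/(1+(1-gamma)k)
        rewards, next_states = sample_batch(n0, rng)  # fresh n0-sample batch
        q = (1.0 - lam) * q + lam * bellman_op(q, rewards, next_states)  # update (3.4)
    return q
```

The total number of simulator draws is $|S||A| n_0 k_0$, which is exactly the quantity counted in the proof of Theorem 9 below.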
This bound consists of three terms that decay at rates $O(k_0^{-1})$, $O(n_0^{-1})$, and $O((k_0 n_0)^{-1/2})$, respectively, where the first and third terms resemble the upper bounds for standard Q-learning and the second term arises because of the bias. We optimize the algorithm parameters to balance these three terms and ensure that the right-hand side of the probability bound in Proposition 7 is less than $\epsilon$. One way to achieve this is by selecting the parameters $n_0$ and $k_0$ as follows:

Corollary 8 Assume Assumptions 1 and 2. Running Algorithm 1 with parameters

$$k_0 = c_0 \frac{1}{(1-\gamma)^3 \epsilon} \log\Big(\frac{3d}{(1-\gamma)\eta\epsilon}\Big)^2 \quad \text{and} \quad n_0 = c_0 \frac{1}{p_\wedge^3 (1-\gamma)^2 \epsilon} \log(3dk_0/\eta)^2$$

will produce an output $q_{k_0+1}$ s.t. $\|q_{k_0+1} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$, where $c_0$ is an absolute constant.

An immediate consequence of Corollary 8 is the following worst-case sample complexity upper bound for the robust Q-learning.

Theorem 9 Assume Assumptions 1 and 2. Then the distributionally robust Q-learning Algorithm 1 with parameters specified in Corollary 8 computes a solution $q_{k_0+1}$ s.t. $\|q_{k_0+1} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^3 (1-\gamma)^5 \epsilon^2} \right)$$

samples.

Proof The total number of samples used is $|S||A| n_0 k_0$, implying the sample complexity upper bound.

Theorem 9 provides a near-optimal worst-case sample complexity guarantee that matches or improves upon the expected-sample-complexity upper bound in Wang et al. (2023b) in every parameter dependence. In particular, we have shown that the dependence on $\delta$ is $O(1)$ as $\delta \to 0$. This resolves the issue of the worst-case complexity bound blowing up as $\delta \to 0$ for KL-divergence-based DR-RL that is present in all prior works (Yang et al., 2021; Panaganti and Kalathil, 2021; Shi and Chi, 2022; Wang et al., 2023b).

3.2 The Variance-Reduced Distributionally Robust Q-learning

We adapt Wainwright's variance-reduced Q-learning (Wainwright, 2019a) to the robust RL setting. This is outlined in Algorithm 2.

Algorithm 2 Variance-Reduced Distributionally Robust Q-Learning
  Input: the number of epochs $l_{vr}$, a sequence of recentering sample sizes $\{m_l\}_{l=1}^{l_{vr}}$, an epoch length $k_{vr}$, and a batch size $n_{vr}$.
  Initialization: $\hat q_0 \leftarrow 0$; $l \leftarrow 1$; $k \leftarrow 1$.
  for $1 \le l \le l_{vr}$ do
    Compute $\widetilde{\mathcal{T}}_l$, the $m_l$-sample empirical DR Bellman operator as in Definition 5.
    Set $q_{l,1} = \hat q_{l-1}$.
    for $1 \le k \le k_{vr}$ do
      Sample $\widehat{\mathcal{T}}_{l,k+1}$, an $n_{vr}$-sample empirical Bellman operator.
      Compute the recentered Q-learning update
        $$q_{l,k+1} = (1 - \lambda_k) q_{l,k} + \lambda_k \Big( \widehat{\mathcal{T}}_{l,k+1}(q_{l,k}) - \widehat{\mathcal{T}}_{l,k+1}(\hat q_{l-1}) + \widetilde{\mathcal{T}}_l(\hat q_{l-1}) \Big) \qquad (3.5)$$
      with stepsize $\lambda_k = 1/(1 + (1 - \gamma)k)$.
    end for
    Set $\hat q_l = q_{l, k_{vr}+1}$.
  end for
  return $\hat q_{l_{vr}}$.

As in the Q-learning case, the update rule (3.5) can be implemented separately for each state-action pair. Thus, Algorithm 2 does not require storing or performing computations with the entire empirical kernel and reward distributions.

Before delving into the convergence rate theory of the variance-reduced DR Q-learning, we provide an intuitive description of this variance reduction scheme. The basic idea is to partition the algorithm into epochs. During each epoch, we perform a recentered version of the stochastic approximation recursion with the aim of eliminating the variance component in the SA iteration (2.12). Specifically, instead of approximating $q^*$ by a single stochastic approximation run, in each epoch, starting with an estimator $\hat q_{l-1}$, we recenter the SA procedure so that it approximates $\mathcal{T}(\hat q_{l-1})$. However, since $\mathcal{T}$ is not known, we use $\widetilde{\mathcal{T}}_l(\hat q_{l-1})$ as a natural estimator. By choosing a sequence of empirical DR Bellman operators with exponentially increasing sample sizes, we expect the errors $\|\hat q_l - q^*\|_\infty$ to decrease exponentially with high probability. This indeed holds true for Algorithm 2.
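The epoch structure of Algorithm 2 translates directly into code. The sketch below is ours; as before, `bellman_op` and `sample_batch` are assumed interfaces (an empirical DR Bellman routine and a generative-model simulator), and the recentering estimate is recomputed once per epoch from a larger batch.

```python
import numpy as np

def vr_dr_q_learning(bellman_op, sample_batch, n_states, n_actions, gamma,
                     l_vr, k_vr, n_vr, m_sizes, rng=None):
    """Algorithm 2 (variance-reduced DR Q-learning), a minimal sketch.

    bellman_op(q, rewards, next_states): empirical DR Bellman operator on one batch.
    sample_batch(n, rng): assumed simulator giving n i.i.d. reward/next-state
        samples for every (s, a).
    m_sizes: recentering sample sizes (m_1, ..., m_{l_vr}), e.g. growing geometrically."""
    rng = np.random.default_rng() if rng is None else rng
    q_hat = np.zeros((n_states, n_actions))
    for l in range(1, l_vr + 1):
        # Epoch recentering: T~_l(q_hat_{l-1}) from a large m_l-sample batch.
        rewards, next_states = sample_batch(m_sizes[l - 1], rng)
        t_center = bellman_op(q_hat, rewards, next_states)
        q = q_hat.copy()
        for k in range(1, k_vr + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)
            rewards, next_states = sample_batch(n_vr, rng)
            # Recentered update (3.5): T^(q_{l,k}) - T^(q_hat_{l-1}) + T~_l(q_hat_{l-1}),
            # with both T^ terms evaluated on the same n_vr-sample batch.
            recentered = (bellman_op(q, rewards, next_states)
                          - bellman_op(q_hat, rewards, next_states) + t_center)
            q = (1.0 - lam) * q + lam * recentered
        q_hat = q  # epoch output, q_hat_l
    return q_hat
```

Because the noise in (3.5) enters only through the difference $\widehat{\mathcal{T}}_{l,k+1}(q_{l,k}) - \widehat{\mathcal{T}}_{l,k+1}(\hat q_{l-1})$, its magnitude shrinks as $q_{l,k}$ approaches $\hat q_{l-1}$, which is what drives the epoch-wise error halving in Proposition 11.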
The outer loop produces a sequence of estimators $\hat q_l$, $l \ge 1$. We will show that if $\hat q_{l-1}$ is within some error of the optimal $q^*$, then $\hat q_l$ will satisfy a concentration bound that is better by a geometric factor. This result is summarized in Proposition 10. Denote by $\mathcal{F}_l$ the $\sigma$-field generated by the random samples used until the end of epoch $l$. We define the conditional expectation $E_{l-1}[\cdot] := E[\cdot \,|\, \mathcal{F}_{l-1}]$ and probability measure $P_{l-1}(\cdot) := E_{l-1}[\mathbf{1}\{\cdot\}]$.

Proposition 10 Suppose that Assumptions 1 and 2 are satisfied. On $\{\omega : \|\hat q_{l-1} - q^*\|_\infty \le b\}$ for some $b \le 1/(1-\gamma)$, under the measure $P_{l-1}(\cdot)(\omega)$, there exists a numerical constant $c$ s.t.

$$\|\hat q_l - q^*\|_\infty \le c\left( \frac{b}{(1-\gamma)^2 k_{vr}} + \frac{b}{p_\wedge^{3/2} (1-\gamma)^{3/2} \sqrt{n_{vr} k_{vr}}} + \frac{b}{p_\wedge^{3/2} (1-\gamma) \sqrt{n_{vr}}} \right) \log(3dk_{vr}/\eta)^2 + \frac{c}{p_\wedge^{3/2} (1-\gamma)^2 \sqrt{m_l}}$$

w.p. at least $1 - \eta$, provided that $m_l \ge 8 p_\wedge^{-2} \log(24d/\eta)$ and $n_{vr} \ge p_\wedge^{-1}$.

Proposition 10 implies that if the variance-reduced algorithm finds an approximation of $q^*$ within infinity-norm error $b$, then the error after one epoch improves accordingly with high probability. This and the Markovian nature of the sequence $\{\hat q_l\}$ imply a high-probability bound for trajectories satisfying the pathwise property $\{\omega : \forall\, l \le l_{vr} : \|\hat q_l - q^*\|_\infty \le b_l\}$. This is formalized by the next result, where we use $b_l = 2^{-l}(1-\gamma)^{-1}$.

Let us define the following parameter choice: for a sufficiently large absolute constant $c_{vr}$ that does not depend on the complexity metric parameters $\epsilon, p_\wedge, (1-\gamma)^{-1}, \eta, d$, define $l_{vr} = \lceil \log_2(\epsilon^{-1}(1-\gamma)^{-1}) \rceil$ and

$$k_{vr} = c_{vr} \frac{1}{(1-\gamma)^2} \log\Big(\frac{3dl_{vr}}{(1-\gamma)\eta}\Big), \quad n_{vr} = c_{vr} \frac{1}{p_\wedge^3 (1-\gamma)^2} \log(3dk_{vr}l_{vr}/\eta)^4, \quad m_l = c_{vr} \frac{4^l}{p_\wedge^3 (1-\gamma)^2} \log(3dl_{vr}/\eta)^2. \qquad (3.6)$$

Notice that evidently $m_l \ge 8 p_\wedge^{-2} \log(24d/\eta)$ and $n_{vr} \ge p_\wedge^{-1}$, satisfying the requirements of Proposition 10.

Proposition 11 Assume Assumptions 1 and 2. For $\epsilon < (1-\gamma)^{-1}$, define the parameters according to (3.6). Then, the sequence $\{\hat q_l, 0 \le l \le l_{vr}\}$ produced by Algorithm 2 satisfies the pathwise property that $\|\hat q_l - q^*\|_\infty \le 2^{-l}(1-\gamma)^{-1}$ for all $0 \le l \le l_{vr}$ w.p. at least $1 - \eta$. In particular, the final estimator $\hat q_{l_{vr}}$ satisfies $\|\hat q_{l_{vr}} - q^*\|_\infty \le 2^{-l_{vr}}(1-\gamma)^{-1}$ w.p. at least $1 - \eta$.

Remark 12 The base of the geometric growth in our choice of $m_l$ in (3.6) can be modified. The same proof as in Proposition 11 suggests that with $m_l = \alpha^{2l}\,\Theta(p_\wedge^{-3}(1-\gamma)^{-2})$ and $l_{vr} = \lceil \log_\alpha(\epsilon^{-1}(1-\gamma)^{-1}) \rceil$ for some $\alpha > 1$, we have $\|\hat q_l - q^*\|_\infty \le \alpha^{-l}(1-\gamma)^{-1}$ for all $0 \le l \le l_{vr}$ with probability at least $1 - \eta$. Running Algorithm 2 with this new parameter choice will yield the same sample complexity as in Theorem 13. The choice of base 4 in (3.6) was made only for clarity of presentation.

Proposition 11 immediately implies the following worst-case sample complexity upper bound.

Theorem 13 Assume Assumptions 1 and 2. For $\epsilon < (1-\gamma)^{-1}$, the variance-reduced DR Q-learning Algorithm 2 with parameters specified in (3.6) computes a solution $\hat q_{l_{vr}}$ s.t. $\|\hat q_{l_{vr}} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^3 (1-\gamma)^4 \min(1, \epsilon^2)} \right)$$

samples.

Proof Given the specified parameters, the total number of samples used is

$$|S||A|\Big( l_{vr} n_{vr} k_{vr} + \sum_{l=1}^{l_{vr}} m_l \Big) = O\left( |S||A| \frac{1}{p_\wedge^3}\Big( \frac{1}{(1-\gamma)^4} + \frac{4^{l_{vr}}}{(1-\gamma)^2} \Big) \right),$$

up to logarithmic factors. Since $4^{l_{vr}} = O(\epsilon^{-2}(1-\gamma)^{-2} \vee 1)$, this simplifies to the claimed result.

Theorem 13 establishes an upper bound of $O(|S||A|(1-\gamma)^{-4}\epsilon^{-2}p_\wedge^{-3})$ when $\epsilon \le 1$, which is superior to the upper bound $O(|S||A|(1-\gamma)^{-5}\epsilon^{-2}p_\wedge^{-3})$ for Algorithm 1 (see Theorem 9) in terms of $1-\gamma$. This represents the best-known upper bound for DR-RL problems in the KL case, including both model-free and model-based algorithms (Shi and Chi, 2022). Although Shi and Chi (2022) achieve a similar rate of $O((1-\gamma)^{-4})$, their result suffers from an $O(\delta^{-2})$ dependence, which becomes problematic as $\delta \to 0$. In contrast, our upper bound is free from $\delta$-dependence.
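The schedule (3.6) and the sample-count arithmetic in the proof of Theorem 13 can be checked numerically. The sketch below is illustrative only: the absolute constant $c_{vr}$ is unspecified in the paper, so the value `c_vr=1.0` here is an assumption, as are the function names.

```python
import numpy as np

def vr_parameter_schedule(eps, gamma, p_min, eta, d, c_vr=1.0):
    """Parameter choice (3.6), with l_vr chosen so that 2^{-l_vr}/(1-gamma) <= eps.
    c_vr stands in for the paper's unspecified absolute constant."""
    l_vr = max(1, int(np.ceil(np.log2(1.0 / (eps * (1.0 - gamma))))))
    k_vr = int(np.ceil(c_vr / (1.0 - gamma) ** 2
                       * np.log(3.0 * d * l_vr / ((1.0 - gamma) * eta))))
    n_vr = int(np.ceil(c_vr / (p_min ** 3 * (1.0 - gamma) ** 2)
                       * np.log(3.0 * d * k_vr * l_vr / eta) ** 4))
    m = [int(np.ceil(c_vr * 4.0 ** l / (p_min ** 3 * (1.0 - gamma) ** 2)
                     * np.log(3.0 * d * l_vr / eta) ** 2)) for l in range(1, l_vr + 1)]
    return l_vr, k_vr, n_vr, m

def total_samples(n_states, n_actions, l_vr, k_vr, n_vr, m):
    """|S||A| (l_vr * n_vr * k_vr + sum_l m_l), as counted in Theorem 13's proof."""
    return n_states * n_actions * (l_vr * n_vr * k_vr + sum(m))

if __name__ == "__main__":
    for eps in [0.5, 0.25, 0.125]:
        sched = vr_parameter_schedule(eps, gamma=0.9, p_min=0.1, eta=0.05, d=100)
        # Halving eps roughly quadruples the total, matching the eps^{-2} scaling.
        print(eps, total_samples(2, 2, *sched))
```

Note that $k_{vr}$ and $n_{vr}$ depend on $\epsilon$ only through $\log(l_{vr})$, so the dominant effect of tightening $\epsilon$ is the growth of the last recentering batch $m_{l_{vr}} \propto 4^{l_{vr}} \approx \epsilon^{-2}(1-\gamma)^{-2}$.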
We recall that the information-theoretic lower bound for the sample complexity of the classical tabular RL problem is $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ (Azar et al., 2013). In this setting, the variance-reduced Q-learning algorithm in Wainwright (2019a) is minimax optimal. For distributionally robust RL, Shi and Chi (2022) recently showed that the minimax lower bound dependence on $|S||A|$, $(1-\gamma)^{-1}$, and $\epsilon$ remains $\Omega(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ when $\delta$ is small. Furthermore, Shi et al. (2024) showed that the information-theoretic lower bound may be $\Omega(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$ when $\delta = O(1)$ for $\chi^2$-divergence uncertainty sets. However, their construction of hard instances violates our Assumption 1. It is currently unknown whether variance-reduced DR Q-learning can achieve those rates. Further refinement of this bound is left for future research.

Notice that the variance-reduced Algorithm 2 has the property that $k_{vr}$, $n_{vr}$, and $m_l$ depend on $\frac{1}{\epsilon}$ only through $\log(l_{vr}) = \Theta(\log\log\frac{1}{\epsilon})$. Therefore, within a reasonable range of $\epsilon$, the algorithm can operate with the sample complexity guarantee in Theorem 13 without needing to tune $k_{vr}$, $n_{vr}$, and $m_l$ based on $\epsilon$. This introduces significant versatility in applications: for example, we can continue to run the algorithm beyond the termination epoch $l_{vr}$ without losing sample efficiency.

3.3 Overview of the Analysis of Algorithms

In this section, we provide a road map for proving the key results, Propositions 7 and 10.

Definition 14 We say that $L$ is a monotonic $\gamma$-quasi-contraction with center $\bar q$ if

$$\|L(q) - L(\bar q)\|_\infty \le \gamma \|q - \bar q\|_\infty, \qquad (3.7)$$

and, entrywise,

$$q_1 \le q_2 \implies L(q_1) \le L(q_2) \qquad (3.8)$$

for all $q, q_1, q_2 \in \mathbb{R}^{|S| \times |A|}$. Moreover, a monotonic $\gamma$-contraction is such that the above relations hold for every center $\bar q \in \mathbb{R}^{|S| \times |A|}$.

The term "quasi" refers to the fact that the relation (3.7) is only required for a single center $\bar q$ (Wainwright, 2019b). Therefore, a monotonic $\gamma$-contraction is a quasi-contraction with center $\bar q$ for any $\bar q \in \mathbb{R}^{S \times A}$. The successive application of monotonic $\gamma$-quasi-contractions under the rescaled linear stepsize $\lambda_k = \frac{1}{1 + (1-\gamma)k}$ satisfies the following deterministic bound:

Proposition 15 (Corollary 1, Wainwright (2019b)) Let $\{L_k, k \ge 2\}$ be a family of monotonic $\gamma$-quasi-contractions with center $q^*$, and let $H_k(q) := L_k(q) - L_k(q^*)$ be the recentered operator. Then, for the sequence of step sizes $\{\lambda_k, k \ge 1\}$, the iterates of

$$q_{k+1} - q^* = (1 - \lambda_k)(q_k - q^*) + \lambda_k [H_{k+1}(q_k) + w_{k+1}] \qquad (3.9)$$

admit, for all $k \ge 1$, a deterministic pathwise error bound in terms of the noise-averaging sequence $\{p_k, k \ge 1\}$ defined by $p_1 = 0$ and $p_{k+1} := (1 - \lambda_k)p_k + \lambda_k w_{k+1}$.

A key observation is that the empirical robust Bellman operators $\widehat{\mathcal{T}}_k$, $\widehat{\mathcal{T}}_{l,k}$ used in the iterative updates of Algorithms 1 and 2 are monotonic $\gamma$-contractions (see Lemma 19). In the proof of the main results, we apply the deterministic bound for contraction mappings from Proposition 15 to each sample path of the distributionally robust Q-learning and of the inner loop of the variance-reduced version.

We illustrate this by considering the distributionally robust Q-learning. Since $\{\widehat{\mathcal{T}}_{k+1}, k \ge 0\}$ are monotonic $\gamma$-contractions, they are quasi-contractions with center $q^*$. We can define $H_{k+1}(q) := \widehat{\mathcal{T}}_{k+1}(q) - \widehat{\mathcal{T}}_{k+1}(q^*)$ for all $q \in \mathbb{R}^{S \times A}$. Then, the update rule of Algorithm 1 can be written as

$$q_{k+1} - q^* = (1 - \lambda_k)(q_k - q^*) + \lambda_k \big[ (\widehat{\mathcal{T}}_{k+1}(q_k) - \widehat{\mathcal{T}}_{k+1}(q^*)) + (\widehat{\mathcal{T}}_{k+1}(q^*) - \mathcal{T}(q^*)) \big] = (1 - \lambda_k)(q_k - q^*) + \lambda_k [H_{k+1}(q_k) + W_{k+1}],$$

where $W_{k+1} := \widehat{\mathcal{T}}_{k+1}(q^*) - \mathcal{T}(q^*)$ and we used the Bellman equation (2.10), $q^* = \mathcal{T}(q^*)$.
This representation allows us to apply Proposition 15 to bound the error of the q-function estimate using the sequence $P_1 = 0$ and $P_{k+1} := (1 - \lambda_k)P_k + \lambda_k W_{k+1}$. Note that the only source of randomness in $W_{k+1}$ is $\widehat{\mathcal{T}}_{k+1}(q^*)$, and these are i.i.d. across $k$. Therefore, the process $P$ is a non-stationary auto-regressive (AR) process. It follows that the concentration properties of $P_k$ can be derived from those of $\widehat{\mathcal{T}}_{k+1}(q^*)$.

While standard Q-learning updates utilize an unbiased empirical Bellman operator, the DR empirical Bellman operator is biased due to its non-linearity in the empirical measure (c.f. (3.2)), resulting in $E[W_k] \neq 0$. To achieve a canonical error rate of $O(n^{-1/2})$, it is necessary that both the bias and the standard deviation of the n-sample DR empirical Bellman operator are $O(n^{-1/2})$. However, our DR Q-learning algorithms require an additional condition: the one-step bias must be of order $O(n^{-1})$. This is because the final bias, which is the systematic error resulting from the repeated use of the DR Bellman estimator, is compounded from the one-step bias through the model-free Q-learning updates. We illustrate this algorithmic behavior using a simple biased SA instance in Appendix B.1. This imposes significant challenges on the design and analysis of our model-free algorithms. Fortunately, we are able to establish tight bounds (in $n_0$ and $\delta$) on the bias, c.f. Proposition 22, in the important regime where $\delta$ is small, as stated in Assumption 1. These bounds are central to our sample complexity analysis. We summarize the relevant bounds on the variance and bias of the empirical DR Bellman operator in Section A. By utilizing these variance and bias bounds, we can efficiently allocate samples such that the systematic error due to bias is balanced with the stochasticity in the estimator at the termination of the algorithm. With this optimal sample allocation, we can establish the worst-case sample complexity bounds as claimed.

The theory for the convergence rate of the variance-reduced DR Q-learning is more complex. In order to achieve the geometric convergence in Proposition 11, an $O(n^{-1})$ bias bound on the empirical DR Bellman operator is not enough. However, by introducing a recentered dynamics, a similar recursion can be derived in this context if we consider the conditionally recentered noise $H_{l,k+1}(\hat q_{l-1}) - E[H_{l,k+1}(\hat q_{l-1}) \,|\, \hat q_{l-1}]$ and a random bias (denoted by $D_l$ in Appendix B.3). For details, please refer to Appendix B.3.

4. Numerical Experiments

This section presents a numerical validation of our theoretical findings regarding the convergence properties of the proposed algorithms. We conduct a comparative analysis between our algorithms and MLMC DR Q-learning, as studied in Wang et al. (2023b). Additionally, we investigate the complexity of Algorithm 2 as the adversary's power $\delta \to 0$.

Section 4.1 demonstrates convergence and compares the proposed algorithms with multilevel Monte Carlo distributionally robust (MLMC DR) Q-learning. We use the hard MDP instances constructed in Li et al. (2021), on which standard Q-learning performs at its worst-case complexity dependence of $\Omega((1-\gamma)^{-4}\epsilon^{-2})$. Both algorithms in this paper exhibit the canonical convergence rate of $O(\epsilon^{-2})$, with the variance-reduced version displaying superior performance.
In Section 4.2, we test the stability of the sample complexity of the variance-reduced DR Q-learning Algorithm 2 as $\delta \to 0$ using a simple DRMDP instance. In the subsequent developments, we use $m_l = 2^l (1-\gamma)^{-2}$ for the variance-reduced Algorithm 2. As explained in Remark 12, this choice (up to a log factor) yields the same complexity guarantee as stated in Theorem 13. An advantage of this parameter choice is that it allows us to run more epochs for the plots, thereby clarifying the convergence behavior.

4.1 Hard MDPs for Q-learning

Figure 1: Transition diagram of the hard MDP instance for Q-learning.

First, we demonstrate the convergence of the proposed algorithms using the MDP instance shown in Figure 1. This MDP has 4 states and 2 actions, with the transition probabilities of actions 1 and 2 labeled on the arrows between states. Constructed in Li et al. (2021), it is shown there that when $p = \frac{4\gamma - 1}{3\gamma}$, standard non-robust Q-learning has a sample complexity of $\Theta((1-\gamma)^{-4}\epsilon^{-2})$.

Figure 2: Convergence of Algorithms 1 and 2 on the MDP instance of Figure 1. (a) DR Q-learning; (b) variance-reduced DR Q-learning.

Figures 2a and 2b depict the convergence properties of the two algorithms for $\gamma \in \{0.93, 0.95\}$ and $\delta = 0.1$. These figures show the averaged (over 4000 replications) error of the output q-function in the infinity norm plotted against the averaged (over 4000 replications) number of samples used, both on a log-log scale. The parameters for DR Q-learning in Figure 2a are set according to Corollary 8. On the other hand, Figure 2b plots the averaged error achieved by the variance-reduced algorithm after each epoch against the total number of samples used. The figures indicate that both algorithms converge to the optimal robust $q^*$, with the variance-reduced algorithm outperforming DR Q-learning. Additionally, when comparing the log-log error plot with a reference line of slope $-1/2$, we observe that the log error for both algorithms decays at a rate of $1/2$ as the log of the number of samples increases. This behavior aligns with the $\epsilon^{-2}$ dependence of the sample complexity bounds in Theorems 9 and 13, corresponding to the canonical convergence rate of Monte Carlo estimation, $O(n^{-1/2})$.

Remark 16 With $\delta = 0.1$ and $\gamma = 0.93$ or $0.95$, the DRMDP instances do not satisfy Assumption 1. However, the figures still show the canonical $n^{-1/2}$ convergence rate, suggesting that our proposed algorithms might perform well even outside the regime prescribed by Assumption 1.

Figure 3: Comparison of Algorithms 1 and 2 with the MLMC DR Q-learning on the MDP of Figure 1.

Figure 3 compares the performance of the algorithms proposed in this paper with the MLMC DR Q-learning in Wang et al. (2023b). We observe the performance comparison of three Q-learning methods: MLMC DR, DR, and DR-VR, for $\gamma \in \{0.6, 0.7\}$. The results indicate that the distributionally robust variance-reduced Q-learning approach achieves the smallest errors. Although our DR Q-learning method shows slightly lower expected performance than the MLMC DR Q-learning, it is worth noting that the curve corresponding to MLMC DR Q-learning is considerably rougher. This suggests that the MLMC DR Q-learning approach has a higher degree of variability in its performance.

4.2 Testing the Small δ Regime

We proceed to empirically demonstrate the stability of the sample complexity of Algorithm 2 as $\delta \to 0$. First, we introduce a family of MDP instances.
Define reference MDPs with $S = \{1, 2\}$, $A = \{a_1, a_2\}$, transition kernel

$$P_{0,a_1} = P_{0,a_2} = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}, \qquad (4.1)$$

and deterministic reward function $r(1, \cdot) = 1$ and $r(2, \cdot) = 0$. For any positive adversarial power level $\delta$, the worst-case transition kernel chosen by the adversary is

$$P_{\delta,a_1} = P_{\delta,a_2} = \begin{pmatrix} q(\delta) & 1 - q(\delta) \\ q(\delta) & 1 - q(\delta) \end{pmatrix},$$

where $q(\delta) < 1/2$ and $q(\delta) \to 1/2$ as $\delta \to 0$. In a classical tabular RL setting, this worst-case MDP ($\delta > 0$) should be easier to learn compared to (4.1), c.f. (Khamaru et al., 2021; Wang et al., 2023a).

Figure 4: Testing the sample complexity behavior as $\delta \to 0$.

Using this DRMDP instance, we plot the average number of samples required to achieve a fixed error $\epsilon$ while varying $\delta$, as shown in Figure 4. We observe that the average number of samples increases as $\delta \to 0$, because the worst-case MDP converges to the instance in (4.1), which is more challenging to learn. Additionally, the number of samples needed to reach the target error level becomes insensitive to increasingly small $\delta$ when $\delta \le 10^{-2}$, confirming the theoretical results presented in this paper.

5. Extension: χ² Divergence Ambiguity Sets

We extend the variance-reduced version of the Q-learning Algorithm 2 to the setting where the adversary is constrained to perturbations within $\chi^2$ divergence balls of radius $\delta$. The $\chi^2$ divergence is defined for $Q \ll P$ as

$$D_{\chi^2}(Q\|P) := \frac{1}{2}\int_\Omega \Big( \frac{dQ}{dP}(\omega) - 1 \Big)^2 P(d\omega). \qquad (5.1)$$

Note that we follow the convention in Duchi and Namkoong (2021) and include a factor of 1/2 in (5.1). We reuse the notation of the KL case in the discussion of this section. In particular, for each $(s, a) \in S \times A$ and $\delta > 0$, we define $\chi^2$ ambiguity sets analogous to (2.2) as

$$\mathcal{P}_{s,a}(\delta) := \{p : D_{\chi^2}(p\|p_{s,a}) \le \delta\}, \qquad \mathcal{N}_{s,a}(\delta) := \{\nu : D_{\chi^2}(\nu\|\nu_{s,a}) \le \delta\}. \qquad (5.2)$$

For the $\chi^2$ divergence defined in (5.1), we have the following strong duality.

Lemma 17 (Duchi and Namkoong (2021), Lemma 1) Let $X$ be a random variable and $\mu_0$ a probability measure on $(\Omega, \mathcal{F})$. Then, for any $\delta > 0$,

$$\inf_{\mu : D_{\chi^2}(\mu\|\mu_0) \le \delta} E_\mu[X] = \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\mu_0}\big[(\alpha - X)_+^2\big]^{1/2} \Big\}, \qquad (5.3)$$

where $c(\delta) = \sqrt{1 + 2\delta}$ and $(\cdot)_+ := \max\{\cdot, 0\}$. We note that the dual variable $\alpha$ can be optimized within $\alpha \ge \operatorname{ess\,inf}_{\mu_0} X$.

We wish to learn the optimal q-function as defined in (2.7). To achieve this, we use the DR Bellman equation for the q-function (2.10), where the dual form of the Bellman operator $\mathcal{T} : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$ in the $\chi^2$ case is given by

$$\mathcal{T}(q)(s, a) := \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\nu_{s,a}}\big[(\alpha - R)_+^2\big]^{1/2} \Big\} + \gamma \sup_{\beta \in \mathbb{R}}\Big\{ \beta - c(\delta)\, E_{p_{s,a}}\big[(\beta - v(q)(S))_+^2\big]^{1/2} \Big\}. \qquad (5.4)$$

Then, the empirical Bellman operator $\widehat{\mathcal{T}}$ is defined, as in (3.2), using this dual representation:

$$\widehat{\mathcal{T}}(q)(s, a) := \sup_{\alpha \in \mathbb{R}}\Big\{ \alpha - c(\delta)\, E_{\nu_{s,a,n}}\big[(\alpha - R)_+^2\big]^{1/2} \Big\} + \gamma \sup_{\beta \in \mathbb{R}}\Big\{ \beta - c(\delta)\, E_{p_{s,a,n}}\big[(\beta - v(q)(S))_+^2\big]^{1/2} \Big\}, \qquad (5.5)$$

where the empirical measures and expectations are defined in (3.1).

Recall the definition of the minimum support probability $p_\wedge$ in (3.3). As in the KL case, we also consider the regime where $\delta = O(p_\wedge)$:

Assumption 3 (Limited Adversarial Power) Suppose the adversary's power $\delta$ satisfies $\delta < p_\wedge / 24$.
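The scalar dual problem in Lemma 17 (and thus each entry of (5.4)-(5.5)) can be evaluated directly from samples. The sketch below is our own illustration: the grid search and its bounds are assumptions, chosen using the facts that the objective is concave in $\alpha$ and that, once $\alpha$ exceeds all sample values, its unconstrained maximizer under the empirical measure is $\bar X + \sqrt{\mathrm{Var}(X)/(2\delta)}$.

```python
import numpy as np

def chi2_dual_inf_mean(samples, delta, num_alpha=4000):
    """Empirical version of the dual in Lemma 17:
        inf_{D_chi2(mu||mu_0) <= delta} E_mu[X]
          = sup_alpha { alpha - sqrt(1 + 2*delta) * sqrt( E_mu0[(alpha - X)_+^2] ) },
    with mu_0 replaced by the empirical measure of `samples`."""
    x = np.asarray(samples, dtype=float)
    if delta == 0.0:
        return float(x.mean())          # no adversary: the value is the sample mean
    c = np.sqrt(1.0 + 2.0 * delta)
    # Beyond max(x) the objective is smooth and maximized at mean + sqrt(var/(2*delta)),
    # so the grid below brackets the optimum of the concave objective.
    hi = max(float(x.max()), float(x.mean() + np.sqrt(x.var() / (2.0 * delta)))) + 1.0
    grid = np.linspace(float(x.min()), hi, num_alpha)
    penalty = np.sqrt(np.mean(np.maximum(grid[:, None] - x[None, :], 0.0) ** 2, axis=1))
    return float(np.max(grid - c * penalty))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=1000).astype(float)   # rewards in {0, 1}
    for delta in [0.0, 0.05, 0.5]:
        # larger delta drives the robust value from the sample mean toward 0
        print(delta, chi2_dual_inf_mean(x, delta))
```

Plugging a routine like this into the `empirical_dr_bellman` sketch of Section 3, in place of the KL dual, gives the $\chi^2$ analog of the empirical operator (5.5), which is how Algorithm 2 is reused in this section with the parameters (5.6) below.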
In this context, we will apply the variance-reduced Q-learning Algorithm 2 with the following parameters:

$$k_{vr} = c_{vr} \frac{1}{(1-\gamma)^2} \log\Big(\frac{3dl_{vr}}{(1-\gamma)\eta}\Big), \quad n_{vr} = c_{vr} \frac{1}{p_\wedge^2 (1-\gamma)^2} \log(3dk_{vr}l_{vr}/\eta)^4, \quad m_l = c_{vr} \frac{4^l}{p_\wedge^2 (1-\gamma)^2} \log(3dl_{vr}/\eta)^2. \qquad (5.6)$$

Notice that, compared to the specification in (3.6), (5.6) has a $p_\wedge^{-2}$ dependence instead of $p_\wedge^{-3}$. Running Algorithm 2 with these parameters will yield an estimate $\hat q_{l_{vr}}$ of $q^*$ with an error of at most $\epsilon$ with high probability, leading to the following theorem.

Theorem 18 Assume Assumptions 2 and 3. For $\epsilon < (1-\gamma)^{-1}$, the variance-reduced DR Q-learning Algorithm 2 with parameters specified in (5.6) computes a solution $\hat q_{l_{vr}}$ s.t. $\|\hat q_{l_{vr}} - q^*\|_\infty \le \epsilon$ w.p. at least $1 - \eta$ using

$$O\left( \frac{|S||A|}{p_\wedge^2 (1-\gamma)^4 \min(1, \epsilon^2)} \right)$$

samples.

The proof of Theorem 18 closely follows the proof of Theorem 13. We first establish the analog of Proposition 10 and then apply it to obtain the statement of Proposition 11, using the parameters in (5.6), for the χ² divergence ambiguity set case. The sample complexity is then derived by summing the number of samples used in each epoch. This procedure is carried out in Appendix G.

Acknowledgments

The material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 2229012, 2312204, 2312205, 2419564; ONR 13983111, 13983263; and a 2024 New York University Center for Global Economy and Business grant.

References

Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67-83. PMLR, 09-12 Jul 2020. URL https://proceedings.mlr.press/v125/agarwal20b.html. Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325-349, jun 2013. ISSN 0885-6125. doi: 10.1007/s10994-013-5368-1. URL https://doi.org/10.1007/s10994-013-5368-1. Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8223-8234. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf. Zaiwei Chen, Sheng Zhang, Thinh T. Doan, John-Paul Clarke, and Siva Theja Maguluri. Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022. ISSN 0005-1098. doi: 10.1016/j.automatica.2022.110623. URL https://www.sciencedirect.com/science/article/pii/S0005109822004873. John C. Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378-1406, 2021. Eyal Even-Dar, Yishay Mansour, and Peter Bartlett. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003. Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4):219-354, 2018. J. I. González-Trejo, O. Hernández-Lerma, and L. F. Hoyos-Reyes. Minimax control of discrete-time stochastic systems. SIAM Journal on Control and Optimization, 41(5):1626-1659, 2002. doi: 10.1137/S0363012901383837. URL https://doi.org/10.1137/S0363012901383837. Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013. Garud Iyengar. Robust dynamic programming. Math. Oper. Res., 30:257-280, 05 2005. Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai.
Non-asymptotic analysis of biased stochastic approximation scheme, 2019. Koulik Khamaru, Eric Xia, Martin J. Wainwright, and Michael I. Jordan. Instanceoptimality in optimal value estimation: Adaptivity via variance-reduced q-learning, 2021. URL https://arxiv.org/abs/2106.14352. H. Kushner and G.G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York, 2013. ISBN 9781489926968. URL https://books.google.com/books?id=s B0GCAAAQBAJ. Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Is q-learning minimax optimal? a tight sample complexity analysis. ar Xiv preprint ar Xiv:2102.06548, 2021. Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, and Yuting Wei. Settling the sample complexity of model-based offline reinforcement learning, 2022. URL https://arxiv.org/abs/ 2204.05275. Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, and Michael I Jordan. A statistical analysis of polyak-ruppert averaged Q-learning. In International Conference on Artificial Intelligence and Statistics, pages 2207 2261. PMLR, 2023. Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, Ming-Yu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. ar Xiv preprint ar Xiv:1703.06748, 2017. Variance-Reduced Distributionally Robust Q-Learning Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust q-learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13623 13643. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/liu22a.html. Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. Xinlei Pan, Chaowei Xiao, Warren He, Shuang Yang, Jian Peng, Mingjie Sun, Jinfeng Yi, Zijiang Yang, Mingyan Liu, Bo Li, et al. Characterizing attacks on deep reinforcement learning. ar Xiv preprint ar Xiv:1907.09470, 2019. Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model, 2021. URL https://arxiv.org/abs/2112.01506. Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008. Philippe Rigollet. 18. s997: High dimensional statistics lecture notes, 2015. Alexander Shapiro. Distributionally robust modeling of optimal control. Operations Research Letters, 50(5):561 567, 2022. ISSN 0167-6377. doi: https://doi.org/10.1016/j.orl. 2022.08.002. URL https://www.sciencedirect.com/science/article/pii/ S0167637722000992. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy nski. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2014. doi: 10.1137/1.9781611973433. URL https: //epubs.siam.org/doi/abs/10.1137/1.9781611973433. Laixi Shi and Yuejie Chi. Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity, 2022. URL https://arxiv.org/abs/2208. 05767. Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Matthieu Geist, and Yuejie Chi. The curious price of distributional robustness in reinforcement learning with a generative model. Advances in Neural Information Processing Systems, 36, 2024. 
Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pages 8884 8894. PMLR, 2020. Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ bb03e43ffe34eeb242a2ee4a4f125e56-Paper.pdf. Wang, Si, Blanchet, Zhou Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Martin J. Wainwright. Variance-reduced q-learning is minimax optimal, 2019a. URL https://arxiv.org/abs/1906.04697. Martin J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp ℓ -bounds for q-learning, 2019b. URL https://arxiv.org/abs/1905.06265. Gang Wang. Finite-time error bounds of biased stochastic approximation with application to td-learning. IEEE Transactions on Signal Processing, 70:950 962, 2022. doi: 10.1109/ TSP.2021.3128723. Shengbo Wang, Jose Blanchet, and Peter Glynn. Optimal sample complexity of reinforcement learning for uniformly ergodic discounted markov decision processes, 2023a. Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. A finite sample complexity bound for distributionally robust Q-learning, 2023b. URL https://arxiv.org/abs/ 2302.13203. Shengbo Wang, Nian Si, Jose Blanchet, and Zhengyuan Zhou. On the foundation of distributionally robust reinforcement learning, 2024. Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8:279 292, 1992. Wolfram Wiesemann, Daniel Kuhn, and Bre c Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. ISSN 0364765X, 15265471. URL http://www.jstor.org/stable/23358653. Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505 2513, 2010. Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Towards theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics, 2021. URL https://arxiv.org/abs/2105.03863. Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M. Jordan, and Zhihua Zhang. Avoiding model estimation in robust Markov decision processes with a generative model, 2023. Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, and Peter Glynn. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 3331 3339. PMLR, 13 15 Apr 2021. URL https://proceedings.mlr.press/v130/zhou21d.html. Variance-Reduced Distributionally Robust Q-Learning Appendix A. The Empirical Robust Bellman Operator: KL Case In this section, we establish the bias and concentration properties of the empirical DR Bellman operator. As pointed out in the previous sections, they are the key ingredients for proving our near-optimal sample complexity bounds. Let b Tn be the empirical DR Bellman operator formed by n samples defined in Definition 5. To simplify the notation, we will omit the subscript n and only keep b T when there is no confusion. 
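To make these objects concrete, the following minimal numerical sketch evaluates the KL dual functional and one application of the n-sample empirical DR Bellman operator through its dual representation (3.2). It is illustrative only: the function names kl_dual_value and empirical_dr_bellman_kl, the dictionary format used for the empirical reward and transition frequencies, and the use of SciPy's bounded scalar optimizer are our own conventions, not part of the paper's algorithms.

import numpy as np
from scipy.optimize import minimize_scalar

def kl_dual_value(probs, values, delta):
    # sup_{alpha >= 0} { -alpha * log E_mu[exp(-u/alpha)] - alpha * delta }, the KL dual
    # functional f(mu, u, alpha); its value at alpha = 0 is the essential infimum of u.
    # Assumes delta > 0 and a discrete measure given by `probs` over `values`.
    probs = np.asarray(probs, dtype=float)
    u = np.asarray(values, dtype=float)
    u_min = u[probs > 0].min()
    span = u[probs > 0].max() - u_min
    if span == 0.0:
        return float(u_min)  # u is essentially constant under mu
    def neg_dual(alpha):
        # shift by u_min for numerical stability; the dual is shift-equivariant
        logmgf = np.log(np.dot(probs, np.exp(-(u - u_min) / alpha)))
        return -(u_min - alpha * logmgf - alpha * delta)
    # it suffices to search alpha on a compact interval of length O(span/delta)
    res = minimize_scalar(neg_dual, bounds=(1e-12, span / delta), method="bounded")
    return max(-res.fun, float(u_min))

def empirical_dr_bellman_kl(q, emp_rewards, emp_transitions, gamma, delta):
    # One application of the empirical DR Bellman operator in its KL dual form:
    # T_hat(q)(s, a) = sup_beta f(nu_hat_{s,a}, id, beta) + gamma * sup_alpha f(p_hat_{s,a}, v(q), alpha),
    # with v(q)(s) = max_a q(s, a).  emp_rewards[(s, a)] = (reward_values, frequencies) and
    # emp_transitions[(s, a)] = (next_states, frequencies) are the empirical distributions
    # built from the n samples drawn at each state-action pair (this format is an assumption).
    num_s, num_a = q.shape
    v = q.max(axis=1)
    q_new = np.empty_like(q)
    for s in range(num_s):
        for a in range(num_a):
            r_vals, r_freq = emp_rewards[(s, a)]
            next_states, p_freq = emp_transitions[(s, a)]
            q_new[s, a] = kl_dual_value(r_freq, r_vals, delta) \
                + gamma * kl_dual_value(p_freq, v[np.asarray(next_states)], delta)
    return q_new

The same skeleton is reused in the sketch given later for the chi-square case; only the dual functional changes.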
Even though the main results of this paper restrict R [0, 1] to simplify notation and align with convention in the literature, in this section, we consider R [0, rmax]. This allows our results to be directly applied to contexts beyond RL, such as supervised learning where rmax may vary. In order to employ the analysis outlined in the previous section, the empirical Bellman operators need to be contraction mappings. Indeed, we have Lemma 19 b T is a monotonic γ-contraction. Direct consequences of b T being a γ-contraction with γ < 1 are the following bounds: Lemma 20 The following two bounds hold with probability 1: b T(q)(s, a) T (q)(s, a) 2(rmax + q ); As motivated in the paper, to obtain a desired complexity dependence on problem primitives, we need to develop good bounds on the bias and the variance of the empirical Bellman operator. We define the span seminorm of the q-function as |q|span := maxs,a q(s, a) mins,a q(s, a) and |q|span (1 γ) 1. The proofs of the following propositions are in Appendix C. Proposition 21 The empirical DR Bellman operator satisfies the following variance bound: Var(b T(q)(s, a)) 104 r2 max + γ2 |q|2 span p2 n (log(e(|R| |S|))). We note that here p can be replaced by mins S min {ps,a(s ), νs,a(s )}. In particular, the variance upper bound can depend on the state and action. However, since we are interested in a minimax complexity bound, such distinction will not make a difference if we consider an example with only O(1) number of states and actions. We also have the following bound on the bias: Proposition 22 Under Assumption 1, the empirical DR Bellman Operator satisfies the following bias bound: |Bias(b T(q)(s, a))| := |E[b T(q)(s, a)] T (q)(s, a)| 4480 rmax + γ |q|span p3 n log(e|S| |R|). Wang, Si, Blanchet, Zhou Again, the dependence on p can be replaced by mins S min {ps,a(s ), νs,a(s )}. By Lemma 20, the DR empirical Bellman operator is bounded. This along with the uniform (across s, a S A) variance bound in Proposition 21 yields: Proposition 23 The empirical DR Bellman operator b T(q) T (q) 17(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η) w.p. at least 1 η, provided that n 8p 2 log (12|S||A|(|S| |R|)/η). Recall that for fixed ˆq, we haved defined the recentered DR Bellman operators b H(ˆq) := b T(ˆq) b T(q ) and H(ˆq) := T (ˆq) T (q ). (A.1) For the variance-reduced algorithm, we instead consider the bias and concentration properties of the recentered operator b H. As we will observe, the recentering allows the concentration bounds to depend on the residual error in the q-function ˆq q instead of ˆq . As a consequence, one can imagine that as the algorithm progresses, ˆql q will progressively become smaller, making b H having much better concentration properties than b T. We start with bias and variance bounds. Proposition 24 Suppose Assumption 1 is enforced. Then |E[ b H(ˆq)(s, a) H(ˆq)(s, a)]| 26 ˆq q provided n p 1 , and Var( b H(q ))(s, a) 212 ˆq q 2 p3 n log(e|S|) for all n 1. Similar to the extension from Proposition 21 to Proposition 23, we can obtain the following concentration bound for the recentered operator by extending the variance bound in Propositon 24. Proposition 25 Assume Assumption 1. Then w.p. at least 1 η H(ˆq) b H(ˆq) 8 ˆq q log(4|S|2|A|/η) provided that n 8p 2 log(4|S|2|A|/η) We emphasize that all of the propositions are O(1) when δ 0. This is due to a more thorough analysis, which allows us to remove the dual variable α (see Lemma 2) in the bounds, as explained in Lemma 37 in detail. 
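As a numerical sanity check on the scalings in Propositions 21 and 22, one can simulate the n-sample dual estimator on a toy distribution and track its bias and variance as n grows. The sketch below reuses kl_dual_value from the sketch at the beginning of this appendix; the two-point distribution, the radius delta = 0.01 (chosen small in the spirit of Assumption 1), and the replication counts are illustrative choices rather than quantities taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.0, 1.0])                   # two-point toy distribution
p_true = np.array([0.3, 0.7])
delta = 0.01                                    # small radius, in the spirit of Assumption 1
target = kl_dual_value(p_true, values, delta)   # population robust expectation

for n in (100, 400, 1600, 6400):
    estimates = []
    for _ in range(5000):
        counts = rng.multinomial(n, p_true)
        estimates.append(kl_dual_value(counts / n, values, delta))
    estimates = np.array(estimates)
    # Propositions 21 and 22 suggest variance of order 1/n and bias of order 1/n
    # (rather than 1/sqrt(n)) in this small-delta regime.
    print(n, estimates.mean() - target, estimates.var())

The distinction between 1/n and 1/sqrt(n) bias decay is exactly the point of the illustrative example that opens Appendix B below.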
Variance-Reduced Distributionally Robust Q-Learning Appendix B. Proofs for the Analysis of Algorithms: KL Case Before we prove the main results in this paper, we illustrate the necessity of requiring the nsample empirical Bellman operator to exhibit a bias of e O(n 1) to achieve an overall sample complexity of e O(ϵ 2). B.1 An Illustrative Example for the Necessity of the Bias Requirement To understand our requirement that the bias of the empirical Bellman operator (3.2) decay like O(n 1), we consider the following simple example. We use stochastic approximation to find the fixed point 0 of the equation x = γx, γ (0, 1). In the notation of (2.12), L(x) = γx. We generate biased noisy observation of x γx; i.e. Ln,k(x) = γx + Un,k + βn, where {Un, k : k = 1, . . .} are i.i.d., each requires n-sample to compute, with EUn,k = 0 and Var Un,k = σ2 n. The Un, k and βn are analogous to the noise and bias associated with an n-sample empirical Bellman operator. For simplicity, we consider a constant stepsize setting where the stepsize λ = eΘ(ϵ), ignoring the dependence on 1/(1 γ). We note that similar phenomena happen for discounted stepsize as well. We consider the stochastic approximation: Xt+1 = (1 λ)Xt + λ(γXt + Un,t+1 + βn) with X0 = 1. Then, it is not hard to see that Xt = (1 λ(1 γ))t + λ k=0 (1 λ(1 γ))k(Un,t k + βn). To achieve a statistical error of ϵ in mean squared, we need t = Ω 1 λ(1 γ) log 1 k=0 (1 λ(1 γ))k Un,t k = O(ϵ2), βnλ k=0 (1 λ(1 γ))k = O(ϵ). The variance requirement suggests λ2(1 (1 λ(1 γ))2t) λ(1 γ)(2 λ(1 γ)) σ2 n = O(ϵ2). So, with the choice of t earlier, we need σ2 n = O(ϵ2/λ) = e O(ϵ). On the other hand, the bias requirement implies βn λ(1 (1 λ(1 γ))t) λ(1 γ) = O(ϵ); i.e. βn = O(ϵ). Assume that the variance σ2 n = Θ(n 1), which is the canonical variance decay rate of Monte Carlo estimation. Then, to guarantee σ2 n = O(ϵ), we need n = Ω(1/ϵ). Yet, if Wang, Si, Blanchet, Zhou βn = Θ(n 1/2), we also need n = Ω(1/ϵ2) so as to have βn = O(ϵ). In this case, the total number of samples used until iteration t is nt = eΩ(ϵ 3). On the other hand, if we managed to show that βn = eΘ(n 1), we see that we only need n = eΩ(1/ϵ), and hence nt = eΩ(ϵ 2), the correct rate. Having explained the necessity of e O(n 1) bias and equipped with the key bias and concentration bounds, we are ready to carry out the proofs of the worst-case sample complexity bounds for Algorithm 1 and 2. We will follow the proof outlined in Section 3.3. B.2 The Distributionally Robust Q-learning Algorithm 1 B.2.1 Proof of Proposition 7 Proof Recall that the update rule for Algorithm 1 can be written as qk+1 q = (1 λk)(qk q ) + λk [Tk+1(qk) Tk+1(q ) + Tk+1(q ) T (q )] = (1 λk)(qk q ) + λk [Hk+1(qk) + Wk+1] (B.1) where we define Wk+1 := Tk+1(q ) T (q ). Since Tk+1(q ) is a i.i.d. sequence of estimators to T (q ), β := Bias(Tk(q )) = E[Tk(q )] T (q ) is independent of k. We can write Wk+1 = β + Uk+1 where Uk+1 := Tk+1(q ) T (q ) β has zero mean. Next, we would like to apply Proposition 15. Define the auxiliary sequences Pk+1 = (1 λk)Pk + λk Wk+1 (B.2) Qk+1 = (1 λk)Qk + λk Uk+1 (B.3) with Q0 = P0 = 0. Notice that since {Uk, k 1} has mean zero, E[Qk] = 0 for all k 0. It is easier to analyze the process {Qk, k 0} than {Pk, k 0} which correspond to {pk} in Proposition 15. To use {Qk, k 0}, we first show that Pk = Qk + β. We prove this by induction. The base case P1 = λ0W1 = Q1 + β as λ0 = 1. Next we check the induction step. 
By the iterative updates (B.2) and (B.3) and the induction hypothesis, we have that Pk+1 = (1 λk)Pk + λk Wk+1 = (1 λk)(Qk + β) + λk(Uk+1 + β) = Qk+1 + β. (B.4) Variance-Reduced Distributionally Robust Q-Learning By Algorithm 1, q1 = 0. We have that by Lemma 20, q1 q (1 γ) 1. Therefore, by Proposition 15 1 λ1(1 γ) + + Ql,k+1 + β . + Qk+1 + 2 β (B.5) w.p.1, where we used kλk = 1/(1/k + (1 γ)) 1/(1 γ). Next, we bound the sequence {Qk, k 1}. Lemma 26 The {Qk, k 1} sequence satisfies P( Qk+1 > t) 2|S||A| exp t2 λk(8γ(1 γ) 1t + 4 σ2(q ) ) where σ2(q )(s, a) = Var(Tk(q )(s, a)). The proof of Lemma 26 is in Appendix B.4. By applying Lemma 26, we have that 1 γ log (2|S||A|/η) + 2 p log (2|S||A|/η) log (2|S||A|/η) w.p. at least 1 η. To establish high probability bound using (B.5), we also need the following properties of the stepsize: Lemma 27 (Proof of Corollary 3,Wainwright (2019b)) The following inequalities hold: λj 2 (1 γ) λk ; j=1 λj log(1 + (1 γ)k) Wang, Si, Blanchet, Zhou We have that by Lemma 27 and the union bound, j=1 Qj + Qk0+1 λk0 log(1 + (1 γ)k0) (1 γ)2 + σ(q ) p log (4|S||A|k0/η) 1 γ + 2 σ(q ) p log (4|S||A|k0/η) . λk0 log(1 + (1 γ)k0) (1 γ)2 + σ(q ) p log (4|S||A|k0/η) 16 1 (1 γ)3k0 + 20 p (1 γ)5/2 n0k0 log (4dk0/η)2 w.p. at least 1 η, where we utilize Proposition 21 to bound σ(q ) . We use Proposition 22 to bound β. Then, from (B.5) we conclude that there exists constant c s.t. qk0+1 q c 1 (1 γ)3k0 + 1 p (1 γ)5/2 n0k0 log (4dk0/η)2 + c rmax + γ |q|span (1 γ)p3 n0 log(e|S| |R|) c 1 (1 γ)3k0 + 1 p (1 γ)5/2 n0k0 + 1 p3 (1 γ)2n0 log (4dk0/η)2 where c can change from line to line. Finally, note that for C1 1, C2 e, log(C1C2) = log(C1) + log(C2) C1 log(C2). So, log (4dk0/η)2 16 9 log (3dk0/η)2. This completes the proof. B.3 The Variance-Reduced Distributionally Robust Q-learning Algorithm 2 B.3.1 Proof of Proposition 11 Proof Recall that Fl be the σ-field generated by the random samples used until the end of epoch l and El[ ] = E[ |Fl], Pl[ ] = P[ |Fl], and Varl( ) = Var( |Fl). In the following proof, the probabilities are w.r.t. Pl 1( ). Recall that Hl,k = Tl,k(q) Tl,k(q ) and e Hl = e Tl,k(q) e Tl,k(q ). From the variance-reduced DR-RL (Algorithm 2) update rule, we have at epoch l, ql,k+1 q = (1 λk)(ql,k q ) + λk h Tl,k+1(ql,k) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) i = (1 λk)(ql,k q ) + λk h Hl,k+1(ql,k) + Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) i = (1 λk)(ql,k q ) + λk [Hl,k+1(ql,k) + Wl,k+1] (B.6) Variance-Reduced Distributionally Robust Q-Learning where we define Wl,k+1 = Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ). Notice that only the first two terms is dependent on k. We can write Wl,k+1 = Tl,k+1(q ) Tl,k+1(ˆql 1) + e Tl(ˆql 1) T (q ) = Hl,k+1(ˆql 1) + e Hl(ˆql 1) + e Tl(q ) T (q ) = [Hl,k+1(ˆql 1) El 1[Hl,k+1(ˆql 1)]] + e Hl(ˆql 1) + e Tl(q ) T (q ) El 1[Hl,k+1(ˆql 1)] = Ul,k+1 + Dl (B.7) where Ul,k+1 := Hl,k+1(ˆql 1) El 1[Hl,k+1(ˆql 1)] (B.8) Dl := e Hl(ˆql 1) + e Tl(q ) T (q ) El 1[Hl,k+1(ˆql 1)]. (B.9) Note that El 1[Hl,k+1(ˆql 1)] is constant in k. We will apply Proposition 15. Define the auxiliary sequences Pl,k+1 = (1 λk)Pl,k + λk Wl,k+1 (B.10) Ql,k+1 = (1 λk)Ql,k + λk( Ul,k+1) (B.11) with Ql,0 = Pl,0 = 0. Note that Ul,k+1 under El 1 are i.i.d. and has mean 0. So El 1[Ql,k] = 0 for any k 0. It is easier to analyze the process {Ql,k, k 0} than {Pl,k, k 0} which correspond to {pk} in Proposition 15. As in the DR Q-learning case (Equation (B.4)), the same induction argument implies that Pl,k = Ql,k + Dl. By the algorithm, ql,1 = ˆql 1, we have that ql,1 q b. 
Therefore, by Proposition 15 ql,k+1 q λk j=1 Ql,j + γk Dl + Ql,k+1 + Dl . + Ql,k+1 + 2 Dl w.p.1, where we used kλk = 1/(1/k + (1 γ)) 1/(1 γ). Next, we prove bounds for { Ql,k , k 0} and Dl . Lemma 28 Under measure Pl 1( ), Pl 1( Ql,j > t) 2|S||A| exp 4λj(γ ζl 1 t + σ2 l 1 ) where ζl 1 = ˆql 1 q and σ2 l 1(s, a) = Varl 1(Hl,k(ˆql 1)(s, a)). Wang, Si, Blanchet, Zhou The proof of this Lemma is deferred to Appendix B.4. Apply Lemma 28, we have that Ql,j 4λj ζl 1 log (2|S||A|/η) + 2 p log (2|S||A|/η). w.p. at least 1 η. Recall the definition of σ2 l 1 and Proposition 24. We have that by Lemma 27 and the union bound, j=1 Ql,j + Ql,kvr+1 λkvr log(1 + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) + 4λkvr ζl 1 + 2 σl 1 p λkvr log (4|S||A|kvr/η) . λkvr log(e + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) b (1 γ)2kvr + 26b p3/2 (1 γ)3/2 nvrkvr log (4|S||A|kvr/η)2 w.p. at least 1 η. For Dl, recall the definition in (B.9). We add and subtract H(ˆql 1) and write: Dl = e Hl(ˆql 1) H(ˆql 1) + e Tl(q ) T (q ) + (H(ˆql 1) El 1[Hl,k+1(ˆql 1)]) = e Hl(ˆql 1) H(ˆql 1) + e Tl(q ) T (q ) + El 1 [H(ˆql 1) Hl,k+1(ˆql 1)] . (B.14) Recall Propositions 23, 24, and 25, we have that by union bound, Dl c rmax + |q |span + ˆql 1 q log (12d/η) + El 1 (H(ˆql 1) Hl,k+1(ˆql 1)) c rmax + |q |span + ˆql 1 q log (12d/η) + c ˆql 1 q w.p. at least 1 η, provided that c is a large enough constant ml 8p 2 log(24d/η), and nvr p 1 . Variance-Reduced Distributionally Robust Q-Learning Finally, recall that ˆql = ˆql,kvr+1 and rmax = 1, combine (B.12), (B.13), and (B.15) we conclude that there exists absolute constant c s.t. b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr log (8dkvr/η)2 + c rmax + |q |span + b p3/2 (1 γ) ml log(24d/η) + c b p3/2 (1 γ) nvr b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr + b p3/2 (1 γ) nvr log (8dkvr/η)2 p3/2 (1 γ)2 ml w.p. at least 1 η, where we used |q |span 2 q 2/(1 γ) and b 1/(1 γ), c can change from line to line. Finally, note that for C1 1, C2 e, log(C1C2) = log(C1)+log(C2) C1 log(C2). This completes the proof of Proposition 10. B.3.2 Proof of Proposition 11 Proof By the definition of conditional probability n ˆql q 2 l(1 γ) 1o! ˆql q 2 l(1 γ) 1 ˆqn q 2 n(1 γ) 1 ! ˆql q 2 l(1 γ) 1 ˆqn q 2 n(1 γ) 1 ! where we note that ˆq0 = 0 and Lemma 20 implies that ˆq0 q (1 γ) 1 w.p.1, so the conditioned intersection and product can start from s = 1. Let ˆqs q 2 s(1 γ) 1 . We analyze the probability for l 1 P ˆql q 2 l(1 γ) 1 Al 1 = 1 P(Al 1)E h 1 n ˆql q 2 l(1 γ) 1o 1Al 1 i = 1 P(Al 1)E h 1 n ˆql 1 q 2 (l 1)(1 γ) 1o E h 1 n ˆql q 2 l(1 γ) 1o Fl 1 i 1Al 1 i = 1 P(Al 1)E h 1 n ˆql 1 q 2 (l 1)(1 γ) 1o Pl 1 1 n ˆql q 2 l(1 γ) 1o 1Al 1 i , Wang, Si, Blanchet, Zhou By Proposition 10, we recall conditioned on ˆql 1 q 2 (l 1)(1 γ) 1 =: b b (1 γ)2kvr + b p3/2 (1 γ)3/2 nvrkvr + b p3/2 (1 γ) nvr log (3dkvr/η)2 p3/2 (1 γ)2 ml w.p. at least 1 η. Therefore, by the parameter choice (3.6), we have that for sufficiently large cvr and for events ω ˆql 1 q 2 (l 1)(1 γ) 1 , Pl 1 1 n ˆql q 2 l(1 γ) 1o (ω) 1 η lvr ; (B.16) 1 n ˆql 1 q 2 (l 1)(1 γ) 1o Pl 1 1 n ˆql q 2 l(1 γ) 1o 1 η Therefore, we have P ˆql q 2 l(1 γ) 1 Al 1 1 η which further gives us n ˆql q 2 l(1 γ) 1o! lvr . (B.17) To finish the proof, we consider the function e(η) := 1 η Clearly, e(η) is C2 with derivatives e (η) = 1 η l 1 , e (η) = l 1 Note that e 0 if l 1. So e (η) is non-decreasing. Hence for all η 0, e (η) e (0). This implies that e(η) = e(0) + Z η Assumption ϵ < (1 γ) 1 implies that lvr 1. 
Therefore, we plug in this to (B.17) and conclude that P ˆqlvr q 2 lvr(1 γ) 1 P n ˆql q 2 l(1 γ) 1o! Variance-Reduced Distributionally Robust Q-Learning B.4 Proof of Lemma 26 and 28 To prove these two lemma, we introduce the following result: Lemma 29 (Wainwright (2019b), Lemma 2) Let {Yk R, k 1} be a sequence of i.i.d. zero mean ζ-bounded r.v.s with variance σ2. Define {Xk, k 0} by the recursion X0 = 0 Xk+1 = (1 λk)Xk + λk Yk+1, where λk = 1/(1 + (1 γ)k). Then E [exp(t Xk+1)] exp t2σ2λk for all |t| < 1/(ζλk). We first prove Lemma 28. Proof We use the same steps. Recall (B.11), where Ul,k is an i.i.d. sequence under El 1 given by (B.8). By Lemma 20 Ul,k Tl,k(ˆql 1) Tl,k(q ) + El 1[Tl,k(ˆql 1) Tl,k(q )] 2γ ˆql 1 q = 2γ ζl 1 . Notice that by construction, Tl,k(q )(s, a) are independent across s S, a A. Therefore, by Lemma 29, El 1 exp(λ Ql,k+1 ) = El 1 sup (s,a) S A max {exp(λQl,k+1(s, a)), exp( λQl,k+1(s, a))} (s,a) S A El 1 exp(λQl,k+1(s, a)) + E exp( λQl,k+1(s, a)) 2|S||A| exp λ2 σ2 l 1 λk 1 2γ ζl 1 λk|λ| for any λ < 1/(2γ ζl 1 λk). Therefore, by the Chernoffbound Pl 1( Ql,k+1 > t) 2|S||A| exp λ2 σ2 l 1 λk 1 2γ ζl 1 λk|λ| for any λ (0, 1/(2γ ζl 1 λk)). Choose λ = t 2γ ζl 1 λkt + 2 σ2 l 1 λk , we conclude that Pl 1( Ql,k+1 > t) 2|S||A| exp 4λk(γ ζl 1 t + σ2 l 1 ) Wang, Si, Blanchet, Zhou Next, we prove Lemma 26. Notice that we only need to modify the bounds on ζ and σ2. Proof Recall that {Ql,k, k 0} is given by recursive relation (B.3), where Uk has mean 0. By Lemma 20 Uk 2 Tk+1(q ) 2rmax + 2γ q and Var(Tk+1(q )(s, a)) = σ2(q )(s, a). Therefore, using the same arguments, we conclude that P( Qk+1 > t) 2|S||A| exp t2 λk(8γ(1 γ) 1t + 4 σ2(q ) ) Appendix C. Proofs of Properties of the Empirical Bellman Operator: KL Case C.1 Glossary of Notations and Basic Properties Before we present our proofs, we first define some technical notations. For finite discrete measurable space (Y, 2Y ), fixed u m2Y , and signed measure ν M (Y, 2Y ), let y Y ν(y)u(y) denotes the integral. For generic probability measure µ on (Y, 2Y ) and random variable u : Y R, let w = w(α) = e u/α; define the KL dual functional under the reference measure µ f(µ, u, α) := α log µ[e u/α] αδ. (C.1) We clarify that f(µ, u, 0) = limα 0 f(µ, u, α) = ess infµ u. We present two basic properties of the dual functional f for which the proofs are deferred to Appendix E. Lemma 30 For any ν µ, the dual functional is bounded u L (µ) sup α 0 f(ν, u, α) u L (µ) Lemma 31 The following bound holds w.p.1.: sup α 0 f(µ, u, α) sup α 0 f(µn, u, α) 2 |u|span , where |u|span = maxs S u(s) mins S u(s). Variance-Reduced Distributionally Robust Q-Learning Let µn be the empirical measure form by n i.i.d. samples drawn from µ. In the following development, we need to consider the perturbation analysis on the line of center measures {tµ + (1 t)µn : t [0, 1]}. So, it is convenient to define for µs,a = ps,a, νs,a µs,a,n(t) = tµs,a + (1 t)µs,a,n ms,a,n = µs,a µs,a,n gs,a,n(t, α) = f(µs,a,n(t), u, α). Note that we will not explicitly indicate the dependence of u for the function g, because it will always be the identity function when µ = ν and the value function when µ = p. We will also drop the dependence on (s, a) when clear. Our analysis involves many derivative computations. We use three type of derivative notations, two of which is explained here and the Radon-Nikodym derivative is introduced in the following paragraph. 
For a smooth function of multiple arguments g(t, αs,t) where αs,t could be dependent on parameters s, t, denote the partial derivatives by t, α; i.e. tg(t, αs,t) := lim ϵ 0 g(t + ϵ, αs,t) ϵ , αg(t, αs,t) := lim ϵ 0 g(t, αs,t + ϵ) On the other hand, when αs,t is also smooth in t, denote the total derivative w.r.t. t by dt; i.e. dtg(t, αs,t) := lim ϵ 0 g(t + ϵ, αs,t+ϵ) ϵ = tg(t, αs,t) + αg(t, αs,t) tαs,t The intuition behind our ability to remove the 1/δ dependence stems from the mutual absolute continuity (also known as equivalence) between the empirical worst-case transition kernel and reward distribution and the true ones. This holds if δ is sufficiently small and the empirical centers of the uncertainty sets are equivalent to the true centers. As a result, our techniques rely on the absolute continuity between the empirical measure µn and µ. We say that µ is absolute continuous w.r.t. another measure ν, denoted by ν µ, if for A 2Y , ν(A) = 0 implies that µ(A) = 0. We say that µ is equivalent to ν, denoted by µ ν, if ν µ and ν µ. Note that the empirical measure µn always satisfies µn µ w.p.1. For absolutely continuous measures ν µ, the Radon-Nikodym derivative is well defined: dν dµ(y) := ν(s) µ(s)1 {µ(s) = 0} . The proof strategy we will implement is to consider separately the good events on which µn and µ are close (so that we have µn µ) and the bad events where the empirical measure is not close to the reference model. This motivates us to define for p > 0 Ωn,p(µ) = ω : sup y |µn(ω)(y) µ(y)| p . (C.3) Then, in the DR-RL setting, define s,a Ωn,p(ps,a) \ s,a Ωn,p(νs,a) = ω : sup s,a sup s |ps,a,n(ω)(s ) ps,a(s )| p, sup s,a sup r |νs,a,n(ω)(r) νs,a(r)| p . Wang, Si, Blanchet, Zhou We frequently make use of the minimum support probability of certain measures such as µ, µs,a. This is denoted by µ := min {µ(s) : µ(s) > 0}, µs,a, := min {µ(s) : µ(s) > 0}. It is easy to see that the following lemma holds: Lemma 32 Suppose p < µ , then on Ωn,p(µ), µ µn and infy:µ(y)>0 µn(y) > µ p. Moreover, the empirical measures are satisfies the following concentrations: Lemma 33 Let µ be any probability measure on finite measure space (Y, 2Y ). Then, for any k = 1, 2, 3, , . . . P(Ωn,p(µ)c) 1 p2knk log(e2k 1|Y |)k. In particular, if we choose k = 1, P(Ωn,p(µ)c) 1 p2n log(e|Y |). This lemma follows from the subgaussian property of empirical measures on finite measure space; i.e. Lemma 39 holds. For absolutely continuous empirical measures, we also have the following lemma, again as a consequence of subgaussianity and hence Lemma 39. Lemma 34 Let ξn be another random measure on (Y, 2Y ). Let (Ω, F, P) be the probability space that supports ξn, µn. Suppose that µn ξn, µ ξn, and ξn(y) > p for all y s.t. ξn(y) = 0. Then, for all A F, the following bounds hold: L (ξn) 1A 1 p n L (ξn) 1A 1 p2n log(e|Y |). The proofs of these results are deferred to Appendix D. C.2 Proof of Proposition 21 Proof By definition, we have b T(q)(s, a) T (q)(s, a) sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)| + γ sup α 0 |f(ps,a,n, v(q), α) f(ps,a, v(q), α)| . (C.4) We will drop the s, a dependence for simplicity. This motivates us to look at the dual functional applied to generic measureable u : Y R. Let s Define w = e u/α. |f(µn, u, α) f(µ, u, α)| = |gn(0, α) gn(1, α)| = tgn(t, α) t=τ = α mn[w] µn(τ)[w] Variance-Reduced Distributionally Robust Q-Learning for some random variable τ (0, 1). To bound this, we introduce the following lemma for which the proof is deferred to E. 
Lemma 35 Let m = µ1 µ2 with µ1, µ2 µ and w = e u/α, we have that µ[w]2 3j inf κ R u κ j L (µ) To apply Lemma 35, we consider p 1 2µ . Then, By Lemma 32, on Ωn,p(µ), µn(t) µ for all t [0, 1]. So, on Ωn,p(µ), we have by Lemma 35 sup α 0 |f(µn, u, α) f(µ, u, α)| sup α 0 α mn[w] µn(τ)[w] inf κ R 3 u κ L (µ) = 3 |u|span Therefore, by partitioning Ωinto Ωn,p(µ)c and Ωn,p(µ), we bound E sup α 0 |f(µn, u, α) f(µ, u, α)|2 9 |u|2 span E dmn dµn(τ) L (µ) 1Ωn,p(µ) + 4 |u|2 span P(Ωn,p(µ)c) (C.5) where on Ωn,p(µ)c, we use the bound in Lemma 31. By Lemma 32, on Ωn,p(µ) for y s.t. µ(y) > 0, µn(y) µ p 1 2µ p. Since µ(y) > 0 implies that µ(y) µ , we have that µn(t)(y) p for any t [0, 1]. Therefore, Lemma 34 applies. On the other hand, Lemma 33 also applies and is used to bound P(Ωn,p(µ)c). Therefore, continue from (C.5), we have E sup α 0 |f(µn, u, α) f(µ, u, α)|2 13 |u|2 span p2n log(e|Y |). We conclude that choosing p = 1 2 min {νs,a, , ps,a, }, Var(b T(q)(s, a)) 2E sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)|2 + 2γ2E sup α 0 |f(ps,a,n, v(q), α) f(ps,a, v(q), α)|2 26 id 2 νs,a,span p2n log(e|R|) + 26γ2 v(q) 2 ps,a,span p2n log(e|S|) 26 r2 max + γ2 |q|2 span p2n log(e(|R| |S|)). Plugging in p = 1 2p , we obtain the claimed inequality in Proposition 21. Wang, Si, Blanchet, Zhou C.3 Proof of Proposition 22 Proof We consider for generic u and measure µ on (Y, 2Y ). We assume δ < 1 24µ , which will be guaranteed by Assumption 1. Since α f(µ, u, α) is continuous, and from Si et al. (2020) it is sufficient to optimize the Lagrange multiplier on compact set [0, δ 1 u L (µ)], there is an optimal Lagrange multiplier α n(t) that achieves supα 0 f(µn(t), u, α). The bias of the dual functional Bias(f(µn, u, α n)) = E(gn(0, α n(0)) gn(1, α ))1Ωn,p(µ) + E (gn(0, α n(0)) gn(1, α )) 1Ωn,p(µ)c =: E1 + E2. 4µ . Notice that by assumption, Then, the following Lemma 36 holds. Lemma 36 (Differentiability of the Dual Functional) Suppose δ < log(1 1 2µ ) and p 1 On Ωn,p(µ), t supα 0 gn(t, α) is C2((0, 1)) C[0, 1]. α = 0 iffu is µ essentially constant. So, α n(t) 0 and supα 0 gn(t, α) µ[u] If α > 0, then α n(t) > 0 for all t [0, 1] with dt sup α 0 gn(t, α) = α n(t) mn[w] and dtdt sup α 0 gn(t, α) = α n(t) mn[w]2 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 The proof of this result is deferred to Appendix E. So, on Ωn,p(µ), t gn(t, α n(t)) is C2(0, 1) C[0, 1]. By the (second order) mean value theorem, there exists random variable τ [0, 1] s.t. E1 = E dtgn(t, α n(t)) t=1 + 1 2dtdtgn(t, α n(t)) t=τ = E α mn[w] 2dtdtgn(t, α n(t)) t=τ µ[w] Eα mn[w] µ[w] 1Ωn,p(µ)c + E 1 2dtdtgn(t, α n(t)) t=τ 1Ωn,p(µ) = 0 E1,1 + E1,2 Variance-Reduced Distributionally Robust Q-Learning where Emn[u] = 0 for any function u. Recall Lemma 35. Since naturally µ µ, µn, |E1,1| 3 |u|span E dmn L (µ) 1Ωn,p(µ)c µ P(Ωn,p(µ)c) 3 |u|span µ p2n log(e|Y |), where we use Lemma 33 for the last inequality. On Ωn,p(µ), by Lemma 36, for all t [0, 1] either α n(t) = 0 or α n(t) > 0. In the first case, we have trivially E1,2 = 0. In the second case, dtdtgn(t, α n(t)) = dt tgn(t, α n(t)) = α n(t) mn[w]2 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] µn(t)[uw]mn[w] α n(t)µn(t)[w]2 Next, we prove a finer characteristic when δ goes to 0. 
We need the following Lemma: Lemma 37 On Ωn,p(µ) with p < µ sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] 2 136 inf κ R u κ 3 L (µ) Applying Lemma 35 and 37, we have that on Ωn,p(µ) |dtdtgn(t, α n(t))|1Ωn,p(µ) 3 inf κ R u κ L (µ) L (µ) + 136 infκ R u κ 3 L (µ) Varµ n(t)(u) L (µ) + 136 |u|span u µ n[u] 2 L (µ n) u µ n(t)[u] 2 L2(µ n) To bound the second ratio in the last inequality, we introduce the following lemma, whose proof is deferred to Appendix E as well. Lemma 38 Suppose δ 1 24µ and p 1 4µ . When the optimal Lagrange multiplier α > 0, worst-case measures µ n(t) = µn(t)[w ]/µn(t)[w] satisfies µ n(t)(y) 1 2µ on Ωn,p(µ). 24p , by Lemma 38, for some y s.t. µ n(t)(y ) > 0, u µ n[u] 2 L (µ n) u µ n(t)[u] 2 L2(µ n) = |u(y ) µ n[u]|2 µ n(t)(y )|u(y ) µ n[u]|2 + P y =y µ n(t)(y)|u(y) µ n[u]|2 |u(y ) µ n[u]|2 µ n(t)(y )|u(y ) µ n[u]|2 Wang, Si, Blanchet, Zhou As in the proof of Propositon 21, under the choice p 1 4µ , Lemma 34 applies. Therefore, |E1,2| 275 |u|span µ p2n log(e|Y |) For E2 in (C.6), we use Lemma 31 and previous bound on P (Ωn,p(µ)c) |E2| E |f(µn, u, α n(0)) f(µ, u, α )| 1Ωn,p(µ)c 2 |u|span P (Ωn,p(µ)c) p2n log(e|Y |) |u|span µ p2n log(e|Y |) Therefore, going back to (C.6), we have Bias sup α 0 f(µn, u, α) 280 |u|span µ p2n log(e|Y |). Apply this to the empirical Bellman operator with p = 1 4 min {ps,a, , µs,a, } and Assumption 1 holds. So, δ < 1 24p implies that δ < 1 24 min {ps,a, , µs,a, }. Therefore, we have |Bias(b T(q)(s, a))| = sup β 0 f(νs,a,n, id, β) + γBias sup α 0 f(ps,a,n, v(q), α) 4480 id νs,a,span + γ |v(q)|span p3 n log(e|S| |R|) 4480 rmax + γ |q|span p3 n log(e|S| |R|). C.4 Proof of Proposition 23 Proof We recall the bound (C.4) and the subsequent result sup α 0 |f(µn, u, α) f(µ, u, α)| 3 |u|span Again, we consider p 1 2µ . Also recall the definition (C.3) of Ωn,p(µ). By Lemma 32, on Ωn,p(µ) for y s.t. µ(y) > 0, µn(y) µ p 1 2µ p. Since µ(y) > 0 implies that Variance-Reduced Distributionally Robust Q-Learning µ(y) µ , we have that µn(t)(y) p for any t [0, 1]. Therefore, we have P sup α 0 |f(µn, u, α) f(µ, u, α)| > t P(Ωn,p(µ)c) + P L (µ) > t, Ωn,p(µ) P sup y |µn(y) µ(y)| > p + P 3 |u|span p sup y |mn(y)| > t exp( 2p2n) + exp 9 |u|2 span exp( 2p2n) + exp 9 |u|2 span where we used Hoeffding s inequality and union bound. Therefore, going back to the DR Bellman operator setting, we choose p = 1 4p and by union bound P( b T(q) T (q) > t) sup s,a sup β 0 |f(νs,a,n, id, β) f(νs,a, id, β)| > t + P sup s,a sup α 0 |f(ps,a,n, v(q), β) f(ps,a, v(q), β)| > t 2(|S|2|A| + |S||A||R|) exp p2 n + 2|S||A||R| exp p2 t2n 288r2max + 2|S|2|A| exp p2 t2n 288γ2 |q|2 span We set each of the three terms to be less than η/3 and find that it suffices to have p2 log (12|S||A|(|S| |R|)/η) t 17(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η). This implies the statement of the proposition. C.5 Proof of Proposition 24 Proof We define V := H(ˆq) b H(ˆq) = (T (ˆq) T (q )) (b T(ˆq) b T(q )). (C.9) Wang, Si, Blanchet, Zhou Recall the dual formulation T (q)(s, a) = sup β 0 f(νs,a, id, β) + γ sup α 0 f(ps,a, v(q), α). The first term is not dependent on q, hence canceled in V . We have that |V (s, a)| = γ sup α 0 f(ps,a, v(ˆq), α) sup α 0 f(ps,a, v(q ), α) sup α 0 f(ps,a,n, v(ˆq), α) + sup α 0 f(ps,a,n, v(q ), α) Note that if v(ˆq) and v(q ) are both µ essentially constant, then V = 0, and the claim of Proposition 24 holds trivially. Therefore, moving forward, we consider the case at least one of v(ˆq) and v(q ) is not µ essentially constant. 
To analyze V while keeping the consistency of our notations, we define vt = tv(ˆq)+(1 t)v(q ), µ = ps,a, µn = ps,a,n, m = µ µn, and µ(t) = tµ (1 t)µn. Because Assumption 1 is imposed, we have that δ < 1 24µ . We consider the parametric function for s, t [0, 1] h(s, t) := sup α 0 f(µ(t), vs, α) = f(µ(t), vs, α s,t). (C.10) To motivates our analysis, we assume that h(s, ) is C1(0, 1) C[0, 1] and th( , t) is C1(0, 1) C[0, 1] as well. Then the fundamental theorem of calculus implies that |V (s, a)| = γ |h(1, 0) h(0, 0) h(1, 1) + h(0, 1)| 0 th(1, t)dt + Z 1 0 th(0, t)dt 0 s th(s, t)dsdt 0 | s th(s, t)| dsdt where s th(s, t) is easier to analyze. We proceed to show that (C.11) is valid (with some minor modification) on Ωn,p(µ). As in the proof of Proposition 22, Lemma 36 applies when we consider p 1 4µ . So, for p 1 4µ , on Ωn,p(µ), h(s, ) is C2(0, 1) C[0, 1] with derivative th(s, t) = dt sup α 0 f(µ(t), vs, α) = α s,t m[ws] µ(t)[ws]. Here, by Lemma 36, α s,t is the unique optimal Lagrange multiplier, and ws = e vs/α s,t. Next, we show that for every fixed t, there is a function Ds th s.t. Z 1 0 Ds th(s, t)ds = th(1, t) th(0, t). (C.12) We note that by Lemma 36, α s,t = 0 if and only if vs is essentially constant. This can only happen at one particular s = s . Otherwise, if there are some 0 s1 < s2 1, Variance-Reduced Distributionally Robust Q-Learning s1v(ˆq) + (1 s1)v(q ) = c1e and s2v(ˆq) + (1 s2)v(q ) = c2e w.p.1 under µ, where e is the vector of all ones, then for all a, b 0, a + b v(ˆq) + 1 as1 + bs2 v(q ) = (ac1 + bc2)e. This would imply that v(ˆq) and v(q ) are both essentially constant. We consider two cases: Case 1: vs is never essentially constant for all s [0, 1]. In this case, α s,t > 0 for all s [0, 1]. Note that s e vs/α is clearly C for α > 0. So, on Ωn,p(µ) if α s,t is C1(0, 1) in s, then s th(s, t) is C1(0, 1) C[0, 1]. We show differentiability of s α s,t by invoking the implicit function theorem as in the proof of Lemma 36. When α s,t > 0, as shown in Lemma 36, it is the unique solution to the optimality condition α s,t( log µ(t)[ws] δ) µ(t)[vsws] µ(t)[ws] =: F(s, α s,t) = 0. (C.13) Define the optimal measure µ (s, t)[ ] = µ(t)[ws ] Since for all fixed t, α s,t > 0 on (0, 1) and F is infinite smooth. The implicit function theorem then implies that α s,t is C1(0, 1) C[0, 1] and s th(s, t) is C1(0, 1) C[0, 1]. We compute the derivative s th in this case. Let v = v(ˆq) v(q ). Rewrite the optimality equation as α s,t( log µ(t)[ws] δ) = µ(t)[vsws] Differentiate w.r.t. s on both side LHS = sα s,t( log µ(t)[ws] δ) + µ(t)[ vws] µ(t)[ws] sα s,t µ(t)[vsws] α s,tµ(t)[ws] = sα s,t( log µ(t)[ws] δ) + µ (s, t)[ v] sα s,tµ (s, t)[vs/α s,t] RHS = µ(t)[ vws]µ(t)[vsws] α s,tµ(t)[ws]2 + µ(t)[ vws] µ(t)[ws] µ(t)[ vvsws] α s,tµ(t)[ws] µ(t)[vsws]2 (α s,t)2µ(t)[ws]2 + µ(t)[v2 sws] (α s,t)2µ(t)[ws] = Covµ (s,t) v, vs/α s,t + µ (s, t)[ v] + sα s,t Varµ (s,t)(vs/α s,t) From the optimality equation and the LHS and RHS derivatives, we have sα s,t log µ(t)[ws] + δ + µ (s, t)[vs/α s,t] + Varµ (s,t)(vs/α s,t) = Covµ (s,t)( v, vs/α s,t) sα s,t Varµ (s,t)(vs/α s,t) = Covµ (s,t)( v, vs/α s,t) sα s,t = Covµ (s,t)( v, vs/α s,t) Varµ (s,t)(vs/α s,t) . Wang, Si, Blanchet, Zhou Therefore, when α s,t > 0, s th(s, t) = s α s,tm[ws] µ(t)[ws] = m[ws]µ(t)[ vws] µ(t)[ws]2 + m[ vws] µ(t)[ws] sα s,t m[ws] µ(t)[ws] m[vsws] α s,tµ(t)[ws] + m[ws]µ(t)[vsws] α s,tµ(t)[ws]2 =: D1 + D2 + D3 + D4. Case 2: There is a unique s [0, 1] s.t. vs is essentially constant. 
In this case, the previous argument implies that s th(s, t) is C1(0, s ), C1(s , 1), and continuous at 0, 1. The derivative is also given by (C.15). We need to show the existence of Ds th that satisfy (C.12). Observe that if s th(s, t) is continuous at s , then applying the fundamental theorem of calculus on the interval [0, s ] and [s , 1] separately, we will have that th(1, t) th(0, t) = Z s 0 s th(s, t)ds + Z 1 s s th(s, t)ds. Hence, taking Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0 will suffice to produce (C.12). It is left to check the continuity at s . As analyzed in (E.1), lim α 0 α s,t m[ws] µ(t)[ws] = 0. So we can conclude the continuity of s th(s, t) at s , if we can show that when vs ce for some constant c, then α s,t 0. To prove this, we assume to the contrary that there is a subsequential limit α sn,t β > 0 for some sequence sn s . But since F defined (C.13) in s and α when α > 0, we must have that 0 = lim n F(sn, α sn,t) = β( log µ(t)[e ce/β] δ) c = δβ raising a contradiction. This implies that s th(s, t) is continuous at s , and hence (C.12) holds with Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0. Therefore, we have shown that the bound (C.11) is valid on Ωn,p(µ) with p 1 4µ . Now we bound s th(s, t) using the decomposition (C.15). |D1| and |D2| can be bounded using the change of measure techniques: on Ωn,p(µ) |D1| µ(t)[ dm dµ(t)ws]µ(t)[| v|ws] Variance-Reduced Distributionally Robust Q-Learning |D2| µ(t)[ dm dµ(t)| v|ws] To bound |D3| and |D4|, recall sα s,t from (C.14). |D3| = sα s,t m[ws] µ(t)[ws] Covµ (s,t)( v, vs/α s,t) Varµ (s,t)(vs/α s,t) m[ws] µ(t)[ws] Covµ (s,t)( v, vs) Varµ (s,t)(vs) α s,t m[ws] µ(t)[ws] Covµ (s,t)( v, vs) Varµ (s,t)(vs) inf κ R vs κ L (µ) Varµ (s,t)( v) vs µ (s, t)[vs] L (µ) q Varµ (s,t)(vs) 3 v vs µ (s, t)[vs] L (µ (s,t)) vs µ (s, t)[vs] L2(µ (s,t)) where (i) used Lemma 35 with j = 1. Since α s,t > 0 for s (0, 1) and vs is not essentially constant, by Lemma 38, for some s S s.t. vs(s ) µ (s, t)[vs] = 0 vs µ (s, t)[vs] 2 L (µ (s,t)) vs µ (s, t)[vs] 2 L2(µ (s,t)) = |vs(s ) µ (s, t)[vs]|2 µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 + P s =s µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 |vs(s ) µ (s, t)[vs]|2 µ (s, t)(s )|vs(s ) µ (s, t)[vs]|2 From (C.14), by the property of variance, Varµ (s,t)( v) Varµ (s,t)(vs/α s,t) v q Varµ (s,t)(vs/α s,t) . Wang, Si, Blanchet, Zhou Hence applying similar analysis, m[vsws] α s,tµ(t)[ws] + m[ws]µ(t)[vsws] α s,tµ(t)[ws]2 = | sα s,t| µ (s, t) dm dµ(t)vs/α s,t + µ (s, t) dm µ (s, t)[vs/α s,t] Varµ (s,t)(vs/α s,t) Covµ (s,t) dµ(t), vs/α s,t By (C.11), we have E|V | E|V |1Ωn,p(µ)c + γ Z 1 0 E | s th(s, t)| 1Ωn,p(µ)dsdt E|V |1Ωn,p(µ)c + γ sup s,t (0,1) E(|D1| + |D2| + |D3| + |D4|)1Ωn,p(µ). Recall the definition (C.9) of V , V = (T (ˆq) T (q )) (b T(ˆq) b T(q )) T (ˆq) T (q ) + b T(ˆq) b T(q ) 2γ ˆq q . So, by Lemma 33, E|V |1Ωn,p(µ)c 2γ ˆq q P(Ωn,p(µ)c) p2n log(e|S|) (C.16) By the previous bounds on |Di|, i = 1, 2, 3, 4, E|V | 2γ ˆq q p2n log(e|S|) + γ sup s,t (0,1) L (µ) 1Ωn,p(µ) µ2 n log(e|S|) + 8 v µ2 n log(e|S|) + 25 ˆq q p3/2 n log(e|S|) where we choose p = 1 4p and the last inequality follows from the assumption in Proposition 24 that n p 1 . Variance-Reduced Distributionally Robust Q-Learning To bound the variance, note that Var(b T( x) b T(x )) EV 2 and EV 21Ωn,p(µ) γ2 Z 1 0 ( s th(s, t))2dsdt which follows from applying Jensen s inequality to the [0, 1]2 integral. 
Therefore, Var(b T(ˆq) b T(q )) 8γ2 ˆq q 2 P(Ωn,p(µ)c) + γ2E Z 1 0 4(D2 1 + D2 2 + D2 3 + D2 4)dsdt1Ωn,p(µ) 27γ2 ˆq q 2 µ2 n log(e|S|) + 112 µ v 2 sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) 27γ2 ˆq q 2 µ2 n log(e|S|) + 211 ˆq q 2 µ3 n log(e|S|). 212 ˆq q 2 p3 n log(e|S|). C.6 Proof of Proposition 25 Proof Recall the notations and definitions in as the proof of Proposition 24 in Appendix C.5 and, in particular, the definition (C.9) and bound (C.11) for V . We again choose p 1 4ps,a, . As Appendix C.5, we have that |V (s, a)| |V |1Ωn,p(µ)c + γ sup s,t (0,1) (|D1| + |D2| + |D3| + |D4|)1Ωn,p(µ) where µ = ps,a. Since Assumption 1 is assumed, the bounds on D1, D2, D3, D4 are still applicable. Therefore, by Hoeffding s inequality and union bound P(|V (s, a)| > t) P(Ωn,p(ps,a)c) + P γ sup s,t (0,1) (|D1| + |D2| + |D3| + |D4|) > t, Ωn,p(ps,a) P sup s S |ps,a,n(s ) ps,a(s )| > p + P 8 ˆq q (ps,a, p) ps,a, sup s S |mn(s )| > t P |mn(s )| > p + P p3/2 s,a, |mn(s )| > t exp 2p2n + exp p3/2 s,a, t2n 56 ˆq q 2 Wang, Si, Blanchet, Zhou where mn = ps,a,n ps,a. Then, as p ps,a, for all (s, a) S A, by union bound P( V > t) 2|S|2|A| exp p2 n + exp p3 t2n 56 ˆq q 2 We first control the first term to be less than η/2, which is implied by p2 log(4|S|2|A|/η). Finally, the second term less than η/2 is implied by choosing t2 = 56 ˆq q p3 n log(4|S|2|A|/η). This proves the claimed result. Appendix D. Proof of Technical Lemmas: Empirical Measures and Concentrations The proofs in the rest of this section is based on the following concentration property of maximum subgaussian random variables. D.1 Subgaussian Maximum Inequality Lemma 39 Let {Yi, i = 1 . . . n} be σ2-sub-Gaussian with zero means, not necessarily independent, then EZ := E max i=1...n |Yi|k 2kσk (k 1 + log n)k/2 . Proof For any λ > 0, consider an increasing function φλ(z) = exp(λz1/k) for z 0. Since Z 0, φλ(EZ) = φλ(EZ1 {Z > u} + EZ1 {Z u}) φλ(EZ1 {Z > u} + u P(Z u)) Take second derivatives, φ λ(z) = k 2λz1/k 2eλz1/k(λz1/k k + 1); one can see that φλ(z) is convex for z (k 1)kλ k. Let u = (k 1)kλ k. By Jensen s inequality φλ(EZ) Eφλ(Z + (k 1)kλ k) = ek 1E exp(λ max i=1...n |Yi|) i=1 Eeλ|Yi| Variance-Reduced Distributionally Robust Q-Learning Since {Yi} are Sub-Gaussian, P(|Yi| > t) 2 exp t2 By Rigollet (2015, Lemmas 1.4 and 1.5), one can show that log Eeλ|Yi| log(E[eλYi + e λYi]) log(2 exp(σ2λ2/2)) 4σ2λ2. λ E max i=1...n |Yi|k 1/k = log φλ(EZ) k 1 + log n + 4σ2λ2. Rearrange and take infimum over λ > 0, we conclude E max i=1...n |Yi|k inf λ>0 k 1 + log n 2kσk (k 1 + log n)k/2 D.2 Proof of Lemma 33 Proof By definition and Markov s inequality P(Ωn,p(µ)c) = P sup y |µn(y) µ(y)| > p 1 p2k E sup y |µn(y) µ(y)|2k = 1 p2kn2k E i=1 1 {Yi = y} µ(y) Since Pn i=1 1 {Yi = y} µ(y) is n/4 sub-Gaussian, by Lemma 39 P(Ωn,p(µ)c) 1 p2knk (2k 1 + log(|Y |))k = 1 p2knk log(e2k 1|Y |)k as claimed. D.3 Proof of Lemma 34 Proof Note that by Jensen s inequality, Wang, Si, Blanchet, Zhou So it suffices to show the second claim. By assumption, L (ξn) 1A 1 p2 E sup y |mn(y)|21A. Same as the proof of Lemma 33, we use Lemma 39 to conclude that L (ξn) 1A 1 p2n log(e|Y |). Appendix E. Proof of Technical Lemmas: KL Case E.1 Proof of Lemma 30 Proof sup α 0 f(ν, u, α) lim α 0 f(ν, u, α) = ess inf ν u ess inf µ u u L (µ) On the other hand, since the sup is achieved on compact K. For optimal α ν > 0, sup α 0 f(ν, α) u L (ν) α ν log ν[e (u u L (ν))/α ν] where the last line follows from that ν[e (u u L (ν))/α ν] > 0 and ν µ. Also, if α ν = 0, the above holds trivially. 
E.2 Proof of Lemma 31 Proof Let α and α n Use Lemma 30, sup α 0 f(µ, u, α) sup α 0 f(µn, u, α) = α n log µn[e u/α n] + α nδ α log µ[e u/α ] α δ α n(0) log µn[e (u κ)/α n(0)] + α nδ α log µ[e (u κ)/α ] α δ inf κ R |f(µn, u κ, α n(0))| + |f(µ, u κ, α )| 2 inf κ R u κ L (µ) = 2 |u|span Variance-Reduced Distributionally Robust Q-Learning E.3 Proof of Lemma 32 By definition, on Ωn,p(µ), |µn(y) µ(y)| p. So for all y s.t. µ(y) > 0 0 < µ p µ(y) p µn(y). Moreover, if µn(y) = 0, then 0 µ(y) p, we must have that µ(y) = 0. So, µn µ and hence µn µ. E.4 Proof of Lemma 35 Proof First we note that for any κ R µ[e u/α] = m[e (u κ)/α] µ[e (u κ)/α] . Therefore, it suffices to show that for m = µ1 µ2 s.t. µ µ1, µ2 µ[w]2 9 u j L (µs,a) Fix any c > 0, write µ[w]2 = max sup α [0,c u ] µ[w]2 , sup α c u =: max {J1(c), J2(c)} . We first bound J2(c) J2(c) = sup α c u L (µ) αjm[e (u+ u L (µ))/α]2 µn[e (u+ u L (µ))/α]2 For simplicity, let w := e (u+ u L (µ))/α. Recall that m = µn µ, so m[1] = 0 and αjm[e (u+ u L (µ))/α]2 = (m[αj/2(e (u+ u L (µ))/α 1)])2. Define and note that v := αj/2(e (u+ u L (µ))/α 1) < 0. Then µ[w ]2 = m[v]2 = 1 µ[w ]2 µ dm We defer the proof of the following claim: Wang, Si, Blanchet, Zhou Lemma 40 For any j [0, 2] sup α c u L (µ) L (µ) (c u L (µ))j/2(e2/c 1). J2(c) cj u j L (µ)(e2/c 1)2 dm Choose c = 2/ log 2 µ[w]2 = max {J1(c), J2(c)} max n cj u j L (µ), cj u j L (µ)(e2/c 1)2o dm 9 u j L (µ) which completes the proof. E.4.1 Proof of Lemma 40 Proof We bound v L (µ) = ess sup µ αj/2(e(u(s)+ u L (µ))/α 1) αj/2(e2 u L (µ)/α 1) Compute derivative: let β = 2 u L (µ)/α d dααj/2(e2 u L (µ)/α 1) = αj/2 1((eβ 1)j/2 βeβ) Notice that when β = 0, the above expression is 0. Moreover, for j [0, 2] d dβ ((eβ 1)j/2 βeβ) = (j/2 1 β)eβ < 0; i.e. (eβ 1)j/2 βeβ is decreasing. Therefore, for α > 0 d dααj/2(e2 u L (µ)/α 1) < 0; i.e. αj/2(e2 u L (µ)/α 1) is decreasing in α. Hence sup α c u L (µ) L (µ) sup α c u L (µ) αj/2(e2 u L (µ)/α 1) = (c u L (µ))j/2(e2/c 1) establishing the claim. Variance-Reduced Distributionally Robust Q-Learning E.5 Proof of Lemma 37 Proof Let u = u + u L (µ) and w = e u /α. sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] = sup α 0 α3 mn[w ] µn(t)[w ] + mn[u w ] αµn(t)[w ] mn[w ]µn(t)[u w ] αµn(t)[w ]2 2 sup α 0 α3 mn[w ] µn(t)[w ] + mn[u w ] 2 + 2 sup α 0 αmn[w ]2µn(t)[u w ]2 =: 2OPT1 + 2OPT2 We first analyze OPT1. Fix c 0, we separately consider α c u L (µ) and α [0, c u L (µ)]. The first two terms sup α c u L (µ) α3 mn[w ] µn(t)[w ] + mn[u w ] 2 = sup α c u L (µ) α3 mn[(1 + u /α)w ]2 = sup α c u L (µ) mn[α3/2((1 + u /α)w 1)]2 µn(t)[w ]2 . Recall that 1 + x ex; i.e. (1 + u /α)w 1 0. Also, by Lemma 32, on Ωn,p(µ), p < µ , µn(t) µn µ. So, mn[α3/2((1 + u /α)w 1)]2 α3/2(1 (1 + u /α)w ) α3/2(eu /α (1 + u /α)) 2 Recall the Taylor series of ex. For all s S, we have that α3/2(eu (s)/α (1 + u (s)/α)) = Notice that k 3/2 > 0 for k 2 and the terms in the sum are non-negative. So, the above expression suggests that α α3/2(eu (s)/α (1 + u (s)/α)) is decreasing. Therefore, on Ωn,p(µ) sup α c u L (µ) α3 mn[w ] µn(t)[w ] + mn[u w ] 2 c3 u 3 L (µ)(e2/c 1)2 dmn dµn(t) Wang, Si, Blanchet, Zhou sup α [0,c u L (µ)] α3 mn[w ] µn(t)[w ] + mn[u w ] 2 sup α [0,c u L (µ)] µn(t)[w ]2 + αmn[u w ]2 2 sup α [0,c u L (µ)] α3 dmn dµn(t) L (µ) + α u 2 L (µ) 2(c3 + 4c) u 3 L (µ) Choose c = 2, we conclude that OPT1 32 u 3 L (µ) For OPT2, we use Lemma 35. 
OPT2 9 u L (µ) µn(t)[u w ]2 L (µ) u 2 L (µ) 36 u 3 L (µ) Therefore, we conclude that on Ωn,p(µ) sup α 0 α3 mn[w] µn(t)[w] + mn[uw] αµn(t)[w] mn[w]µn(t)[uw] 2 136 u 3 L (µ) The lemma follows from considering u κ, which won t change the left hand side. E.6 Proof of Lemma 36 Proof From Si et al. (2020), it is sufficient to consider α [0, δ 1 u L (µ)] =: K. For α > 0 fixed, tgn(t, α) = α mn[w] Also, for α = 0, by Lemma 32 and p 1 4µ , gn(t, 0) ess infµ u; hence tgn(t, 0) 0. Again by Lemma 32 µn(t) µ on Ωn,p(µ). So, the Radon-Nikodym theorem applies: For Variance-Reduced Distributionally Robust Q-Learning fixed t [0, 1], lim α 0 sup s (t ϵ) [0,1] | tgn(t, α)| lim α 0 sup t [0,1] α mn[w] µn(t)[w] = lim α 0 sup t [0,1] α 1 µn(t)[w]µn(t) dmn lim α 0 sup t [0,1] α dmn dµn(t) lim α 0 α µ p = 0. where we used H older s inequality to get the second last line. Therefore, tg( , ) is continuous on [0, 1] K. Next define Θ(t) := arg max α K g(t, α). To simplify notation, we use the w to denote w = w n(t) = e u/α n(t). We discuss two cases: 1. If u is µ-essentially constant with u L (µ) = u, then sup α K α log e u/α αδ = sup α K u αδ; i.e. Θ(t) = {0}. 2. u is not µ-essentially constant. Note that when α > 0, w > 0; we can define a new measure µ n(t)[ ] = µn(t)[w ] We have that α αgn(t, α) = µn(t)[u2w] α3µn(t)[w] + µn(t)[uw]2 α3µn(t)[w]2 = µ n(t)[u2] α3 + µ n(t)[u]2 = Varµ n(t)(u) i.e. gn(t, ) is strictly concave for α > 0. Also, recall that gn(t, ) is continuous at 0. So, in this case either Θ(t) = {0} or Θ(t) = {α n(t)} where δ 1 u L (µ) α n(t) > 0. In particular, Θ(t) is a singleton which we will denote by α n(t) in both cases. We conclude that by Shapiro et al. (2014) Theorem 7.21, the following derivative exists dt sup α K gn(t, α) = sup α Θ(t) tgn(t, α) = tgn(t, α n(t)). Wang, Si, Blanchet, Zhou Next, we analyze the second derivative. We prove that under Assumption 1, we have that on Ωn,p(µ) α = 0 or α > 0 will imply that α n(t) = 0 or α n(t) > 0 respectively. Let ρ = µ({y : u(y) = ess infµ u}) and ρn(t) the mixed version. Since µn µ, if ρ = 1 (thence α = 0), then we automatically have that ρn(t) 1 and α n(t) 0. Now we consider the case ρ = 1. Notice that by definition of Ωn,p(µ), ρ p ρn ρ+p. There are two cases: 1. α = 0. From Hu and Hong (2013), α = 0 iffρ e δ. If we want α n(t) = 0 for all t [0, 1], a sufficient condition is that ρn(t) ρ p e δ. 2. α > 0 iffρ < e δ. If we want α n(t) > 0 for all t [0, 1], a sufficient condition is that ρn(t) ρ + p < e δ. Therefore, for any e δ = ρ {µ({y : u(y) t}) : t R}, we can always choose p small enough s.t. for ω Ωn,p(µ), ρn(t) is close to ρ for all t and the above sufficient conditions hold. Remark 41 While this generalizes to all but finitely many δ, for simplicity of presentation, we assume Assumption 1 that µ /2 1 e δ. So, if ρ = 1, then 1 ρ µ > 1 e δ; i.e. ρ < e δ and case 1 cannot happen. Therefore, α = 0 iffu is µ essentially constant. Moreover, by our choice p 1 satisfying the sufficient condition in case 2. Hence our assumption on p implies that if α = 0 or α > 0, then on ω Ωn,p(µ), α n(t) = 0 or α n(t) > 0 for all t [0, 1] respectively. 1. α = 0, then gn(t, α n(t)) = gn(t, 0) is constant. Hence dtdtgn(t, α n(t)) = 0. 2. α > 0, then α n(t1), α n(t2) > 0. Since gn(t, ) is strictly convex, α n(t) is the unique solution to the first order optimality condition 0 = αgn(t, α n(t)) = log µn(t)[w] δ µn(t)[uw] α n(t)µn(t)[w]. (E.2) Note that αgn C ([0, 1] R++) and that α αgn(t, α n(t)) < 0. 
The implicit function theorem implies that α n(t) C1((0, 1)) with derivative dtα n(t) = t αgn(t, α n(t)) α αgn(t, α n(t)) Varµ n(t)(u) µn(t)[w] + µn(t)[uw]mn[w] α n(t)µn(t)[w]2 mn[uw] α n(t)µn(t)[w] We conclude that tgn(t, α n(t)) = α n(t) mn[w] Variance-Reduced Distributionally Robust Q-Learning is C1((0, 1)) as a function of t. Therefore, gn(t, α n(t)) is C2((0, 1)) with derivative dtdtgn(t, α n(t)) = dt tgn(t, α n(t)) = α n(t) mn[w]2 µn(t)[w]2 + dtα n(t) mn[w] µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 = α n(t) mn[w]2 µn(t)[w]2 α n(t)3 Varµ n(t)(u) µn(t)[w] + mn[uw] α n(t)µn(t)[w] mn[w]µn(t)[uw] α n(t)µn(t)[w]2 Therefore, Lemma 36 summarizes these two cases. E.7 Proof of Lemma 38 Proof First, we note that if µ n(t)(y) µn(t)(y) > 0, then by Lemma 32, µ n(t)(y) µn(t)(y) 3 4µ . So, we will only consider cases where µ n(t)(y) < µn(t)(y). We now fix any such y. By Lemma 36, under the given assumptions α > 0 implies that α n(t) > 0. So, the KL constraint is binding; i.e. δ = DKL(µ n(t)||µn(t)). By the log-sum inequality, δ = DKL(µ n(t)||µn(t)) µ n(t)(y) log µ n(t)(y) µn(t)(y) + (1 µ n(t))(y) log 1 µ n(t)(y) 1 µn(t)(y) kl(q, b) = q log q + (1 q) log 1 q where we think of b = µn(t)(y). Observe that for q (0, b) q qkl(q, b) = 1 q + 1 1 q > 0; i.e. kl( , b) is strictly convex and the maximum is achieved at q = 0, kl(0, b) = log(1/(1 b)). Since b [3 4µ ], we have that log(1/(1 b)) log(1/(1 3 4µ > δ. So, by the convexity, continuity of kl( , b) and kl(b, b) = 0, there is unique q (0, b) s.t. kl(q , b) = δ. Now we bound such q . Since dqkl(q, b) < 0 for q < b, by the fundamental theorem of calculus and convexity kl(q, b) = Z b q xkl(x, b)dx 2b(1 b) =: ζ(q, b) Wang, Si, Blanchet, Zhou Note that for q < b dbζ(q, b) = (b q)(q + b 2qb) 2(1 b)2b2 > 0 i.e. ζ(q, ) is increasing. Suppose to the contrary q < 1 kl(q , b) ζ(q , b) 4 µ ] ζ(q , b) However, by assumption, 1 24µ δ kl(q , µn(t)(y)) > µ . Hence q 1 2µ . We conclude that µ n(t) 1 Appendix F. The Empirical Robust Bellman Operator: χ2 Case To analyze the variance-reduced Q-learning for the χ2 case, we establish important statistical properties of the empirical DR ellman operator b T and its recentered version b H. We defer the proofs to Appendix H. The proof techniques are similar to that in Appendix C. We let b T be the empirical DR Bellman operator formed by n samples defined in (5.5). Define the recentered operators b H, H as in (A.1). We fix ˆq RS A. Proposition 42 Suppose Assumption 3 is enforced. Then |E[ b H(ˆq)(s, a) H(ˆq)(s, a)]| 26 ˆq q p n log(e|S|), provided n p 2 , and Var( b H(q ))(s, a) 211 ˆq q 2 p2 n log(e|S|) for all n 1. Proposition 43 Assume Assumption 3. Then w.p. at least 1 η H(ˆq) b H(ˆq) 6 ˆq q log(4|S|2|A|/η) provided that n 8p 2 log(4|S|2|A|/η) Proposition 44 The empirical DR Bellman operator b T(q) T (q) 8(rmax + γ q ) log (6|S||A|(|S| |R|)/η) w.p. at least 1 η, provided that n 8p 2 log (12|S||A|(|S| |R|)/η). Variance-Reduced Distributionally Robust Q-Learning Appendix G. Analysis of the Variance-Reduced Q-Learning: χ2 Case We proceed with the analysis of the variance-reduced DR Q-learning Algorithm 2 in the χ2 divergence case, similar to the KL case. Specifically, we aim to show that if the q-function from the last variance-reduced algorithm epoch, ˆql 1, is within a certain error b of the optimal q , then ˆql will have a better concentration bound by a geometric factor. This is summarized in Proposition 45, which is analogous to Proposition 10 in the KL case. 
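Before turning to the formal statements, the following minimal sketch spells out the two ingredients that the epoch-wise argument manipulates: the chi-square dual value behind the operator (5.5), and one variance-reduced epoch written in the form of update (B.6). Both are illustrations under our own conventions: chi2_dual_value and vr_epoch_chi2 are hypothetical helper names, dr_bellman stands in for the chi-square empirical operator (5.5) applied to a batch of samples, and the batch bookkeeping is an assumption rather than the paper's implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def chi2_dual_value(probs, values, delta):
    # sup_alpha { alpha - c(delta) * sqrt(E_mu[(alpha - u)_+^2]) } with c(delta) = sqrt(1 + 2*delta),
    # i.e. the dual form in Lemma 17 / (H.1); assumes delta > 0.
    u = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    c = np.sqrt(1.0 + 2.0 * delta)
    lo = u[probs > 0].min()
    hi = (c * u.max() - lo) / (c - 1.0) + 1e-8   # beyond this point the objective drops below its value at lo
    def neg_dual(alpha):
        return -(alpha - c * np.sqrt(np.dot(probs, np.maximum(alpha - u, 0.0) ** 2)))
    res = minimize_scalar(neg_dual, bounds=(lo, hi), method="bounded")
    return max(-res.fun, float(lo))

def vr_epoch_chi2(q_hat_prev, recenter_batch, step_batches, gamma, delta):
    # One epoch of the variance-reduced scheme, cf. update (B.6): a single recentering
    # estimate T_tilde(q_hat_{l-1}) is formed from a large batch, after which each of the
    # k_vr steps uses only a small batch to estimate the difference T(q_{l,k}) - T(q_hat_{l-1}).
    T_tilde = dr_bellman(q_hat_prev, recenter_batch, gamma, delta)
    q = q_hat_prev.copy()                              # q_{l,1} = q_hat_{l-1}
    for k, batch in enumerate(step_batches, start=1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)          # stepsize lambda_k = 1/(1 + (1 - gamma) k)
        t_q = dr_bellman(q, batch, gamma, delta)
        t_ref = dr_bellman(q_hat_prev, batch, gamma, delta)   # same batch, so the difference concentrates
        q = (1.0 - lam) * q + lam * (t_q - t_ref + T_tilde)
    return q                                           # plays the role of q_hat_l

The benefit of the recentering is visible directly in the update: the fluctuations of the increment t_q - t_ref scale with the residual error of the previous epoch rather than with the magnitude of the q-function itself, which is exactly what Propositions 42 and 43 quantify for the recentered operator.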
Recall that Fl denotes the σ-field generated by the random samples used until the end of epoch l. We define the conditional expectation El 1[ ] = E[ |Fl 1]. Proposition 45 Assuming that Assumptions 2 and 3 are satisfied. On {ω : ˆql 1 q b} for some b 1/(1 γ), under measure Pl 1( ) := El 1[1 { }], we have that there exists numerical constant c s.t. ˆql q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (3dkvr/η)2 + c 1 p (1 γ)2 ml w.p. at least 1 η, provided that ml 8p 2 log(24d/η) and nvr p 1 . Proof [Proof of Proposition 45] We recall the proof of Proposition 10 in Appendix B.3.1. We have that by (B.12), under Pl 1, on {ω : ˆql 1 q b} ql,k+1 q λk + Ql,k+1 + 2 Dl w.p.1. The sequence {Ql,j : j = 1, . . . , k + 1}, by (B.13), satisfies j=1 Ql,j + Ql,kvr+1 λkvr log(e + (1 γ)kvr) ζl 1 1 γ + σl 1 p log (4|S||A|kvr/η) w.p. at least 1 η, where we recall that ζl 1 = ˆql 1 q , σ2 l 1 = max (s,a) S A Varl 1(Hl,k(ˆql 1)(s, a)). Therefore, by Proposition 42, we have that j=1 Ql,j + Ql,kvr+1 c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr log (4|S||A|kvr/η)2 Wang, Si, Blanchet, Zhou for some constant c. Moreover, recall the definition of Dl in (B.14). By Propositions 42, 43, and 44, we have that Dl c rmax + |q |span + ˆql 1 q log (12d/η) + c ˆql 1 q c 1 p (1 γ) ml log (12d/η) + c b p nvr for some constant c that can change from line to line. Combining these bound with (G.1) and apply union bound, we conclude that ql,kvr+1 q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (8dkvr/η)2 + c 1 p (1 γ)2 ml log (24d/η) w.p. at least 1 η. Recall the definition in Algorithm 2 that ql,kvr+1 = ˆql. Finally, we adjust the constant in the log factor using the inequality for C1 1, C2 e, log(C1C2) = log(C1) + log(C2) C1 log(C2). This completes the proof. Given Proposition 46, we apply the analysis techniques for the variance-reduction iterates in the proof of 46. This yields the following Proposition. Proposition 46 Assume Assumptions 2 and 3. For ϵ < (1 γ) 1, define parameters according to(5.6). Then, the statement of Proposition 11 hold; i.e. the sequence {ˆql, 0 l lvr} produced by Algorithm 2 satisfies the pathwise property that ˆql q 2 l(1 γ) 1 for all 0 l lvr w.p. at least 1 η. In particular, the final estimator ˆqlvr satisfies ˆqlvr q 2 lvr(1 γ) 1 w.p. at least 1 η. Proof [Proof of Proposition 46] Follow the proof of Proposition 11, we only to validate (B.16) given the parameter choice in (5.6). By Proposition 45, conditioned on ˆql 1 q 2 (l 1)(1 γ) 1 =: b ˆql q c b (1 γ)2kvr + b p (1 γ)3/2 nvrkvr + b p (1 γ) nvr log (3dkvr/η)2 + c 1 p (1 γ)2 ml w.p. at least 1 η. Therefore, it is easy to see that by the parameter choice (5.6), we have that for sufficiently large cvr and for events ω ˆql 1 q 2 (l 1)(1 γ) 1 , Pl 1 1 n ˆql q 2 l(1 γ) 1o (ω) 1 η validating (B.16). Following the same arguments as in proof of Proposition 11 will yield Proposition 46. Variance-Reduced Distributionally Robust Q-Learning Now, we prove Theorem 18. Proof [Proof of Theorem 18] By Proposition 46, under the parameter choice (5.6), ˆqlvr q ϵ w.p. at least 1 η. The total number of samples used is lvrnvrkvr + = O |S||A| 1 p2 (1 γ)4 + 4lvr This yields the sample complexity bound in Theorem 18. Appendix H. Proofs of Properties of the Empirical Bellman Operator: χ2 Case We first define some notations that mimic the definitions in Appendix C. Again, we override the notations for the KL case. 
For generic probability measure µ on (Y, 2Y ) and random variable u : Y R, let w = (α u)+; define the χ2 dual functional under the reference measure µ as f(µ, u, α) := α c(δ)µ[w2] 1 2 . (H.1) Recall the dual formulation of the DR Bellman operator (5.4), we have that T (q)(s, a) = sup β R f(νs,a, id, β) + γ sup α R f(ps,a, v(q), α). (H.2) Next, we present two important lemmas that underlie our analysis of the DR Bellman operator in the χ2 case. First, we characterize the optimal Lagrange multiplier in the dual formulation (5.4). Lemma 47 For δ > 0, f(µ, u, α) is second continuously differentiable and concave for α > ess infµ u. The supremum is achieved at ess infµ u α < , i.e. supα R f(µ, u, α) = f(µ, u, α ), satisfying µ[w2] = c(δ)2µ[w]2. (H.3) Moreover, if α > ess infµ u, then µ ( ) := µ[w1 { }] µ[w] . (H.4) is a worst-case measure satisfying µ [u] = f(µ, u, α ) = inf µ :Dχ2(µ µ) δ µ [u] = inf µ :Dχ2(µ µ)=δ µ [u]; i.e. the χ2 constraint is active. Finally, if α = ess infµ u, then the measure µ ( ) := µ[1 {U }] where U := {s : µ(s) > 0, u(s) = ess infµ u} is a worst-case measure. Wang, Si, Blanchet, Zhou With this lemma, we can show that under Assumption 3, the optimal Lagrange multiplier α is sufficiently large so that w = (α v)+ = α v a.s.µ. Lemma 48 If δ < 1 2µ := mins:µ(s)>0 µ(s), then α ess supµ u. Moreover, if u is not µ essentially constant, then α > ess supµ u. The proofs of these Lemmas are deferred to Appendix I. H.1 Proof of Proposition 42 As in Appendix C.5, call V := H(ˆq) b H(ˆq) = (T (ˆq) T (q )) (b T(ˆq) b T(q )). Recall the following notations in Appendix C.5: vt = tv(ˆq) + (1 t)v(q ), µ = p, m = p pn, and µ(t) = tp (1 t)pn. Let h(s, t) := sup α R f(µ(t), vs, α). We consider Ωn,p(µ) with p 1 4µ . Then, by Lemma 32, we have that µ µn µ(t) on Ωn,p(µ). Also, recall that α s,t is the optimal Lagrange multiplier that satisfies the conclusions of Lemma 47. First we note that if v(ˆq) and v(q ) are both µ essentially constant, then V = 0, and the claim of Proposition 42 holds trivially. Moving forward, we consider the case at least one of v(ˆq) and v(q ) is not µ essentially constant. We proceed to show the differentiability of h in this setting. This is summarized by Lemma 49. The proof of this result is deferred to Appendix I. Note that Assumption 3 implies that δ < 1 Lemma 49 Suppose δ < 1 4µ . If at least one of v(ˆq) and v(q ) is not µ essentially constant, then on Ωn,p(µ) there exists function s, t D2h(s, t) s.t. |V (s, a)| γ Z 1 D2h(s, t) dsdt (H.6) w.p.1, where D2h(s, t) = µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] m[w2 s] 2µ(t)[w2s] m[ws] sα s,t = c(δ)2µ(t)[ws]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] (H.8) We analyze the two terms separately. Recall that ws 0. Similar to the techniques in Appendix C.5, we have that on Ωn,p(µ) with µ µn µ(t), µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] v m[w2 s] 2µ(t)[w2s] Variance-Reduced Distributionally Robust Q-Learning and m[ vws] Hence on Ωn,p(µ), For D2, we note that sα s,t = c(δ)2µ(t)[ws]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] = 1 c(δ)2 1 c(δ)2µ(t)[ v] µ(t)[ws v] Next, we consider m[w2 s] 2µ(t)[w2s] m[ws] µ(t)[ws] = m[(ws µ(t)[ws])2] + 2m[ws]µ(t)[ws] 2µ(t)[w2s] m[ws] = m[(ws µ(t)[ws])2] 2c(δ)2µ(t)[ws]2 (c(δ)2 1)m[ws] c(δ)2µ(t)[ws] where we use the optimality condition (E.2) to replace µ(t)[w2 s] with c(δ)2µ(t)[ws]2. Then, m[(ws µ(t)[ws])2] = µ(t) dm dµ(t)(ws µ(t)[ws])2 µ(t) (ws µ(t)[ws])2 dm dµ(t) = µ(t)[w2 s] µ(t)[ws]2 dm dµ(t) = (c(δ)2 1)µ(t)[ws]2 dm dµ(t) where we also apply (E.2) and µ(t) µ. 
So, m[w2 s] 2µ(t)[w2s] m[ws] m[(ws µ(t)[ws])2] 2c(δ)2µ(t)[ws]2 + (c(δ)2 1)m[ws] c(δ)2µ(t)[ws] Wang, Si, Blanchet, Zhou Therefore, we have that |D2| = sα s,t m[w2 s] 2µ(t)[w2s] m[ws] as c(δ)2 = 1 + 2δ 1. So, on Ωn,p(µ), | s th(s, t)| |D1| + |D2| 9 Recall (C.16) and (C.17), we have that E|V | 2γ ˆq q p2n log(e|S|) + 5γ v sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) µ2 n log(e|S|) + 5 ˆq q µ2 n log(e|S|) + 20 ˆq q p n log(e|S|) where we choose p = 1 4p and the last inequality follows from the assumption that n p 2 . To bound the variance, we use the same techniques as in (C.18) and conclude that for n 1 Var(b T(ˆq) b T(q )) 8γ2 ˆq q 2 P(Ωn,p(µ)c) + γ2E Z 1 0 2(D2 1 + D2 2)dsdt1Ωn,p(µ) 27 ˆq q 2 µ2 n log(e|S|) + 24 v 2 sup s,t (0,1) E dm dµ(t) L (µ) 1Ωn,p(µ) 27 ˆq q 2 µ2 n log(e|S|) + 210 ˆq q 2 µ n log(e|S|). 211 ˆq q 2 p n log(e|S|). This is the variance of H(ˆq) as H(ˆq) is deterministic. H.2 Proof of Proposition 43 Proof Given Lemma 49, we directly apply the arguments in Appendix C.6. Variance-Reduced Distributionally Robust Q-Learning We have that w.p.1, |V (s, a)| |V |1Ωn,p(µ)c + γ sup s,t (0,1) (|D1| + |D2)1Ωn,p(µ) where µ = ps,a. Recall the choice p 1 4ps,a . By Hoeffding s inequality and the union bound P(|V (s, a)| > t) P(Ωn,p(ps,a)c) + P γ sup s,t (0,1) (|D1| + |D2|) > t, Ωn,p(ps,a) P sup s S |ps,a,n(s ) ps,a(s )| > p + P 5γ ˆq q ps,a, p sup s S |m(s)| > t P (|m(s)| > p) + P 8 ˆq q ps,a, |m(s)| > t exp 2p2n + exp p2 s,a, t2n 32 ˆq q 2 Then, as p ps,a, for all (s, a) S A, by union bound P( V > t) 2|S|2|A| exp p2 n + exp p2 t2n 32γ2 ˆq q 2 We first control the first term to be less than η/2, which is implied by p2 log(4|S|2|A|/η). Finally, the second term less than η/2 is implied by choosing t2 = 32γ2 ˆq q p2 n log(4|S|2|A|/η). This proves the claimed result. H.3 Proof of Proposition 44 Proof We recall the bound (C.4). If v(q) is essentially constant w.r.t. ps,a, then b T(q)(s, a) = T (q)(s, a). Therefore, we then focus on the case that v(q) is not essentially constant. Again, we fix p 1 4µ and thus on Ωn,p(µ), µ µn where µ = νs,a or ps,a. So, if u is not essentially constant, by Assumption 3 and Lemma 48, we have that sup α R f(µn, u, α) = f(µn, u, α n), sup α R f(µ, u, α) = f(µ, u, α ) for some α n, α > ess supµ u =: u . Then, as in (C.4) we analyze sup α>u |f(µn, u, α) f(µ, u, α)| . Wang, Si, Blanchet, Zhou Since α > u , µ[w2] > 0 and f is differentiable in µ on Ωn,p(µ). By the mean value theorem, |f(µn, u, α) f(µ, u, α)| = c(δ)1 for some τ [0, 1] where we used (H.3) and µ(t) = tµ + (1 t)µn and m = µ µn. We first consider when α > 2 u , sup α>2 u |f(µn, u, α) f(µ, u, α)| m[α2 2αu + u2] αm[u] α µ(τ)[u] m[u2] α µ(τ)[u] (α µ(τ)[u])m[u] + µ(τ)[u]m[u] m[u2] α µ(τ)[u] (i) |m[u]| + µ(τ)[u] 2 m[u2] u u sup y Y |m(y)| where (i) uses that α 2 u and hence α µ(τ)[u] u . On the other hand, if u < α 2 u sup u <α 2 u |f(µn, u, α) f(µ, u, α)| sup u <α 2 u sup u <α 2 u m[α u] µ(τ)[α u] 2 u 1 µ p sup y Y |m(y)| µ sup y Y |m(y)| where the last two inequalities follow from Lemma 32 and p 1 Variance-Reduced Distributionally Robust Q-Learning Therefore, we have P sup α R |f(µn, u, α) f(µ, u, α)| > t P(Ωn,p(µ)c) + P sup α>v |f(µn, u, α) f(µ, u, α)| > t, Ωn,p(µ) P sup y |µn(y) µ(y)| > p + P µ sup y Y |mn(y)| > t exp( 2p2n) + exp µ2 t2n 2 u 2 exp( 2p2n) + exp µ2 t2n 2 u 2 where we used Hoeffding s inequality and union bound. Therefore, going back to the DR Bellman operator setting, we choose p = 1 4p . 
By union bound P( b T(q) T (q) > t) sup s,a sup β R |f(νs,a,n, id, β) f(νs,a, id, β)| > t + P sup s,a sup α R |f(ps,a,n, v(q), β) f(ps,a, v(q), β)| > t 2(|S|2|A| + |S||A||R|) exp p2 n + 2|S||A||R| exp p2 t2n 64r2max + 2|S|2|A| exp p2 t2n 64γ2 q 2 We set each of the three terms to be less than η/3 and find that it suffices to have p2 log (12|S||A|(|S| |R|)/η) t 8(rmax + γ |q|span) log (6|S||A|(|S| |R|)/η). This implies the statement of the proposition. Wang, Si, Blanchet, Zhou Appendix I. Proof of Technical Lemmas: χ2 Case I.1 Proof of Lemma 47 Proof First, we note that for every u and µ, f is continuous in α. Differentiate, we see that f(µ, u, ) is C1 with derivative αf(µ, u, α) = 1 c(δ)µ[w2] 1 2 µ[w] (I.1) which is again continuous. Differentiate again, we get that α αf(µ, u, α) = c(δ) µ[w2] 3 2 µ[w]2 µ[w2] 1 2 µ[1 {α > v}] = c(δ)µ[w2] 3 2 µ[w1 {α > v}]2 µ[w2]µ[1 {α > v}2] when α > ess infµ u, where (i) follows from Jensen s inequality. Moreover, this expression is continuous for α > Therefore, f is second differentiable and convex in α when α > ess infµ u. As we commented after Lemma 17, it suffices to optimize over α ess infµ u. By the continuity of f and αf in α and convexity, if the optimizer ess infµ u < α < , it must satisfies 0 = αf(µ, u, α ) = 1 c(δ)µ[w2] 1 which is (H.3). Next, we handle the boundary cases α = and α = ess infµ u. Notice that rewriting (H.3) as we see that for δ > 0, α = , because otherwise w µ[w] = 1 a.s.µ. and the above equality cannot hold. On the other hand, if α = ess infµ u, then (H.3) holds trivially with w = 0. Then, we show that (H.4) is a worst-case measure. It suffices to check that µ [u] = f(µ, u, α ) and Dχ2(µ µ) = δ. We have that µ [u] = µ[wu] = α µ[(α u)1 {α > u} (α u)] (i) = α c(δ)2µ[w] (ii) = α c(δ)µ[w2] 1 2 = f(µ, u, α ) Variance-Reduced Distributionally Robust Q-Learning where (i) and (ii) follows from (H.3). Moreover, by definition (5.1), Dχ2(µ µ) = 1 µ[w]2 + 1 2 again (i) follows from (H.3). Finally, clearly µ defined in (H.5) satisfies µ [u] = ess infµ u = f(µ, u, α ). So, to show that µ is a worst-case measure, it suffices to check that Dχ2(µ µ) δ. To show this, we observe that if α = ess infµ u, then by convexity we must have that for all sufficiently small ϵ > 0, αf(µ, u, α + ϵ) 0. Otherwise, α = ess infµ u cannot be optimal. In particular, let w(ϵ) = (α + ϵ u)+, then by (I.1), we have that µ[w(ϵ)2] c(δ)2µ[w(ϵ)]2. Note that if ϵ is sufficiently small, i.e. when α + ϵ < u(s) for all s / U and µ(s) > 0, then w(ϵ) = ϵ1U. Therefore, we must have that µ[1U] c(δ)2µ[1U]2; i.e. µ(U) 1 c(δ)2. With this bound, we now compute Dχ2(µ µ) = 1 Therefore, this proves Lemma 47. I.2 Proof of Lemma 48 Proof If u is µ essentially constant, then ess infµ u = ess supµ u = α ; i.e. the statement of Lemma 48 holds. Wang, Si, Blanchet, Zhou Next, we prove that if u is not µ essentially constant, then δ < 1 2µ implies α ess supµ u. To achieve this, we first show that α > ess infµ u under these assumptions. We prove this by assuming α = ess infµ u and raising a contradiction. By Lemma 47, µ defined in (H.5) is a worst-case measure. Hence, where (i) follows from the assumption that u is not µ essentially constant, so U = s : µ(s) > 0, u(s) = ess inf µ u cannot be of probability 1. In particular, by the definition of µ , µ(U) 1 µ . Therefore, rearrange terms, we have that δ µ 1 2µ , contradicting our assumption. Therefore, α > ess infµ u. Using this, we then show that if u is not µ essentially constant, δ < 1 2µ , and α > ess infµ u, then α ess supµ u. 
We prove by contradiction, assuming that ess infµ u < α ess supµ u. Since α ess supµ u, we must have that for some s S s.t. µ(s ) > 0, w(s ) = (α u(s ))+ = 0. By Lemma 47, µ defined in (H.4) is a worst-case measure when α > ess infµ u. Moreover, δ = Dχ2(µ µ) contradicting the assumption. Therefore, α > ess supµ u. This completes the proof of Lemma 48. I.3 Proof of Lemma 49 Proof By assumption, we are interested in empirical measures that satisfy Ωn,p(µ) (c.f. (C.3)) with p 1 4µ . Then, by Lemma 32, we have that µ µn µ(t) on Ωn,p(µ). Variance-Reduced Distributionally Robust Q-Learning We first fix s [0, 1]. Let us denote vs, := ess supµ vs. Recall that by Lemma 48, when δ < 1 2µ , it suffices to optimize the Lagrange multiplayer in [vs, , ). We have tf(µ(t), vs, α) = 1 2c(δ)m[w2 s]µ(t)[w2 s] 1 where ws = (α vs)+ = α v. It is not hard to see that tf(µ(t), vs, α) is continuous on [0, 1] [vs, , ) even if vs is essentially constant (in this case we note that tf(µ(t), vs, vs, ) = 0). Next define Θ(t) := arg max α>vs, f(µ(t), vs, α). We discuss two cases: 1. If vs is µ essentially constant, then for α > vs, f(µ(t), vs, α) = α c(δ)(α vs, ) = (1 c(δ))α + c(δ)vs, . Since c(δ) = 1 + 2δ > 0, this is maximized at Θ(t) = {vs, }. 2. vs is not µ essentially constant. Note that then by Lemma 48, α > ess supµ vs, α > vs a.s.µ (hence µ(t)). Recall that the second derivative in (I.2), α αf(µ(t), vs, α) = c(δ)µ(t)[w2 s] 3 2 µ(t)[ws1 {α > vs}]2 µ(t)[w2 s]µ(t)[1 {α > vs}2] = c(δ)µ(t)[w2 s] 3 2 µ(t)[ws]2 µ(t)[w2 s] where the last inequality follows from that ws is not µ(t) constant, hence the variance is positive. So, in this case f(µ(t), vs, ) is strictly concave. Thus, Θ(t) is a singleton. Therefore, in both case, Θ(t) is a singleton. We conclude that by Shapiro et al. (2014, Theorem 7.21), the following derivative exists dt sup α>vs, f(µ(t), vs, α) = sup α Θ(t) tf(µ(t), vs, α) = tf(µ(t), vs, α s,t) 2c(δ)m[w2 s]µ(t)[w2 s] 1 where it is understood that ws = (α s,t vs)+ = α s,t vs. Therefore, we have shown that t h(s, t) is C1(0, 1) C[0, 1]. Hence, |V (s, a)| = γ |h(1, 0) h(0, 0) h(1, 1) + h(0, 1)| 0 th(1, t) th(0, t)dt Wang, Si, Blanchet, Zhou Next, we show that for any fixed t, there exists a mapping s Ds th(s, t) s.t. (C.12) holds. We note that by Lemma 48, α s,t = vs, only when vs is essentially constant. Again, assuming that at least one of v(ˆq) and v(q ) is not µ essentially constant, as in the proof of Proposition 24, this can only happen at one particular s = s . We separately consider these two cases: Case 1: vs is never essentially constant for all s [0, 1]. In this case, α s,t > vs, for all s [0, 1]. Note that ws = α s,t vs > 0. So, if on Ωn,p(µ), α s,t is C1(0, 1) C[0, 1] in s, then by chain rule, s th(s, t) in (I.4) is C1(0, 1) C[0, 1]. As in the proof of Proposition 24, we show differentiability of s α s,t by invoking the implicit function theorem. By the strict convexity (I.3), α s,t is the unique solution to the optimality condition (H.3) 0 = c(δ)2µ(t)[ws]2 µ(t)[w2 s] =: F(s, α s,t). Since F is infinite smooth, the implicit function theorem implies that α s,t is C1(0, 1) C[0, 1] and s th(s, t) is C1(0, 1) C[0, 1]. We compute the derivative s th in this case. Recall v = v(ˆq) v(q ). Differentiate w.r.t. s on both side, we have 0 = c(δ2)2µ(t)[ws]µ(t)[ sα s,t v] 2µ(t)[ws( sα s,t v)]. Rearranging terms, we have sα s,t = c(δ)2µ(t)[w]µ(t)[ v] µ(t)[ws v] (c(δ)2 1)µ(t)[ws] This gives (H.8). 
Moreover, when α s,t > 0, s th(s, t) = 1 2c(δ)µ(t)[w2 s] 3 2 µ(t)[ vw2 s]m[w2 s] c(δ)µ(t)[w2 s] 1 c(δ)µ(t)[w2 s] 1 2 m[ws] + 1 2c(δ)µ(t)[w2 s] 3 2 µ(t)[ws]m[w2 s] = c(δ)µ(t)[w2 s] 1 2 µ(t)[ws] µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] + c(δ)µ(t)[w2 s] 1 2 µ(t)[ws] sα s,t m[w2 s] 2µ(t)[w2s] m[ws] (i) = µ(t)[ vws]m[w2 s] 2µ(t)[ws]µ(t)[w2s] m[ vws] m[w2 s] 2µ(t)[w2s] m[ws] where (i) uses the optimality equation (H.3). This is consistent with (H.7). Case 2: There is a unique s [0, 1] s.t. vs is essentially constant. As in the proof of Proposition 24, in this case, the previous argument implies that s th(s, t) is C1(0, s ), C1(s , 1), and continuous at 0, 1. The derivative is also given by (H.7). Variance-Reduced Distributionally Robust Q-Learning Again, we show the existence of Ds th that satisfy (C.12). Observe that if s th(s, t) is continuous at s , then applying the fundamental theorem of calculus on the interval [0, s ] and [s , 1] separately, we will have that th(1, t) th(0, t) = Z s 0 s th(s, t)ds + Z 1 s s th(s, t)ds. Hence, taking Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0 will suffice to produce (C.12). It is left to check the continuity at s of th(s, t) = tf(µ(t), vs, α s,t) = 1 2c(δ)m[w2 s]µ(t)[w2 s] 1 from (I.4). Note that on Ωn,p(µ), for all s [0, 1], α vs, , 2c(δ)m[α vs]µ(t)[(α vs)2] 1 1 2µ(t)[(α vs)2] 1 2 dm dµ(t) (i) 1 2 α vs L (µ) 1 µ p 1 2 α vs L (µ) 1 3 4µ where (i) follows from Lemma 32. Also, th(s , t) = 0. Therefore, if sα s,t vs , as s s , then α vs L (µ) 0 as s s , implying continuity at s . It is left to check that sα s,t vs , as s s . To prove this, we assume to the contrary that there is a subsequential limit α sn,t β + vs , for some sequence sn s and β > 0. But by Lemma 47, we must have that 0 = lim n c(δ)2µ(t)[α sn,t vsn]2 µ(t)[(α sn,t vsn)2] = δβ raising a contradiction. This implies that s th(s, t) is continuous at s , and hence (C.12) holds with Ds th(s, t) = s th(s, t) for every s = s and Ds th(s , t) = 0. Therefore, in both cases (H.6) holds with D2h(s, t) is given by (H.7). This gives the claim of the lemma.