# distributionally_robust_qlearning__18799d8f.pdf

Distributionally Robust Q-Learning

Zijian Liu 1, Qinxun Bai 2, Jose Blanchet 3, Perry Dong 4, Wei Xu 2, Zhengqing Zhou 5, Zhengyuan Zhou 1

1New York University, Stern School of Business; 2Horizon Robotics Inc., CA, USA; 3Department of Management Science and Engineering, Stanford University, CA, USA; 4Department of Electrical Engineering and Computer Sciences, UC Berkeley, CA, USA; 5Arena Technologies. Correspondence to: Zhengyuan Zhou. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Reinforcement learning (RL) has demonstrated remarkable achievements in simulated environments. However, carrying this success to real environments requires the important attribute of robustness, which existing RL algorithms often lack, as they assume that the future deployment environment is the same as the training environment (i.e., the simulator) in which the policy is learned. This assumption often fails to hold due to the discrepancy between the simulator and the real environment, and hence renders the learned policy fragile when deployed. In this paper, we propose a novel distributionally robust Q-learning algorithm that learns the best policy under the worst distributional perturbation of the environment. Our algorithm first transforms the infinite-dimensional learning problem (since the environment MDP perturbation lies in an infinite-dimensional space) into a finite-dimensional dual problem and subsequently uses a multi-level Monte-Carlo scheme to approximate the dual value using samples from the simulator. Despite the complexity, we show that the resulting distributionally robust Q-learning algorithm asymptotically converges to the optimal worst-case policy, thus making it robust to future environment changes. Simulation results further demonstrate its strong empirical robustness.

1. Introduction

Reinforcement learning (RL) has demonstrated remarkable empirical success in simulated environments, with applications spanning domains such as robotics (Kober et al., 2013; Gu et al., 2017), computer vision (Sadeghi & Levine, 2016; Huang et al., 2017), finance (Li et al., 2009; Choi et al., 2009; Deng et al., 2017) and games (Silver et al., 2016; 2018). More recently, reinforcement learning has also been applied to economic domains such as personalized promotions in revenue management, inventory control in supply chains and scheduling in queueing networks. However, carrying this success to real environments requires an attribute that is often missing in the existing literature on policy learning: robustness, because existing RL algorithms often make the implicit assumption that the environment in which the policy is trained will be the same as the environment in which the policy will be deployed. In other words, a policy is evaluated in the same environment from which the training data has been generated. While the above assumption is arguably a good starting point1 for developing a rigorous understanding of the algorithmic performance of policy learning, it does not capture the complexity of real-world applications, because the discrepancy between training and deployment environments is common and hard to anticipate (and hence account for) in advance. Two common sources of discrepancy are: 1. Simulator model mis-specification.
In many reinforcement learning (RL) applications, a policy is often first trained in a simulator before being deployed in a real environment. However, simulator models often cannot capture the full complexity of the real environment, and hence will be mis-specified. Further, it is hard to know exactly how the real environment differs from the simulator (otherwise, the simulator would have been augmented/modified to account for that). 2. Environment shifts. The underlying environment may shift due to either non-stationarity or a different deployment environment (for the same task). As an example of the latter, a personalized content recommendation 1Much like supervised learning was first developed under the same assumption, before adversarial training was recognized as a valuable tool to make empirical predictions robust; see, for instance, (Sinha et al., 2018; Goodfellow et al., 2014; Ganin et al., 2016; Zhang et al., 2019; Tram er et al., 2017). The topic of domain adaptation (Shimodaira, 2000; Saerens et al., 2002; Ben David et al., 2010; Courty et al., 2016; Ganin et al., 2016; Sun et al., 2020; Arjovsky et al., 2019; Wang & Deng, 2018; Sagawa et al., 2019) is another approach to address covariate shifts in supervised learning, but the new domain is often assumed to be known. Distributionally Robust Q-Learning engine (learned from existing user browsing data collected in one region or market) may need to be deployed in a different region or market, with different population characteristics. Another example occurs in robotics, where a robot trained to perform certain maneuvers (such as walking (Schulman et al., 2013) or folding laundry (Maitin-Shepard et al., 2010)) in an environment can fail catastrophically (Drew, 2015) in a slightly different environment, where the terrain landscape (in walking) is slightly altered or the laundry object (in laundry folding) is positioned differently. As a result, to learn an effective policy in practice, one must be cognizant of such discrepancies and take them into account during the training stage in order to learn a policy that is robust to the unknown environment changes that cannot be avoided and are difficult to know in advance. Whereas distributionally robust learning in the simpler supervised learning setting has been extensively studied in the past decade (see Section 1.1), an emerging literature has only recently been initiated to study this problem in the RL context (Si et al., 2020b; Zhou et al., 2021; Yang et al., 2021; Kishan & Kalathil, 2021). In particular, a common and natural formulation adopted therein is to consider distributional shifts for rewards and/or transition probabilities (defined via different divergence measures or distances between probability distributions), and to learn from data (collected from some generative model) the best policy under the worstcase distributional shift, which thus carries a certain level of robustness to the unknown environment shifts. Currently, all the existing distributionally robust policy learning algorithms mentioned above (Si et al., 2020b;a; Zhou et al., 2021; Yang et al., 2021; Kishan & Kalathil, 2021) are model-based, which estimate the underlying MDP first before provisioning some policy from it. 
Although model-based methods are often more sampleefficient and easier to analyze, their drawbacks are also wellunderstood (Sutton & Barto, 2018; Franc ois-Lavet et al., 2018): they are computationally intensive; they require more memory to store MDP models and often do not generalize well to non-tabular RL settings. These issues limit the practical applicability of model-based algorithms, which stand in contrast to model-free algorithms that learn to select actions without first learning an MDP model. Such methods are often more computationally efficient, have less storage overhead, and better generalize to RL with function approximation. In particular, Q-learning (Watkins & Dayan, 1992), as the prototypical model-free learning algorithm, has widely been both studied theoretically and deployed in practical applications. However, Q-learning is not robust (as demonstrated in our simulations), and the policy learned by Q-learning in one environment can perform poorly in another under a worst-case shift (with bounded magnitude). As such, we are naturally led to the following research ques- Can we design a variant of Q-Learning that is distributionally robust? 1.1. Related Work Robustness has been studied in several settings that are related to (but different from) our investigation. For instance, a rich literature has explored distributionally robust learning and optimization in supervised learning (Bertsimas & Sim, 2004; Delage & Ye, 2010; Hu & Hong, 2013; Shafieezadeh-Abadeh et al., 2015; Bayraksan & Love, 2015; Gao & Kleywegt, 2016; Namkoong & Duchi, 2016; Duchi et al., 2016; Staib & Jegelka, 2017; Shapiro, 2017; Lam & Zhou, 2017; Chen et al., 2018; Volpi et al., 2018; Lee & Raginsky, 2018; Nguyen et al., 2018; Yang, 2020; Mohajerin Esfahani & Kuhn, 2018; Zhao & Jiang, 2017; Abadeh et al., 2018; Zhao & Guan, 2018; Gao et al., 2018; Ghosh & Lam, 2019; Blanchet & Murthy, 2019; Duchi & Namkoong, 2018; Lam, 2019; Duchi et al., 2019), where the underlying testing distribution is still the same as the training distribution, and the learner merely uses the distributional uncertainty (around the empirical distribution) to guard against over-generalization due to lack of data. As such, the statistical learning results in this area focus on the setting where the distributional uncertainty decreases with the sample size (and as such, is part of the algorithm rather than the environment); further, it has been well recognized that under certain conditions, this approach is formally equivalent to regularization (Duchi & Namkoong, 2019; Duchi et al., 2016; Gao et al., 2017; Shafieezadeh-Abadeh et al., 2019), which prevents over-fitting in the small data regime. On the other hand, learning predictive rules in testing distributions that are different from and often perturbations of training distributions have also been studied in (Sinha et al., 2018; Goodfellow et al., 2014; Ganin et al., 2016; Zhang et al., 2019; Tram er et al., 2017), where the training procedure is robustified by first perturbing the original dataset with some synthesized noise before solving a empirical risk minimization problem, which has been observed to work well in a robust manner. Going beyond the supervised learning setting, a related area to ours is distributionally robust Markov decision processes (MDPs) (Gonz alez-Trejo et al., 2002; Iyengar, 2005; Xu & Mannor, 2010; El Ghaoui & Nilim, 2005; Wiesemann et al., 2013; Wolff et al., 2012; Mannor et al., 2016; Morimoto & Doya, 2005; Yang, 2020). 
This line of work has studied the known environment MDP setting (hence no learning) and has mainly focused on the computational issues2. Closest to 2For instance, they have characterized various types of uncertainty sets, and have shown that for almost all of them, the optimal distributionally robust policy is NP-hard to compute. For rectangular uncertainty sets (to which our formulation belongs), such Distributionally Robust Q-Learning our work is the recent distributionally robust policy learning work already mentioned (Si et al., 2020b; Zhou et al., 2021; Yang et al., 2021; Kishan & Kalathil, 2021): (Si et al., 2020b) studies the special case of distributionally robust policy learning in contextual bandits (i.e. horizon is 1), while the other three study the infinite-horizon discounted RL setting. 1.2. Our Contributions We design a distributionally robust Q-learning algorithm that has two features beyond the standard Q-learning algorithm. The first feature lies in the new values that the algorithm aims to learn: instead of Q-values, the algorithm now aims to estimate the distributionally robust Q-values. We achieve this by leveraging strong duality to transform the distributionally robust Q-values (an infinite-dimensional object since the distributional uncertainty set is infinitedimensional) into a finite-dimensional quantity, also known as the distributionally robust Bellman operator. Second, we design a novel multi-level Monte-Carlo estimator to unbiasedly estimate the distributionally robust Bellman operator. Through a careful analysis of our estimator s bias and variance (see Theorem 3.7 and Theorem 3.8), we show that the distributionally robust Q-learning algorithm asymptotically converges to optimal distributionally robust policy (Theorem 3.10). As such, our results provide an initial affirmative answer to the open question raised above, and the convergence result provides a distributionally robust counterpart to the well-known asymptotic convergence of Q-learning (Jaakkola et al., 1994). Finally, we provide simulation results to demonstrate the robustness of the policy learned by the proposed distributionally-robust Q-learning algorithm. 1.3. Comparison with Existing Work The distributionally robust Bellman equation was obtained before in (Iyengar, 2005). However, this is merely a tool for us to solve the problem in the dual space. Our main contribution is to design a distributionally robust Q-learning algorithm based on it. (Xu & Mannor, 2010) does not propose any new algorithm for learning a distributionally robust policy but only proved some properties of distributionally robust MDP. (Yang, 2020) considers the Wasserstein distance and uses a model-based algorithm that is totally different from ours. Our algorithm is the first model-free algorithm ever developed on this problem. Prior to our work, it is not clear at all whether Q-learning can be made robust, since all the existing algorithms in this area are model-based. computation can be done efficiently in polynomial time, although depending on how such uncertainty sets are specified, some of the proposed algorithms require oracle access to solving an infinitedimensional problem, and hence are infeasible. 2. Distributionally Robust Policy Learning with a Simulator 2.1. 
Standard Policy Learning Let M = (S, A, P, R, γ) be a tabular RL environment, where S and A are finite state space and action space respectively, R : S A P(R 0) (the set of random variables that are supported on R 0) is the randomized reward function, P = {ps,a( )}(s,a) S A is the transition model, and γ (0, 1) is the discount factor. We assume that the transition is Markovian, i.e., at each state s S, if action a A is chosen, then the subsequent state is determined by the conditional distribution ps,a( ) = p( |s, a). The decision maker will therefore receive a randomized reward r(s, a). The value function V π(s) provides the expected cumulative discounted reward under the policy π with initial state s S, t=1 γt 1r(st, at) s1 = s Hence, the optimal value function is: V (s) := max π Π V π(s), s S, where Π denotes the class of random policies. It is well known that the optimal value function is the solution of the following Bellman s equation: V (s) = max a A E [r(s, a)] + γEps,a [V (s )] , s S. From here we define the optimal Q-function, Q as Q (s, a) = E [r(s, a)] + γEps,a [V (s )] = E [r(s, a)] + γEps,a max a A Q (s , a ) . Throughout this paper, we impose the following assumption on the rewards. Assumption 2.1. (Bounded rewards) For any (s, a) S A, r(s, a) [0, Rmax]. 2.2. A Distributionally Robust Formulation In reality, the transition model P and rewards R in M are subjected to change over time, this motivates us to learn a policy that is robust to certain perturbation in the environment. In particular, we consider the setting of distributionally robust RL, where both transition probabilities and rewards might be perturbed w.r.t. the Kullback-Leibler (KL) divergence DKL(P Q) = R log d P d Q d P whenever P Q (P is absolutely continuous with respect to Q). Remark 2.2. We pick KL divergence simply because it is one common divergence measure that is easy to analyze (given the already complicated nature of the problem). Our results can also be generalized to f-divergence. Distributionally Robust Q-Learning In the original environment, let P0 = {p0 s,a}(s,a) S A be the transition probabilities, ν0 be the joint distribution of {r(s, a)}(s,a) S A, where r(s, a) ν0 s,a (marginal distribution with respect to (s, a)). To construct the distributional uncertainty setting, for each (s, a) S A, we define robust KL balls that are centered at p0 s,a and ν0 s,a by Ps,a(δ) := ps,a |S| : DKL ps,a p0 s,a δ Rs,a(δ) := r(s, a) νs,a : DKL(νs,a ν0 s,a) δ respectively. Here δ > 0 is the level of distributional robustness, and |S| stands for the |S| 1 dimensional probability simplex. Now we are able to build the uncertainty set P(δ) by the Cartesian product of Ps,a(δ) for each (s, a) S A. This type of uncertainty set is called (s, a) rectangular set in standard literature (Wiesemann et al., 2013). Similarly we define R(δ) by the Cartesian product of Rs,a(δ) for each (s, a) S A. In the distributionally robust framework, the adversarial player is assumed to pick the worst-case transition model and rewards that minimize the expected cumulative discounted reward. To be clear, we define the distributionally robust value function as follows. Definition 2.3. 
Given δ > 0 and policy π Π, the distributionally robust value function V rob,π δ is defined as: V rob,π δ (s) := inf p P(δ),r R(δ) Ep,r t=1 γt 1r(st, at) Following the definition, V rob,π δ measures the quality of a policy π by computing its performance in the worst-case environment among the set of all possible environments that perturb the original transition P0 and reward distribution ν0 under a δ-KL ball. The optimal distributionally robust value function is therefore defined by V rob, δ (s) := max π Π V rob,π δ (s), s S. 2.3. Strong Duality Following the well known results in (Iyengar, 2005; Xu & Mannor, 2010), we can write down the distributionally robust dynamic programing for the distributionally robust value function V rob,π δ in Equation (1) as follows: V rob,π δ (s) = inf ps,π(s) Ps,π(s)(δ), r Rs,π(s)(δ) n E[r(s, π(s))] + γEps,π(s) h V rob,π δ (s ) io . Moreover, we can write down the distributionally robust Bellman s equation for the optimal value function as follows: V rob, δ (s) = max a A inf ps,a Ps,a(δ), r Rs,a(δ) n E [r(s, a)] + γEps,a h V rob, δ (s ) io . Note that both Equation (2) and Equation (3) are in general computationally intractable since they involve infinite dimensional optimization. To address this issue, we introduce the following strong duality lemma from distributionally robust optimization under KL-perturbation. Lemma 2.4 ((Hu & Hong, 2013), Theorem 1). Suppose H(X) has finite moment generating function in the neighborhood of zero. Then for any δ > 0, sup P :DKL(P P0) δ EP [H(X)] n α log EP0 h e H(X)/αi + αδ o . By Lemma 2.4, we can transform the Equation (2) to the following equation. V rob,π δ (s) = n α log Eν0 s,π(s) h e r(s,π(s))/αi αδ o + n β log Ep0 s,π(s) h e V rob,π δ (s )/βi βδ o . As a direct consequence of the Equation (4) (note that the size of Π is finite in the tabular setting, see Theorem 3.2 in (Iyengar, 2005) for a standard proof), the optimal distributionally robust value function V rob, δ in fact satisfies the following distributionally robust Bellman s equation. V rob, δ (s) = n α log Eν0s,a h e r(s,a)/αi αδ o + n β log Ep0 s,a h e V rob, δ (s )/βi βδ o) 3. Q-Learning in Distributionally Robust RL 3.1. Review of Q-Learning The Q-learning algorithm determines the optimal Qfunction using point samples. Let π be some random policy such that P(π(s) = a) > 0 for all state-action pairs (s, a) S A. Suppose at time t, we draw a sample Distributionally Robust Q-Learning (st, at, rt, s t) from the environment according to the policy π. Then, Q-learning uses the following update rules Qt+1(st, at) =Qt(st, at) + αt(st, at) rt + γ max b A Qt(s t, b) Qt(st, at) =(1 αt(st, at))Qt(st, at) + αt(st, at) rt + γ max b A Qt(s t, b) , where the step-sizes αt will be properly chosen. The key reason why Q-learning works is that both rt and maxb A Qt(s t, b) are unbiased estimators of E[r(st, at)] and Epst,at [maxb A Qt(s , b)]. Therefore we can use the stochastic approximation theorem to show that Qt converges to Q with careful choice of step-sizes. For more details, please read (Melo, 2001; Even-Dar et al., 2003). 3.2. Distributionally Robust Q-Learning From Equation (3) and Equation (5), we know the optimal distributionally robust Q-function Qrob, δ satisfies Qrob, δ (s, a) = inf ps,a Ps,a(δ), r Rs,a(δ) E[r(s, a)] + γEps,a max b A Qrob, δ (s , b) n α log Eν0s,a h e r(s,a)/αi αδ o + n β log Ep0s,a h e maxb A Qrob, δ (s ,b)/βi βδ o . By analogy with the Bellman Operator, we can define the δ-distributionally robust Bellman Operator as follows: Definition 3.1. 
Given δ > 0 and Q RS A, the δdistributionally robust Bellman Operator T rob δ : RS A RS A is defined as T rob δ (Q)(s, a) := n α log Eν0s,a h e r(s,a)/αi αδ o + n β log Ep0 s,a h e maxb A Q(s ,b)/βi βδ o . Remark 3.2. We define the δ-distributionally robust Bellman Operator by using its dual form. However, it may be more convenient to use the primal form T rob δ (Q)(s, a) = inf ps,a Ps,a(δ), r Rs,a(δ) E[r(s, a)] + γEps,a max b A Q(s , b) . Our definition implies that Qrob, δ is a fixed point of T rob δ . Now suppose we have a simulator that allows us to sample (r, s ) from (ν0 s,a, p0 s,a), can we come up with a nice unbiased estimator of T rob δ (Q)(s, a)? The plug-in estimator of T rob δ (Q)(s, a) is in fact biased (because of the non-linearity). For instance, if we just take one sample, say (s, a, r, s ). The corresponding plug-in estimator of T rob δ (Q)(s, a) is n α log e r/α αδ o + n β log e maxb A Q(s ,b)/β βδ o =r + γ max b A Q(s , b), which is the same as what we have in standard Q-learning. It is obviously not an unbiased estimator of T rob δ (Q)(s, a). To address this issue, we propose a new estimator of T rob δ by introducing the multilevel Monte-Carlo method (Blanchet & Glynn, 2015; Blanchet et al., 2019). The formal description of our estimator is defined as follows: Definition 3.3. Given δ > 0, ε (0, 0.5) and Q RS A, the δ-distributionally robust estimator b T rob δ,ε is defined as: b T rob δ,ε (Q)(s, a) := b Rrob δ (s, a) + γ b T rob δ (Q)(s, a). (8) For b Rrob δ (s, a) and b T rob δ (s, a), we firstly sample N N from the distribution P(N = n) = pn = ε(1 ε)n, then use the simulator to draw 2N+1 samples (ri, s i) from (ν0 s,a, p0 s,a). Finally we define b Rrob δ (s, a) := r1 + r N,δ p N , . (9) b T rob δ (Q)(s, a) := max b A Q(s 1, b) + q N,δ(Q) 1 2 sup α 0 i=1 e r2i/α 1 2 sup α 0 i=1 e r2i 1/α Distributionally Robust Q-Learning q N,δ(Q) := i=1 e maxb A Q(s i,b)/β 1 2 sup β 0 i=1 e maxb A Q(s 2i,b)/β 1 2 sup β 0 i=1 e maxb A Q(s 2i 1,b)/β Using the estimator b T rob δ,ε , our Distributionally Robust QLearning algorithm is summarized in Algorithm 1. Algorithm 1 Distributionally Robust Q-Learning Input: Uncertainty radius δ > 0, parameter ε (0, 0.5). Initialization: b Qrob δ,t (s, a) 0, (s, a) S A, t = 1. repeat for every (s, a) S A do Sample N N from P(N = n) = pn = ε(1 ε)n. Draw 2N+1 samples (ri, s i) from the simulator. Compute r N,δ and q N,δ( b Qrob δ,t ) through Equation (11) and Equation (12) respectively. b Rrob δ (s, a) r1 + r N,δ p N b T rob δ ( b Qrob δ,t )(s, a) max b A b Qrob δ,t (s 1, b) + q N,δ( b Qrob δ,t ) p N b T rob δ,ε ( b Qrob δ,t )(s, a) b Rrob δ (s, a) + γ b T rob δ ( b Qrob δ,t )(s, a) b Qrob δ,t+1(s, a) (1 αt) b Qrob δ,t (s, a) + αt b T rob δ,ε ( b Qrob δ,t )(s, a) end for t t + 1. until b Qrob δ,t converges Remark 3.4. In Algorithm 1, we can choose any αt which satisfies the Robbins Monro Condition, i.e., P i=1 αt = , P i=1 α2 t < . Remark 3.5. δ is an exogenous variable that quantifies the level of conservatism, which we consider as pre-determined (and not part of the algorithm). That said, it can also be from past datasets if they are collected from different environments, in which case the shift can be estimated. 3.3. Theoretical Guarantee First of all, we introduce Lemma 3.6 which plays a critical role in the proof of convergence of Algorithm 1. Lemma 3.6. Given δ > 0, 0 < γ < 1, T rob δ is a γcontraction map w.r.t. the infinity norm. Proof. Given any Q1, Q2 RS A, let Vi(s ) = maxb A Qi(s , b) for i {1, 2}. Fix a pair (s, a) S A. 
By applying the primal form of T rob δ (Equation (7)), we have T rob δ (Q1)(s, a) T rob δ (Q2)(s, a) = γ inf ps,a Ps,a(δ) Eps,a [V1(s )] inf ps,a Ps,a(δ) Eps,a [V2(s )] . Consider any ϵ > 0, suppose p(ϵ) Ps,a(δ) makes Ep(ϵ) [V2(s )] infps,a Ps,a(δ) Eps,a [V2(s )] + ϵ. Then we know inf ps,a Ps,a(δ) Eps,a [V1(s )] inf ps,a Ps,a(δ) Eps,a [V2(s )] Ep(ϵ) [V1(s ) V2(s )] + ϵ max s S |V1(s ) V2(s )| + ϵ max (s ,b) S A |Q1(s , b) Q2(s , b)| + ϵ = Q1 Q2 + ϵ. Combining the previous part, we have ϵ > 0, T rob δ (Q1)(s, a) T rob δ (Q2)(s, a) γ Q1 Q2 + γϵ, which implies T rob δ (Q1)(s, a) T rob δ (Q2)(s, a) γ Q1 Q2 . By a similar argument, we also have T rob δ (Q2)(s, a) T rob δ (Q1)(s, a) γ Q1 Q2 . Hence, there is |T rob δ (Q1)(s, a) T rob δ (Q2)(s, a)| γ Q1 Q2 . Note that the above result is true for any (s, a) S A, which implies T rob δ (Q1) T rob δ (Q2) γ Q1 Q2 . Next, we give the key theorem of our estimator b T rob δ,ε . Theorem 3.7. Given δ > 0. If Assumption 2.1 holds, then for any ε (0, 0.5), b T rob δ,ε is an unbiased estimator of T rob δ , i.e., Q RS A, (s, a) S A, E h b T rob δ,ε (Q)(s, a) i = T rob δ (Q)(s, a). Distributionally Robust Q-Learning Due to the page limitation, we defer the whole proof of Theorem 3.7 into Appendix B and give a proof outline here. By the law of total probability, we first can get E[ b Rrob δ (s, a)] = lim n sup α 0 i=1 e ri/α ! a similar result will hold for E[ b T rob δ (Q)(s, a)]. One key technical point helping us to move on is to understand α = argmax α 0 n α log Eν0s,a h e r(s,a)/αi o , and β defined in a similar way. By employing different events defined in Appendix B, we finally can derive lim n sup α 0 i=1 e ri/α ! n α log Eν0s,a h e r(s,a)/αi o , which completes the unbiasedness of b Rrob δ (s, a). Similar result will give us the unbiasedness of b T rob δ (Q)(s, a). Besides, we introduce another key property of b T rob δ,ε . Theorem 3.8. Given δ > 0, ε (0, 0.5). If Assumption 2.1 holds, then there exists a constant C(δ, ε, ν0, P0) > 0 such that for any Q RS A and (s, a) S A, Var h b T rob δ,ε (Q)(s, a) i C(δ, ε, ν0, P0)(1 + Q 2 ). The proof of Theorem 3.8 is also deferred to Appendix C. Remark 3.9. For Theorem 3.8, our setting is different from (Blanchet & Glynn, 2015; Blanchet et al., 2019), which requires a careful analysis and a different proof strategy. One key difference is that expectation terms in Equation (6) are inside the log function. However, in (Blanchet & Glynn, 2015; Blanchet et al., 2019), expectations are always at the most outside. The non-linearity of the log function makes the analysis more difficult. Another technical point is we need to understand the optimizers of T rob δ , i.e., α = argmax α 0 n α log E h e r(s,a)/αi αδ o , β = argmax β 0 n β log Ep0s,a h e maxb A Q(s ,b)/βi βδ o , where proof techniques are different for α , β = 0 and α , β = 0. Finally, combing previous results, we are able to establish the following convergence guarantee for Algorithm 1. Theorem 3.10. Given δ > 0. If Assumption 2.1 holds, then for any ε (0, 0.5), b Qrob δ,t in Algorithm 1 will converge to Qrob, δ as t . Proof. Following the proof in (Melo, 2001; Even-Dar et al., 2003), we can rewrite the update rule of Algorithm 1 as D( b Qrob δ,t+1)(s, a) = (1 αt)D( b Qrob δ,t )(s, a) + αt(b T rob δ,ε ( b Qrob δ,t ) Qrob, δ )(s, a), where D(Q) := Q Qrob, δ . 
Combining Lemma 3.6, Theorem 3.7 and the fact that $T^{\mathrm{rob}}_\delta(Q^{\mathrm{rob},*}_\delta) = Q^{\mathrm{rob},*}_\delta$, we have
$$\mathbb{E}\big[(\widehat{T}^{\mathrm{rob}}_{\delta,\varepsilon}(\widehat{Q}^{\mathrm{rob}}_{\delta,t}) - Q^{\mathrm{rob},*}_\delta)(s,a)\big] = (T^{\mathrm{rob}}_\delta(\widehat{Q}^{\mathrm{rob}}_{\delta,t}) - Q^{\mathrm{rob},*}_\delta)(s,a) = (T^{\mathrm{rob}}_\delta(\widehat{Q}^{\mathrm{rob}}_{\delta,t}) - T^{\mathrm{rob}}_\delta(Q^{\mathrm{rob},*}_\delta))(s,a) \le \gamma \,\|D(\widehat{Q}^{\mathrm{rob}}_{\delta,t})\|_\infty.$$
Next, by Theorem 3.8, we have
$$\mathrm{Var}\big[(\widehat{T}^{\mathrm{rob}}_{\delta,\varepsilon}(\widehat{Q}^{\mathrm{rob}}_{\delta,t}) - Q^{\mathrm{rob},*}_\delta)(s,a)\big] = \mathrm{Var}\big[\widehat{T}^{\mathrm{rob}}_{\delta,\varepsilon}(\widehat{Q}^{\mathrm{rob}}_{\delta,t})(s,a)\big] \le C(\delta,\varepsilon,\nu^0,P^0)\big(1 + \|\widehat{Q}^{\mathrm{rob}}_{\delta,t}\|_\infty^2\big) = C(\delta,\varepsilon,\nu^0,P^0)\big(1 + \|D(\widehat{Q}^{\mathrm{rob}}_{\delta,t}) + Q^{\mathrm{rob},*}_\delta\|_\infty^2\big) \le 2\big(1 + \|Q^{\mathrm{rob},*}_\delta\|_\infty^2\big)\,C(\delta,\varepsilon,\nu^0,P^0)\big(1 + \|D(\widehat{Q}^{\mathrm{rob}}_{\delta,t})\|_\infty^2\big).$$
Also noting that $\alpha_t$ satisfies the Robbins-Monro condition, these intermediate results yield the convergence of Algorithm 1 by Theorem 1 of (Jaakkola et al., 1994).

4. Numerical Experiments

We use a supply chain model to test Algorithm 1 in our numerical experiments. In the supply chain model, the state space $S = [n] := \{0, 1, \dots, n-1, n\}$ represents the possible number of goods we have, and the action space $A = [n]$ represents the number of goods we can order. On each day $t \in \mathbb{N}_+$, the number of goods demanded by the market is an unobserved random variable $d_t \sim \mathrm{Uni}[n]$. We assume $\{d_t\}_{t \in \mathbb{N}_+}$ are mutually independent. At the start of day $t$, if the number of goods on hand is $s_t$, we determine the number of goods to order, i.e., our action $a_t \in [n - s_t]$. We assume there is no delay in receiving the order. The cost $c_t$ on day $t$ is
$$c_t = k\,\mathbb{1}[a_t > 0] + h\,(s_t + a_t - d_t)_+ + p\,(d_t - s_t - a_t)_+,$$
where $h$ and $p$ are pre-specified constants denoting the holding cost per unit of goods and the per-unit lost-sales penalty, respectively.

Table 1. Distributionally robust policy for different ε.
ε       π̂rob_δ,ε(0)  π̂rob_δ,ε(1)  π̂rob_δ,ε(2)  π̂rob_δ,ε(3)  π̂rob_δ,ε(4)  π̂rob_δ,ε(s ≥ 5)
0.49    7            6            5            4            3            0
0.499   7            6            5            4            3            0
0.5     7            6            5            4            3            0
0.6     7            6            5            4            3            0

We can decompose $c_t$ into two parts, $h(s_t + a_t - d_t)_+$ and $p(d_t - s_t - a_t)_+$: a holding cost of $h(s_t + a_t - d_t)_+$ is charged on remaining goods, and a lost-sales penalty of $p(d_t - s_t - a_t)_+$ is incurred if demand cannot be served due to insufficient goods. $k$ is a fixed ordering cost. In our experiments, due to limited computational resources, we fix $n = 10$. We take $h = 1$, $p = 2$, $k = 3$, set the discount factor $\gamma = 0.9$, and start from $s_1 = 0$. By using standard Q-learning, one can find the non-distributionally-robust optimal policy $\pi^*$ for the unperturbed problem, i.e., $\delta = 0$, which satisfies
$$\pi^*(s) = \begin{cases} 8 - s & 0 \le s \le 2 \\ 0 & 3 \le s \le 10 \end{cases}.$$
For the distributionally robust setting, we consider the possibility that the distribution of $d_t$ is no longer uniform, which is quite reasonable in reality. In the simulation, we set $\delta = 1$ as the perturbation parameter. At the $k$-th step of Algorithm 1, we set the learning rate $\alpha_k = \frac{1}{1 + (1-\gamma)(k-1)}$ to satisfy the Robbins-Monro condition. We treat Algorithm 1 as having converged when the infinity norm of the difference between the updated value and the old value is no greater than a tolerance of 0.05. For the parameter ε used in our estimator, we consider ε ∈ {0.49, 0.499, 0.5, 0.6}. Note that Theorem 3.10 only ensures convergence for ε ∈ (0, 0.5), but we also test some values outside this range. The reason why we only choose {0.49, 0.499} rather than adding other, smaller values of ε ∈ (0, 0.5) is that, from the construction of our estimator, a smaller ε leads to a huge number of samples with high probability; to avoid this problem, we only test ε close to 0.5.
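Before turning to the results, the following is a minimal Python sketch of the supply chain cost model described above. The function and variable names are our own illustrative choices (not from the paper), and the next-state rule $s_{t+1} = (s_t + a_t - d_t)_+$ is our assumption that unmet demand is lost, which the paper does not state explicitly.

```python
import numpy as np

def supply_chain_step(s, a, d, h=1.0, p=2.0, k=3.0):
    """One day of the supply chain model: state s (goods on hand),
    action a (goods ordered, received without delay), demand d.
    Returns the cost c_t and the (assumed) next state."""
    on_hand = s + a
    holding = h * max(on_hand - d, 0)      # h * (s + a - d)_+
    lost_sales = p * max(d - on_hand, 0)   # p * (d - s - a)_+
    ordering = k * (a > 0)                 # fixed ordering cost k * 1[a > 0]
    cost = ordering + holding + lost_sales
    next_state = max(on_hand - d, 0)       # assumption: unmet demand is lost
    return cost, next_state

# Example rollout with n = 10, uniform demand on {0, ..., 10}, and the
# robust policy reported in Table 1 (order up to 7 when s <= 4).
rng = np.random.default_rng(0)
n, s, total_cost = 10, 0, 0.0
for t in range(30):
    a = 7 - s if s <= 4 else 0
    d = int(rng.integers(0, n + 1))        # d_t ~ Uni[n]
    c, s = supply_chain_step(s, a, d)
    total_cost += c
print(total_cost)
```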
To reduce the effect of variance, for every ε we run 5 independent trials and use the averaged value as the final output $\widehat{Q}^{\mathrm{rob}}_{\delta,\varepsilon}$. Finally, we use $\widehat{Q}^{\mathrm{rob}}_{\delta,\varepsilon}$ to find the greedy policy
$$\widehat{\pi}^{\mathrm{rob}}_{\delta,\varepsilon}(s) = \operatorname*{argmax}_{a \in [n-s]} \widehat{Q}^{\mathrm{rob}}_{\delta,\varepsilon}(s, a), \quad s \in S,$$
for every ε. The output greedy policy $\widehat{\pi}^{\mathrm{rob}}_{\delta,\varepsilon}$ for every ε is listed in Table 1. We can see that all policies are the same as
$$\widehat{\pi}^{\mathrm{rob}}_{\delta}(s) = \begin{cases} 7 - s & 0 \le s \le 4 \\ 0 & 5 \le s \le 10 \end{cases}.$$
In Table 2, we list the averaged number of iterations and averaged number of samples used for every ε. Combining our results in Tables 1 and 2, we see that Algorithm 1 may converge with even fewer samples when ε ∉ (0, 0.5).

Table 2. Averaged number of iterations and samples used for different ε.
ε       #ITERATIONS   #SAMPLES
0.49    1869.6        2959020.8
0.499   1815.4        2398951.2
0.5     1864.6        2478958.8
0.6     1864.8        820703.6

Now we define a perturbed uniform distribution with parameters m and b as follows:
$$\mathrm{Uni}_{m,b}([n])(x) = \begin{cases} \frac{b+1}{n+1} & x \in \{m, m+1\} \\ \frac{n-1-2b}{n^2-1} & x \notin \{m, m+1\} \end{cases}.$$
With the perturbed distribution, we test our distributionally robust policy $\widehat{\pi}^{\mathrm{rob}}_{\delta}$ and the non-distributionally-robust policy $\pi^*$ for b ∈ {1, 1.5, 2, 2.5} (note that b cannot be too large, otherwise there will exist some pair (s, a) such that $D_{\mathrm{KL}}(p_{s,a} \| p^0_{s,a}) > \delta$ or $D_{\mathrm{KL}}(\nu_{s,a} \| \nu^0_{s,a}) > \delta$) and every m ∈ [n − 1]. We report the total cost averaged over 2000 runs for different b in Figures 1 to 4. With varying test probabilities, our distributionally robust policy performs better in the worst cases, namely when the probability distribution of demand is centered in {5, 6, 7} instead of being uniform, demonstrating again the effectiveness of our proposed distributionally robust formulation.

[Figure 1. Total Cost for b = 1. Figure 2. Total Cost for b = 1.5. Figure 3. Total Cost for b = 2. Figure 4. Total Cost for b = 2.5.]

5. Conclusion

In this paper, we proposed a novel unbiased estimator of the distributionally robust Bellman operator. Using this estimator, we proposed a novel Q-learning algorithm, distributionally robust Q-learning, that is able to learn a robust Q value function under KL-divergence perturbations of the transition probabilities and rewards. We established the asymptotic convergence guarantee of the proposed distributionally robust Q-learning algorithm. Several open problems suggest themselves. First, although asymptotic convergence is desirable, it would also be interesting to obtain finite-sample guarantees for the distributionally robust Q-learning algorithm. We believe such a result would require a completely different line of analysis, whose scope goes significantly beyond this paper. Second, this paper focused on the infinite-horizon discounted RL setting. A much more challenging but also highly useful setting to consider is (infinite-horizon) average-reward RL (Dong et al., 2021). RL in this setting is less explored, and distributionally robust policy learning in this setting poses significant technical challenges. Finally, another important direction of research is to generalize the results to high-dimensional state settings (Ren & Zhou, 2020), where the intrinsic dimension is low. Data efficiency in such settings will be of particular importance. We look forward to these problems being addressed by the emerging distributionally robust reinforcement learning community.

Acknowledgements

This work is generously supported by the Horizon Robotics faculty award. This work is additionally supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397 and NSF grants 1915967 and 2118199.
Zijian Liu would like to thank Junfu Yao for discussions. Distributionally Robust Q-Learning Abadeh, S. S., Nguyen, V. A., Kuhn, D., and Esfahani, P. M. Wasserstein distributionally robust kalman filtering. In Advances in Neural Information Processing Systems, pp. 8483 8492, 2018. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez Paz, D. Invariant risk minimization. ar Xiv preprint ar Xiv:1907.02893, 2019. Bayraksan, G. and Love, D. K. Data-driven stochastic programming using phi-divergences. In The Operations Research Revolution, pp. 1 19. Catonsville: Institute for Operations Research and the Management Sciences, 2015. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine learning, 79(1):151 175, 2010. Bertsimas, D. and Sim, M. The price of robustness. Operations Research, 52(1):35 53, 2004. Blanchet, J. and Murthy, K. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565 600, 2019. doi: 10.1287/moor. 2018.0936. Blanchet, J. H. and Glynn, P. W. Unbiased monte carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pp. 3656 3667. IEEE, 2015. Blanchet, J. H., Glynn, P. W., and Pei, Y. Unbiased multilevel monte carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. ar Xiv preprint ar Xiv:1904.09929, 2019. Chen, Z., Kuhn, D., and Wiesemann, W. Data-driven chance constrained programs over wasserstein balls. ar Xiv preprint ar Xiv:1809.00210, 2018. Choi, J. J., Laibson, D., Madrian, B. C., and Metrick, A. Reinforcement learning and savings behavior. The Journal of finance, 64(6):2515 2534, 2009. Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9): 1853 1865, 2016. Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations research, 58(3):595 612, 2010. Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653 664, 2017. Dong, S., Van Roy, B., and Zhou, Z. Simple agent, complex environment: Efficient reinforcement learning with agent states. ar Xiv preprint ar Xiv:2102.05261, 2021. Drew, K. California robot teaching itself to walk like a human toddler. NBC News, Dec 2015. Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. ar Xiv preprint ar Xiv:1810.08750, 2018. Duchi, J. and Namkoong, H. Variance-based regularization with convex objectives. The Journal of Machine Learning Research, 20(1):2450 2504, 2019. Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. ar Xiv preprint ar Xiv:1610.03425, 2016. Duchi, J., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. ar Xiv preprint ar Xiv:2007.13982, 2019. El Ghaoui, L. and Nilim, A. Robust solutions to markov decision problems with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. Even-Dar, E., Mansour, Y., and Bartlett, P. Learning rates for q-learning. Journal of machine learning Research, 5 (1), 2003. Fournier, N. and Guillin, A. 
On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707 738, 2015. Franc ois-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. An introduction to deep reinforcement learning. ar Xiv preprint ar Xiv:1811.12560, 2018. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096 2030, 2016. Gao, R. and Kleywegt, A. J. Distributionally robust stochastic optimization with Wasserstein distance. ar Xiv preprint ar Xiv:1604.02199, 2016. Gao, R., Chen, X., and Kleywegt, A. J. Wasserstein distributional robustness and regularization in statistical learning. ar Xiv e-prints, pp. ar Xiv 1712, 2017. Gao, R., Xie, L., Xie, Y., and Xu, H. Robust hypothesis testing using wasserstein uncertainty sets. In Advances in Neural Information Processing Systems, pp. 7902 7912, 2018. Distributionally Robust Q-Learning Ghosh, S. and Lam, H. Robust analysis in stochastic simulation: Computation and performance guarantees. Operations Research, 2019. Gonz alez-Trejo, J., Hern andez-Lerma, O., and Hoyos Reyes, L. F. Minimax control of discrete-time stochastic systems. SIAM Journal on Control and Optimization, 41 (5):1626 1659, 2002. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389 3396. IEEE, 2017. Hu, Z. and Hong, L. J. Kullback-leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013. Huang, C., Lucey, S., and Ramanan, D. Learning policies for adaptive tracking with deep feature cascades. ICCV, pp. 105 114, 2017. Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. Jaakkola, T., Jordan, M. I., and Singh, S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural computation, 6(6):1185 1201, 1994. Kishan, P. and Kalathil, D. Sample complexity of robust reinforcement learning with a generative model. ar Xiv preprint ar Xiv:2112.01506, 2021. Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238 1274, 2013. Lam, H. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4):1090 1105, 2019. Lam, H. and Zhou, E. The empirical likelihood approach to quantifying uncertainty in sample average approximation. Operations Research Letters, 45(4):301 307, 2017. Lee, J. and Raginsky, M. Minimax statistical learning with wasserstein distances. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pp. 2692 2701, USA, 2018. Curran Associates Inc. Li, Y., Szepesvari, C., and Schuurmans, D. Learning exercise policies for american options. In Artificial Intelligence and Statistics, pp. 352 359, 2009. Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., and Abbeel, P. Cloth grasp point detection based on multipleview geometric cues with application to robotic towel folding. In 2010 IEEE International Conference on Robotics and Automation, pp. 
2308 2315, 2010. Mannor, S., Mebel, O., and Xu, H. Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484 1509, 2016. Melo, F. S. Convergence of q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep, pp. 1 4, 2001. Mohajerin Esfahani, P. and Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115 166, Sep 2018. ISSN 1436-4646. doi: 10.1007/s10107-017-1172-1. Morimoto, J. and Doya, K. Robust reinforcement learning. Neural computation, 17(2):335 359, 2005. Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with fdivergences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2216 2224. Red Hook: Curran Associates Inc., 2016. Nguyen, V. A., Kuhn, D., and Esfahani, P. M. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. ar Xiv preprint ar Xiv:1805.07194, 2018. Ren, Z. and Zhou, Z. Dynamic batch learning in highdimensional sparse linear contextual bandits. ar Xiv preprint ar Xiv:2008.11918, 2020. Sadeghi, F. and Levine, S. Cad2rl: Real single-image flight without a single real image. ar Xiv preprint ar Xiv:1611.04201, 2016. Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21 41, 2002. Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv preprint ar Xiv:1911.08731, 2019. Schulman, J., Ho, J., Lee, A. X., Awwal, I., Bradlow, H., and Abbeel, P. Finding locally optimal, collision-free trajectories with sequential convex optimization. In Robotics: science and systems, volume 9, pp. 1 10. Citeseer, 2013. Distributionally Robust Q-Learning Shafieezadeh-Abadeh, S., Esfahani, P., and Kuhn, D. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems 28, pp. 1576 1584. 2015. Shafieezadeh-Abadeh, S., Kuhn, D., and Esfahani, P. M. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1 68, 2019. Shapiro, A. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258 2275, 2017. Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227 244, 2000. Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributional robust batch contextual bandits. ar Xiv preprint ar Xiv:2006.05630, 2020a. Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning (ICML), 2020b. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140 1144, 2018. Sinha, A., Namkoong, H., and Duchi, J. 
Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. Staib, M. and Jegelka, S. Distributionally robust deep learning as a generalization of adversarial training. In NIPS workshop on Machine Learning and Computer Security, 2017. Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229 9248. PMLR, 2020. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Tram er, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and Mc Daniel, P. Ensemble adversarial training: Attacks and defenses. ar Xiv preprint ar Xiv:1705.07204, 2017. Volpi, R., Namkoong, H., Sener, O., Duchi, J., Murino, V., and Savarese, S. Generalizing to unseen domains via adversarial data augmentation. ar Xiv preprint ar Xiv:1805.12018, 2018. Wang, M. and Deng, W. Deep visual domain adaptation: A survey. Neurocomputing, 312:135 153, 2018. Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279 292, 1992. Wiesemann, W., Kuhn, D., and Rustem, B. Robust markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. Wolff, E. M., Topcu, U., and Murray, R. M. Robust control of uncertain markov decision processes with temporal logic specifications. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 3372 3379. IEEE, 2012. Xu, H. and Mannor, S. Distributionally robust markov decision processes. In Advances in Neural Information Processing Systems, pp. 2505 2513, 2010. Yang, I. Wasserstein distributionally robust stochastic control: A data-driven approach. IEEE Transactions on Automatic Control, 2020. Yang, W., Zhang, L., and Zhang, Z. Towards theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. ar Xiv preprint ar Xiv:2105.03863, 2021. Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472 7482. PMLR, 2019. Zhao, C. and Guan, Y. Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262 267, 2018. ISSN 0167-6377. doi: https://doi.org/10.1016/j.orl.2018.01.011. Zhao, C. and Jiang, R. Distributionally robust contingencyconstrained unit commitment. IEEE Transactions on Power Systems, 33(1):94 102, 2017. Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 3331 3339. PMLR, 13 15 Apr 2021. URL https://proceedings.mlr.press/ v130/zhou21d.html. Distributionally Robust Q-Learning A. Technical Lemmas First of all, we introduce two ancillary concentration inequalities. Lemma A.1 ((Fournier & Guillin, 2015), Concentration inequality for Wasserstein distance ). For µ P(R), we consider an i.i.d. sequence (Xk)k 1 of µ-distributed random variables and, for all n 1, the empirical measure Assume that there exists γ > 0 such that E2,γ(µ) := R R exp(γ|x|2)µ(dx) < . 
Then for all n 1, all x > 0, P (W(µn, µ) x) Ce cnx2, where the Wasserstein distance W(µn, µ) is defined by W(µn, µ) := inf π Π(µn,µ) Z |x y|π(dx, dy) , and the positive constant C and c depends only on γ and E2,γ(µ). Lemma A.2 (Hoeffding s inequality). Let X1, , Xn be independent random variables such that Xi [ai, bi] almost surely for all i = 1, 2, , n. Then for every t > 0, i=1 (Xi EXi) t 2 exp 2n2t2 Pn i=1(bi ai)2 B. Missing Proof of Theorem 3.7 Proof. Fix a pair (s, a) S A, we first prove b Rrob δ (s, a) defined in Equation (9) is an unbiased estimator of supα 0 n α log Eν0s,a e r(s,a)/α αδ o which forms the first part of our δ-distributionally robust Bellman Operator (see Equation (6)). Let g(α) = α log Eν0s,a h e r(s,a)/αi αδ. Denote α = argmaxα 0 g(α). Given 2N samples {ri ν0 s,a}, let bν2N = 1 2N P2N i=1 δri. Similarly, bν2N O (resp. bν2N E ) is defined by using the subset of 2N samples in which the index of every element is odd (resp. even). We use g2N to represent the function defined by replacing ν0 s,a by bν2N in g and α 2N = argmaxα 0 g2N (α). Similarly, we also have g2N O , g2N E , α 2N O and α 2N E . Now we have E h b Rrob δ (s, a) i = E r1 + r N,δ p N = E [r1] + E r N,δ p N = E [g20(α 20)] + n=0 E r N,δ p N = E [g20(α 20)] + n=0 E r n,δ = E [g20(α 20)] + g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) = E [g20(α 20)] + n=0 E [g2n+1(α 2n+1) g2n(α 2n)] = lim n E [g2n(α 2n)] . Distributionally Robust Q-Learning We only need to show lim n E [g2n(α 2n)] = g(α ). (13) Based on our Assumption 2.1, we can give an upper bound of α by observing 0 ess inf r(s, a) = α log Eν0s,a h e r(s,a)/β i α δ By a similar argument, α 2n Rmax/δ also holds. Besides, from above derivation, we can find 0 ess inf r(s, a) g(α ) Rmax. (14) Thus 0 g2n(α 2n) Rmax (15) holds by a similar argument. Starting from here, we split the proof into two cases: α = 0 and α > 0. α = 0. By proposition 2 in (Hu & Hong, 2013), α = 0 implies λ := ν0 s,a(r(s, a) = ess inf r(s, a)) > 0. We first introduce the following two events: G2n := { r i = ess inf r(s, a)} { r i ess inf r(s, a)}, Z2n := {α 2n = 0}. Note that we have the following result for G2n: P[Gc 2n] = 1 P[G2n] = 1 P[{ r i = ess inf r(s, a)}] = P[{ r i = ess inf r(s, a)}] By Lemma 4 in (Zhou et al., 2021), ϵ > 0, there exists a constant N(ϵ, δ), such that n N(ϵ, δ), P[Zc 2n] ϵ. Now we choose ϵ > 0 arbitrarily, when n N(ϵ), E [|g2n(α 2n) g(α )|] = E [|g2n(α 2n) g(α )|1G2n Z2n] + E |g2n(α 2n) g(α )|1Gc 2n Zc 2n = E |g2n(α 2n) g(α )|1Gc 2n Zc 2n Rmax(P[Gc 2n] + P[Zc 2n]) Rmax((1 λ)2n + ϵ), where the first inequality is by Equation (14) and Equation (15). Hence we know lim n E [|g2n(α 2n) g(α )|] ϵRmax, which implies lim n E [|g2n(α 2n) g(α )|] = 0. Using above result, Equation (13) holds immediately in this case. Distributionally Robust Q-Learning α > 0. We define the following event in this case, By Lemma 4 in (Zhou et al., 2021), ϵ > 0, there exists a constant N (ϵ), such that once n N (ϵ), we have P[NZc 2n] ϵ. 
Now we choose ϵ > 0 arbitrarily, when n N (ϵ) E [|g2n(α 2n) g(α )|] E [|g2n(α 2n) g(α )|1NZ2n] + E |g2n(α 2n) g(α )|1NZc 2n δ ] |g2n(α) g(α)|1NZ2n Observe that Eν0s,a e r(s,a)/α e Rmax/α and Ebν2n e r(s,a)/α e Rmax/α hold, combining the Lipschitz property of log x when x is bounded below, we know |g2n(α) g(α)| α = log Eν0s,a h e r(s,a)/αi log Ebν2n h e r(s,a)/αi Eν0s,a e r(s,a)/α Ebν2n e r(s,a)/α Then we have δ ] |g2n(α) g(α)|1NZ2n sup α [ α δ ] |g2n(α) g(α)| = sup α [ α |g2n(α) g(α)| Eν0s,a h e r(s,a)/αi Ebν2n h e r(s,a)/αi Rmaxe2Rmax/α For any α h α δ i , the function e x/α is a Lipschitz function on [0, ), and the Lipschitz constant is bounded by 2/α . Hence, by the dual representation of Wasserstein distance, we have δ ] |g2n(α) g(α)|1NZ2n 2Rmaxe2Rmax/α δα W(ν0 s,a, bν2n), (16) where the Wasserstein distance W(ν0 s,a, bν2n) is defined by W(ν0 s,a, bν2n) := inf π Π(ν0s,a,bν2n) Z |x y|dπ(x, y) . Now we know E [|g2n(α 2n) g(α )|] 2Rmaxe2Rmax/α δα E W(ν0 s,a, bν2n) + Rmaxϵ. By Lemma A.1 (note that we assume r is bounded, so the condition for Lemma A.1 is satisfied automatically), there exists c, C > 0 such that P[W(ν0 s,a, bν2n) t] Ce 2nct2. We have E W(ν0 s,a, bν2n) = Z 0 P[W(ν0 s,a, bν2n) t]dt 0 P[W(ν0 s,a, bν2n) t]dt 0 Ce 2nct2dt Distributionally Robust Q-Learning This implies lim n E [|g2n(α 2n) g(α )|] Rmaxϵ. Note that we choose ϵ > 0 arbitrarily, so there is lim n E [|g2n(α 2n) g(α )|] = 0 Hence we know Equation (13) holds in this case. Now we can conclude that b Rrob δ (s, a) is an unbiased estimator of supα 0 n α log Eν0s,a e r(s,a)/α αδ o . From here we give the proof that b T rob δ (Q)(s, a) defined in Equation (10) is an unbiased estimator of supβ 0 n β log Ep0s,a h e maxb A Q(s ,b)/βi βδ o which constructs the remaining part of our δ-distributionally robust Bellman Operator. Given Q RS A and (s, a) S A. We define V (s ) = maxb A Q(s , b) for any s S and S = argmin s S,p0s,a(s )>0 V (s ). Let f(β) = β log Ep0s,a h e V (s )/βi βδ. Denote β = argmaxβ 0 f(β). Given 2N samples {s i p0 s,a}, let bp2N (s) = P2N i=1 1[s i=s] 2N . Similarly, bp2N O (resp. bp2N E ) is defined by using the subset of 2N samples in which the index of every element is odd (resp. even). We use f2N to represent the function defined by replacing p0 s,a by bp2N in f and β 2N = argmaxβ 0 f2N (β). Similarly, we also have f2N O , f2N E , β 2N O and β 2N E . Following the similar proof for g used in the previous part, we will have E h b T rob δ (Q)(s, a) i = lim n E [f2n(β 2n)] . Hence we only need to show lim n E [f2n(β 2n)] = f(β ). (17) We can observe that both β and β 2n are no bigger than 2 Q /δ. For β , this is because min s S p0 s,a(s )>0 V (s ) =f(0) = β log Ep0s,a h e V (s )/β i β δ max s S p0 s,a(s )>0 max s S p0 s,a(s )>0 V (s ) min s S p0 s,a(s )>0 If we apply the similar argument to β 2n, we can get the same result. Besides, from the above derivation, we can see Q min s S,p0s,a(s )>0 V (s ) f(β ) max s S,p0s,a(s )>0 V (s ) Q . (18) Besides, Q f2n(β 2n) Q (19) Distributionally Robust Q-Learning also holds with a similar reason. Now we define the following two key events, E2n := p0 s,a(s i) > 0, 1 i 2n , s S bp2n(s ) 1 s S p0 s,a(s ) Note the facts P [En 2 ] = 1 and P [T c 2n] = P s S bp2n(s ) < 1 s S p0 s,a(s ) s S p0 s,a(s ) X s S bp2n(s ) > 1 s S p0 s,a(s ) s S p0 s,a(s ) X 1[s i = s ] s S p0 s,a(s ) s S p0 s,a(s ) 1 s S 1[s i = s ] > 1 s S p0 s,a(s ) s S p0 s,a(s )) 2 , where the last inequality is right because of Lemma A.2. 
With the above proposition, we can start to show that Equation (17) is true, we consider the following decomposition E [|f2n(β 2n) f(β )|] = E [|f2n(β 2n) f(β )|1E2n] = E [|f2n(β 2n) f(β )|1E2n T2n] + E |f2n(β 2n) f(β )|1E2n T c 2n E [|f2n(β 2n) f(β )|1T2n E2n] + 2 Q P [T c 2n E2n] , where the last inequality holds due to Equation (18) and Equation (19). Now denote mins S,p0s,a(s )>0 V (s ) as v. Note that we have shown 0 β , β 2n 2 Q /δ. Observe that under T2n and E2n, β [0, 2 Q /δ], there are Ep0s,a h e( V (s )+v)/βi = X s S p0 s,a(s )e( V (s )+v)/β X s S p0 s,a(s )e( V (s )+v)/β s S p0 s,a(s ) 1 s S p0 s,a(s ) Ebp2n h e( V (s )+v)/βi = X s S bp2n(s )e( V (s )+v)/β X s S bp2n(s )e( V (s )+v)/β s S bp2n(s ) 1 s; S p0 s,a(s ). Then under T2n and E2n, we know |f2n(β 2n) f(β )| sup δ ] β log Ep0 s,a h e V (s )/βi log Ebp2n h e V (s )/βi log Ep0 s,a h e( V (s )+v)/βi log Ebp2n h e( V (s )+v)/βi Ep0 s,a h e( V (s )+v)/βi Ebp2n h e( V (s )+v)/βi 1 2 P s S p0s,a(s ) s S p0s,a(s ) p0 s,a bp2n 1. Distributionally Robust Q-Learning The third inequality is right is by the Lipschitz property for log x when x is bounded from below. Finally we have E [|f2n(β 2n) f(β )|] E [|f2n(β 2n) f(β )|1T2n E2n] + 2 Q P [T c 2n E2n] s S p0s,a(s )E p0 s,a bp2n 1 + Q e 2n 1( P s S p0 s,a(s )) 2 . E p0 s,a bp2n 1 = Z 0 P p0 s,a bp2n 1 t dt 0 P p0 s,a bp2n 1 t dt s S P |p0 s,a(s ) bp2n(s )| t |S| 0 2e 2n+1t2/|S|2dt where the second inequality is by Lemma A.2. Hence we have E [|f2n(β 2n) f(β )|] 4 Q |S|2 π P s S p0s,a(s ) 2 n+3 2 + 2 Q e 2n 1( P s S p0 s,a(s )) 2 . This is enough to conclude lim n E [|f n 2 (β 2n) f(β )|] = 0. (21) Note that Equation (21) implies that Equation (17) is true. Now we complete our partial proof, i.e., b T rob δ (Q)(s, a) is an unbiased estimator of supβ 0 n β log Ep0s,a h e maxb A Q(s ,b)/βi βδ o . Finally we complete the proof that b T rob δ,ε (Q)(s, a) := b Rrob δ (s, a)+γ b T rob δ (Q)(s, a) is an unbiased estimator of T rob δ (Q)(s, a) for any Q RS A and (s, a) S A. C. Missing Proof of Theorem 3.8 Proof. Following the same notations defined in the proof of Theorem 3.7 (cf. Appendix B), we have already known for any ε (0, 0.5), Q RS A and (s, a) S A, E[b T rob δ,ε (Q)(s, a)] = E[ b Rrob δ (s, a) + γ b T rob δ (Q)(s, a)] n α log Eν0s,a h e r(s,a)/αi αδ o + γ sup β 0 n β log Ep0s,a h e maxb A Q(s ,b)/βi βδ o = g(α ) + γf(β ). By the definition of variance we have Var[b T rob δ,ε (Q)(s, a)] 2Var[ b Rrob δ (s, a)] + 2γ2Var[ b T rob δ (Q)(s, a)]. Distributionally Robust Q-Learning We first analysis the term Var[ b Rrob δ (s, a)] by noticing Var[ b Rrob δ (s, a)] = E[( b Rrob δ (s, a) g(α ))2] " r1 + r N,δ p N 2E[(r1)2] + 2E " r N,δ p N 2R2 max + 2 E[( r n,δ)2] pn . Now let us bound the term E[( r n,δ)2]. Like the proof of Theorem 3.7, we also consider following two cases α = 0. By Proposition 2 in (Hu & Hong, 2013), α = 0 if and only if λ := ν0 s,a(r(s, a) = ess inf r(s, a)) > 0 and λ e δ. Since δ is chosen by us, we can ignore the edge case λ = e δ by introducing randomness on δ. Now we define the following three events: C2n O := {bν2n O(ess inf r(s, a)) > e δ}, C2n E := {bν2n E(ess inf r(s, a)) > e δ}, C2n := {bν2n(ess inf r(s, a)) > e δ}. When the event C2n O G2n O (recall G2n O := { r i = ess inf r(s, a), i {1, 3, , 2n 1}} { r i ess inf r(s, a), i {1, 3, , 2n 1}}) holds, again, Proposition 2 in (Hu & Hong, 2013) implies that we have α 2n O = 0 and g2n O(α 2n O) = ess inf r(s, a). The same argument can be applied to the subscript 2n E and 2n. 
Besides, note that C2n O C2n E C2n and G2n O G2n E G2n, which implies C2n O C2n E G2n O G2n E C2n G2n. Then we have the following result E[( r n,δ)2] = E g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1C2n+1 O C2n+1 E G2n+1 O G2n+1 E g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1(C2n+1 O C2n+1 E G2n+1 O G2n+1 E )c g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1(C2n+1 O C2n+1 E G2n+1 O G2n+1 E )c 2R2 max(P[Cc 2n] + P[Gc 2n]). Note the following two bounds P[Cc 2n] e 2n+1(λ e δ)2, P[Gc 2n] (1 λ)2n. The first bound is due to Lemma A.2 again. Finally we have E[( r n,δ)2] 2R2 max(e 2n+1(λ e δ)2 + (1 λ)2n). Distributionally Robust Q-Learning Then if we choose pn = ε(1 ε)n, we know Var[ b Rrob δ (s, a)] 2R2 max + 4R2 max e 2n+1(λ e δ)2 + (1 λ)2n pn = K 1(λ, δ, ε, ν0) < . Note that λ depends on (s, a). However, the number of pair (s, a) is finite, hence we can find a uniform constant bound K1(δ, ε, ν0) K 1(λ, δ, ε, ν0) for any (s, a) S A which makes α = 0. α > 0. In this case, we follow the way proposed by (Zhou et al., 2021). Define τ := min n α log Eν0s,a h e X/αi + αδ, α log Eν0s,a h e X/ αi + αδ o α log Eν0s,a h e X/α i + α δ , where α = α /2 and α = Rmax/δ. Besides, we introduce the following event F2n(α) := Ebν2n h e r(s,a)/αi Eν0s,a h e r(s,a)/αi < τ 2αe Rmax/α 1 . Note that by Lemma 4 in (Zhou et al., 2021), under F2n(α) F2n(α) F2n(α ), we have α 2n [α, α]. Let F2n = F2n(α) F2n(α) F2n(α ). Note that F2n O F2n E F2n, thus we have E[( r n,δ)2] = E g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1F2n+1 O F2n+1 E g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1(F2n+1 O F2n+1 E )c g2n+1(α 2n+1) g2n+1 O (α 2n+1 O ) + g2n+1 E (α 2n+1 E ) 1F2n+1 O F2n+1 E + 2R2 max P[F c 2n] 2E h (g2n+1(α 2n+1) g(α ))2 1F2n+1 i + 2E h (g2n(α 2n) g(α ))2 1F2n i + 2R2 max P[F c 2n] 16R2 maxe4Rmax/α (δα )2 E[W2(ν0 s,a, bν2n+1)] + 16R2 maxe4Rmax/α (δα )2 E[W2(ν0 s,a, bν2n)] + 2R2 max P[F c 2n], where the last inequality holds by a similar argument for Equation (16). Using a similar calculation used in the proof of Theorem 3.7, we can find E[W2(ν0 s,a, bν2n+1)] = O(2 n), E[W2(ν0 s,a, bν2n)] = O(2 n). Besides, by Lemma 4 in (Zhou et al., 2021), we know P[F c 2n] = O(e 2n), Besides, note pn = ε(1 ε)n for ε (0, 0.5). Then we know E[( b Rrob δ (s, a) g(α ))2] 16R2 maxe4Rmax/α (δα )2 E[W2(ν0 s,a, bν2n+1)] pn + 16R2 maxe4Rmax/α (δα )2 E[W2(ν0 s,a, bν2n)] pn + 2R2 max P[F c 2n] pn =K 2(α , δ, ε, ν0) Since we are in the tabular setting ,we can find a constant K2(δ, ε, ν0) > K 2(α , δ, ε, ν0) for any (s, a) which makes α > 0. Distributionally Robust Q-Learning By the previous two cases, we know E[( b Rrob δ (s, a) g(α ))2] max(K1(δ, ε, ν0), K2(δ, ε, ν0)). Now we start to analysis Var[ b T rob δ (Q)(s, a)] = E[( b T rob δ (Q)(s, a) f(β ))2]. 
By the similar approach, we can find E[( b T rob δ (Q)(s, a) f(β ))2] = E[(max b A Q(s 1, b) + q N,δ p N f(β ))2] E[(max b A Q(s 1, b) + q N,δ p N )2] 2E[(max b A Q(s 1, b))2] + 2E E[( q n,δ)2] Use the same notations we defined in the proof of Theorem 3.7, we can find E[( q n,δ)2] f2n+1(β 2n+1) f2n+1 O (β 2n+1 O ) + f2n+1 E (β 2n+1 E ) f2n+1(β 2n+1) f2n+1 O (β 2n+1 O ) + f2n+1 E (β 2n+1 E ) 1E2n+1 O E2n+1 E T2n+1 O T2n+1 E f2n+1(β 2n+1) f2n+1 O (β 2n+1 O ) + f2n+1 E (β 2n+1 E ) 1(E2n+1 O E2n+1 E T2n+1 O T2n+1 E )c f2n+1(β 2n+1) f2n+1 O (β 2n+1 O ) + f2n+1 E (β 2n+1 E ) 1E2n+1 O E2n+1 E T2n+1 O T2n+1 E + 8 Q 2 (P[Ec 2n] + P[T c 2n]) 2E h (f2n+1(β 2n+1) f(β ))2 1E2n+1 T2n+1 i + 2E h (f2n(β 2n) f(β ))2 1E2n T2n i + 8 Q 2 (P[Ec 2n] + P[T c 2n]) s S p0s,a(s ))2 E[ p0 s,a bp2n+1 2 1] + 32 Q 2 (P s S p0s,a(s ))2 E[ p0 s,a bp2n 2 1] + 8 Q 2 (P[Ec 2n] + P[T c 2n]), where the last inequality follows a similar argument like Equation (20). By a similar calculation in the proof of Theorem 3.7, we will have E[ p0 s,a bp2n+1 2 1] = O(2 n) and E[ p0 s,a bp2n 2 1] = O(2 n). We have already known P[T c 2n] = O(e 2n) and P[Ec 2n] = 0 from the proof of Theorem 3.7. Note that pn = ε(1 ε)n for ε (0, 0.5), we have E[( q n,δ)2] pn = Q 2 K 3(s, a, ε, δ, P0) < . Now we can find a uniform bound K3(δ, ε, P0) > 0 such that E[( b T rob δ (Q)(s, a) f(β ))2] 2 Q 2 + 2 E[( q n,δ)2] (2 + 2K 3(s, a, ε, δ, P0)) Q 2 K3(δ, ε, P0)(1 + Q 2 ), Thues we can see Var[b T rob δ,ε (Q)(s, a)] 2 max(K1(δ, ε, ν0), K2(δ, ε, ν0)) + 2γ2K3(δ, ε, P0)(1 + Q 2 ) C(δ, ε, ν0, P0)(1 + Q 2 ), where C(δ, ε, ν0, P0) > 0 is some uniform constant. Hence we complete the proof.
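As a companion to the analysis in Appendices B and C, the following is a small, self-contained Python sketch (our own illustrative code, not from the paper) of the multi-level Monte-Carlo estimator of the KL dual value $\sup_{\alpha \ge 0}\{-\alpha \log \mathbb{E}[e^{-X/\alpha}] - \alpha\delta\}$ for the reward term. Averaging many independent copies should approximately recover the dual value, consistent with the unbiasedness claim of Theorem 3.7; all function names and the simple reward distribution below are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_value(x, delta, alpha_max=50.0):
    """Empirical dual: sup_{alpha >= 0} { -alpha*log(mean(exp(-x/alpha))) - alpha*delta }.
    The alpha -> 0 limit equals min(x), which we include explicitly."""
    def neg_g(alpha):
        z = -x / alpha
        log_mean_exp = np.log(np.mean(np.exp(z - z.max()))) + z.max()
        return alpha * log_mean_exp + alpha * delta   # = -g(alpha)
    res = minimize_scalar(neg_g, bounds=(1e-6, alpha_max), method="bounded")
    return max(-res.fun, float(x.min()))

def mlmc_estimate(sampler, delta, eps=0.49, rng=None):
    """One multi-level Monte-Carlo estimate: draw N with P(N = n) = eps*(1-eps)^n,
    take 2^{N+1} samples, and combine full/odd/even empirical dual values
    with the first sample as the baseline level, as in Definition 3.3."""
    rng = np.random.default_rng() if rng is None else rng
    N = int(rng.geometric(eps)) - 1                  # P(N = n) = eps*(1-eps)^n, n >= 0
    x = sampler(2 ** (N + 1), rng)
    delta_N = (dual_value(x, delta)
               - 0.5 * dual_value(x[0::2], delta)    # odd-indexed samples (1-based)
               - 0.5 * dual_value(x[1::2], delta))   # even-indexed samples (1-based)
    p_N = eps * (1 - eps) ** N
    return x[0] + delta_N / p_N

# Sanity check: rewards r ~ Uniform[0, 1], delta = 0.1, eps = 0.49.
rng = np.random.default_rng(1)
sampler = lambda m, rng: rng.uniform(0.0, 1.0, size=m)
estimates = [mlmc_estimate(sampler, delta=0.1, eps=0.49, rng=rng) for _ in range(5000)]
reference = dual_value(sampler(200000, rng), delta=0.1)  # large-sample plug-in reference
print(np.mean(estimates), reference)
```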