Bayesian Risk-Averse Q-Learning with Streaming Observations

Yuhao Wang
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332
yuhaowang@gatech.edu

Enlu Zhou
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332
enlu.zhou@isye.gatech.edu

We consider a robust reinforcement learning problem, where a learning agent learns from a simulated training environment. To account for the model misspecification between this training environment and the real environment due to lack of data, we adopt a formulation of the Bayesian risk MDP (BRMDP) with infinite horizon, which uses the Bayesian posterior to estimate the transition model and imposes a risk measure to account for the model uncertainty. Observations from the real environment, which are outside the agent's control, arrive periodically and are utilized by the agent to update the Bayesian posterior and reduce the model uncertainty. We theoretically demonstrate that BRMDP balances the trade-off between robustness and conservativeness, and we further develop a multi-stage Bayesian risk-averse Q-learning algorithm to solve BRMDP with streaming observations from the real environment. The proposed algorithm learns a risk-averse yet optimal policy that depends on the availability of real-world observations. We provide a theoretical guarantee of strong convergence for the proposed algorithm.

1 Introduction

The Markov Decision Process (MDP) is widely applied in the Reinforcement Learning (RL) community to model sequential decision-making, where the underlying transition process is a Markov process under a given policy. The Q-learning algorithm, first proposed in [28], learns the optimal Q-function of the infinite-horizon MDP. The Q-function is a function of the state-action pair $(s,a)$ that denotes the expected cumulative reward when action $a$ is taken at the initial state $s$ and optimal actions are taken at subsequent stages. In a Q-learning algorithm, the decision maker updates the value of the Q-function by repeatedly interacting with the training environment and observing the reward and transition at each step. The training environment is often an estimate of the real environment (where the policy is actually implemented) obtained from past observed data.

Implementing the policy learned from the training environment directly in the real environment can be risky because of model misspecification, which is caused by the lack of historical data and possible environment shift. For example, in an inventory management problem where the demand follows some unknown distribution, we may have only limited observations of past demand realizations, or only partially observed data if unfulfilled (lost) orders are not recorded. One common approach is to consider a distributionally robust formulation, where one assumes the underlying unknown distribution belongs to some ambiguity set. This ambiguity set is usually chosen as a neighborhood of a reference distribution estimated from data. For example, [23] considered distributionally robust policy learning for contextual bandit problems, and [32, 17, 30] studied distributionally robust policy learning in the discounted RL setting.
In particular, [14] and [15] considered Q-learning methods for distributionally robust RL with a discounted reward, where [14] constructed the ambiguity set using the Kullback-Leibler (KL) divergence whereas [15] used the Wasserstein distance. Both works designed the Q-learning algorithm by reformulating the robust Bellman equation in its dual form, thus transforming the optimization over distributions into an optimization over a scalar. While the distributionally robust formulation accounts for model misspecification and provides a robust policy with relatively good performance over all possible distributions in the ambiguity set, it is also argued to be conservative as a price for robustness, especially when the worst-case scenario is unlikely to happen in reality. This motivates us to adopt a more flexible risk quantification criterion instead of considering only the worst-case performance. [31, 29] proposed the Bayesian risk optimization (BRO) framework, which adopts the Bayesian posterior distribution, as opposed to an ambiguity set, to quantify distributional parameter uncertainty. With this quantification, a risk measure, which represents the user's attitude towards risk, is imposed on the objective with respect to the posterior distribution. In this paper, we take this Bayesian risk perspective and formulate the infinite-horizon Bayesian risk MDP (BRMDP) with discounted reward.

The BRMDP was first proposed in [13], where the unknown parameter of the MDP is modeled via a Bayesian posterior distribution. The posterior distribution is absorbed as an augmented state, together with the original physical state, and is updated at each stage through the realization of the reward and transition at that stage. A risk measure is imposed on the future reward at each stage with respect to the posterior distribution at that stage. Before [13], Bayesian techniques in MDPs were considered in [8] to deal with parameter uncertainty, where the expectation of the total reward is simply taken with respect to the posterior distribution; this MDP model is referred to as the Bayes-adaptive MDP (BAMDP). Based on BAMDP, there is a stream of work on model-based Bayesian RL (e.g., [25, 7, 27, 18, 3, 20, 4, 16]). Other model-free Bayesian RL methods include Gaussian process temporal difference learning [9], Bayesian policy gradient methods [11], and Bayesian actor-critic methods [26]. We refer to [12] for a comprehensive overview of Bayesian RL. More recently, [22] and [19] considered risk-sensitive BAMDP by imposing the risk measure on the total reward. In addition, [5] considered a percentile criterion and formulated a second-order cone program by assuming a Gaussian random reward. All three works mentioned above took a non-nested formulation, i.e., only one risk measure is imposed on the total reward. As pointed out in [13], one benefit of the nested formulation (i.e., a risk measure is imposed at each stage) is that the resulting optimal policy is time consistent, meaning that the policy solved at the initial stage remains optimal at any later stage. In contrast, the non-nested formulation does not have this property (see [21]). In this paper, we consider the nested formulation as in [13]. Notably, [13], as well as most works on BAMDP, considered the planning problem over a finite horizon.
The problem can be solved efficiently only when the horizon is short, since the number of augmented states, i.e., possible posterior distributions, increases exponentially as the time horizon increases. In this paper, we formulate the infinite-horizon BRMDP whose state space contains only the physical states. As a result, this enables us to find a stationary (time-invariant) optimal policy. We develop a Q-learning algorithm to learn this stationary optimal policy, which is quite different from the dynamic programming approach taken by [13].

On a related note, risk measures are also widely applied in the literature on safe RL (see [10] for a recent survey). In safe RL, risk measures are used to ensure that the learned policy not only performs well but also avoids dangerous states or actions that can incur large one-stage costs. There, risk measures are typically applied to deal with the intrinsic uncertainty of the MDP, taking the transition probability or reward distribution as known. In contrast, our work uses risk measures to account for parametric uncertainty in the transition probability, which, if ignored, can prevent the agent from finding the optimal policy. Moreover, in safe RL the risk measure is usually imposed on some cost function to satisfy a safety constraint, while we impose it on the reward objective. Although both areas use risk measures, they have different goals and frameworks.

To summarize, all the aforementioned work on robust RL has focused on offline learning, where the agent does not interact with the real environment and only has access to data from the training environment. In contrast, our work utilizes real-world data to update the Bayesian posterior during the offline learning process and thereby reduce the parametric uncertainty. We also differ from Bayesian online learning, where the Bayesian posterior is updated periodically and a new optimal policy is then re-solved (regardless of computational cost) and deployed. Our approach is off-policy learning: the learning agent receives streaming real-world observations generated by some behavior policy, and we focus on designing a Q-learning algorithm for the varying BRMDP that possesses statistical validity, rather than simply assuming optimality is attained after each posterior update.

2 Preliminary and Problem Statement

2.1 Standard Reinforcement Learning

Consider an RL environment $M^c = (S, A, P^c, r, \gamma)$, where $S$ is a finite state space, $A$ is a finite action space, $r: S \times A \times S \to \mathbb{R}$ is the reward function bounded by $\bar R = \max_{s,a,s'} |r(s,a,s')|$, $P^c = \{p^c_{s,a}(\cdot)\}_{(s,a)\in S\times A}$ are the transition probabilities, and $\gamma$ is the discount factor. Standard reinforcement learning aims to find a (deterministic) optimal policy $\pi: S \to A$ satisfying
$$V^{\pi}(s) = \mathbb{E}_{s'\sim p^c_{s,\pi(s)}}\big[r(s,\pi(s),s') + \gamma V^{\pi}(s')\big] = \sup_{a\in A}\Big\{\mathbb{E}_{s'\sim p^c_{s,a}}\big[r(s,a,s') + \gamma V^{\pi}(s')\big]\Big\}.$$

2.2 Bayesian Risk Markov Decision Process (BRMDP)

The true transition probabilities $P^c$ are usually unknown in real problems, and we need to estimate them using observations from the real world. However, as discussed before, model misspecification due to lack of data can impair the performance of the learned policy when it is deployed in the real environment. Hence, we take a Bayesian approach to estimate the environment and impose a risk measure on the objective with respect to the Bayesian posterior to account for this model (parametric) uncertainty.
With finite state and action spaces, we can impose a Dirichlet conjugate prior $\phi_{s,a}$ on the transition model of each state-action pair $(s,a)$ and update the Bayesian posterior once we observe a transition from state $s$ to $s'$ under action $a$. We defer the details of updating the Dirichlet posterior on $(s,a)$ to the supplementary material. The Bayesian posterior provides a quantification of uncertainty about the transition model, which a risk-sensitive decision maker seeks to address by making robust decisions under the current model estimate. This can be done by imposing a risk measure on the future reward at each stage, which maps a random variable to a real value and reflects the decision maker's attitude toward risk.

Given a Dirichlet posterior $\phi = \{\phi_{s,a}\}_{s,a}$ and a risk measure $\rho_\xi(f(\xi))$ that maps a function $f$ of the random vector $\xi$ to a real value, the value function of the BRMDP under a policy $\pi$ is defined as
$$V^{\phi,\pi}(s_0) = \mathbb{E}_{d_0\sim\pi(s_0)}\Big\{\rho_{p_1\sim\phi_{s_0,d_0}}\Big(\mathbb{E}_{s_1\sim p_1}\Big[r(s_0,d_0,s_1)+\gamma\,\mathbb{E}_{d_1\sim\pi(s_1)}\Big\{\rho_{p_2\sim\phi_{s_1,d_1}}\Big(\mathbb{E}_{s_2\sim p_2}\Big[r(s_1,d_1,s_2)+\gamma\,\mathbb{E}_{d_2\sim\pi(s_2)}\big\{\rho_{p_3\sim\phi_{s_2,d_2}}\big(\mathbb{E}_{s_3\sim p_3}[r(s_2,d_2,s_3)+\cdots]\big)\big\}\Big]\Big)\Big\}\Big]\Big)\Big\}. \quad (1)$$

We assume the risk measure $\rho$ satisfies the following assumption.

Assumption 2.3. Let $\xi$ denote a random vector taking values in $\mathbb{R}^n$ with distribution $P \in \mathcal{P}(\mathbb{R}^n)$, and let $f_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, 2$, denote two measurable functions. The risk measure satisfies the following conditions:
1. $\rho_\xi(\gamma f_1(\xi)) = \gamma\,\rho_\xi(f_1(\xi))$ for all $\gamma \ge 0$.
2. $\rho_\xi(f_1(\xi)) \le \rho_\xi(f_2(\xi))$ if $f_1(\xi) \le f_2(\xi)$ almost surely.
3. $\rho_\xi(f_1(\xi) + C) = \rho_\xi(f_1(\xi)) + C$ for every constant $C$.

Risk measures satisfying Assumption 2.3 are similar to coherent risk measures (see [1]), except that sub-additivity is not required. We relax the assumption of coherence to include the risk measure VaR, which is not sub-additive. Notice that (1) takes a nested formulation, where the risk measure is imposed on the future reward at each stage, as opposed to the non-nested formulation where the risk measure is imposed only once on the total reward, i.e., on the value function of the classic MDP. One advantage of the nested formulation is that (1) can be expressed in a recursive form:
$$V^{\phi,\pi}(s) = \mathbb{E}_{a\sim\pi(s)}\big\{\rho_{p\sim\phi_{s,a}}\big(\mathbb{E}_{s'\sim p}[r(s,a,s') + \gamma V^{\phi,\pi}(s')]\big)\big\}.$$
This enables us to define the Bellman operator and find the time-invariant optimal policy. Let $\pi^*$ denote the optimal policy and $V^{\phi,*} := V^{\phi,\pi^*}$, which satisfies $V^{\phi,*}(s_0) = \sup_{\pi\in\Pi} V^{\phi,\pi}(s_0)$, where $\Pi$ contains all randomized (and deterministic) policies. Let $L^\phi$ be the Bellman operator such that
$$L^\phi V(s) = \max_{a\in A}\rho_{p\sim\phi_{s,a}}\big(\mathbb{E}_{s'\sim p}[r(s,a,s') + \gamma V(s')]\big).$$
The following theorem ensures that $L^\phi$ is a contraction mapping and that the BRMDP admits an optimal value function, which is the unique fixed point of $L^\phi$. Its proof is in the supplementary material.

Theorem 2.4. BRMDPs possess the following properties:
1. $L^\phi$ is a contraction mapping with $\|L^\phi V - L^\phi U\|_\infty \le \gamma\|V - U\|_\infty$, where $\|\cdot\|_\infty$ is the sup norm in $\mathbb{R}^{|S|}$.
2. There exists a unique $V^{\phi,*}$ such that $L^\phi V^{\phi,*} = V^{\phi,*}$. Moreover, $V^{\phi,*}(s_0) = \sup_{\pi\in\Pi} V^{\phi,\pi}(s_0)$, i.e., the fixed point of $L^\phi$ is the optimal value function of the BRMDP.

2.5 BRMDP with VaR and CVaR

Among the risk measures that satisfy Assumption 2.3, we adopt two of the most commonly used, VaR and CVaR. For a random variable $X$ properly defined on some probability space, $\mathrm{VaR}_\alpha$ is the $\alpha$-quantile of $X$, $\mathrm{VaR}_\alpha(X) = \inf\{z \mid P(X \le z) \ge \alpha\}$, and $\mathrm{CVaR}_\alpha$ is normally defined as the mean of the $\alpha$-tail distribution of $X$. However, since we impose the risk measure on the reward (regarded as the negative of a loss) rather than on the loss, we define $\mathrm{CVaR}_\alpha(X) = \frac{1}{\alpha}\int_{-\infty}^{\mathrm{VaR}_\alpha(X)} x\, P(dx)$, which is the conditional expectation of $X$ given $X \le \mathrm{VaR}_\alpha(X)$.
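For concreteness, the following is a minimal Python sketch of the two ingredients introduced so far: the Dirichlet conjugate update of Section 2.2 and the sample-based VaR/CVaR of Section 2.5 under the reward-side (lower-tail) convention. The function names and the dictionary representation of the posterior are our own illustration, not the paper's implementation.

```python
import numpy as np

def dirichlet_update(phi, transitions):
    """Conjugate Dirichlet update (Section 2.2): add one pseudo-count per observed transition.

    phi: dict mapping (s, a) -> np.ndarray of Dirichlet parameters, one entry per next state.
    transitions: iterable of observed (s, a, s_next) triples.
    """
    posterior = {key: val.copy() for key, val in phi.items()}
    for s, a, s_next in transitions:
        posterior[(s, a)][s_next] += 1.0
    return posterior

def var_cvar(sample, alpha):
    """Empirical VaR_alpha and CVaR_alpha of a sample of rewards (lower-tail convention)."""
    x = np.sort(np.asarray(sample, dtype=float))
    k = int(np.ceil(alpha * len(x)))      # index of the empirical alpha-quantile
    return x[k - 1], float(x[:k].mean())
```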
Before discussing how to solve the BRMDP, a natural question is how this risk-averse formulation relates to the original risk-neutral objective. Let $V^{c,\pi}$ denote the value function of policy $\pi$ under the true MDP $M^c$. The difference between $V^{\phi,\pi}$ and $V^{c,\pi}$ is bounded by the following theorem.

Theorem 2.6. Suppose the Bayesian prior is $\phi^0_{s,a}(s') = 1$ for all $s, a, s'$ and $\rho$ is either $\mathrm{VaR}_\alpha$ or $\mathrm{CVaR}_\alpha$. Let $\pi$ be a deterministic policy, let $O_{s,a}$ be the number of observed transitions from state $s$ under action $a$ in the dataset, and let $O = \min_{s\in S} O_{s,\pi(s)}$. Assuming $O_{s,a} > 0$ for all $(s,a)$, with probability at least $1 - \alpha - O^{-1/3}$,
$$\big\|V^{\phi,\pi} - V^{c,\pi}\big\|_\infty \le O^{-1/3}\,\frac{1}{\alpha}\,\frac{5|S|\bar R}{(1-\gamma)^2}.$$
In particular, let $O = \min_{s\in S, a\in A} O_{s,a}$; then with probability at least $1 - \alpha - O^{-1/3}$,
$$\big\|V^{\phi,*} - V^{c,*}\big\|_\infty \le O^{-1/3}\,\frac{1}{\alpha}\,\frac{5|S|\bar R}{(1-\gamma)^2},$$
where $V^{c,*}$ is the optimal value function for $M^c$.

Theorem 2.6 implies that our reformulated BRMDP coincides with the risk-neutral MDP as more observations become available. In particular, the discrepancy is characterized in terms of the minimal number of observations $O$ over state-action pairs and decays at the rate $O^{-1/3}$. This indicates that the BRMDP automatically balances the trade-off between robustness and conservativeness as the number of observations varies. We defer the proof to the supplementary material due to the space limit.

2.7 Q-Learning for BRMDP

Section 2.2 ensures that the BRMDP is well formulated, that is, it admits an optimal value function which can be found as the fixed point of its Bellman operator $L^\phi$. This allows us to derive a Q-learning algorithm to learn the optimal policy and value function of the BRMDP. Recall that the optimal Q-function is defined as
$$Q^{\phi,*}(s,a) = \rho_{p\sim\phi_{s,a}}\big(\mathbb{E}_{s'\sim p}[r(s,a,s') + \gamma V^{\phi,*}(s')]\big) = \rho_{p\sim\phi_{s,a}}\big(\mathbb{E}_{s'\sim p}[r(s,a,s') + \gamma \max_{b\in A} Q^{\phi,*}(s',b)]\big).$$
Let $T^\phi$ denote the Bellman operator for the Q-function, that is,
$$T^\phi Q(s,a) = \rho_{p\sim\phi_{s,a}}\big(\mathbb{E}_{s'\sim p}[r(s,a,s') + \gamma \max_{b\in A} Q(s',b)]\big).$$
The optimal Q-function then satisfies $Q^{\phi,*}(s,a) = (1-\lambda)Q^{\phi,*}(s,a) + \lambda T^\phi Q^{\phi,*}(s,a)$ for any $\lambda\in(0,1)$. In a Q-learning algorithm, given a sequence of learning rates $\{\lambda_t\}_t$ and an estimator $\widehat T^\phi$ of the Bellman operator, we have the Q-learning update rule
$$Q_{t+1}(s,a) = (1-\lambda_t)Q_t(s,a) + \lambda_t \widehat T^\phi Q_t(s,a).$$
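As a small illustration of the update rule above, the sketch below performs one synchronous sweep over all state-action pairs, with the Bellman-operator estimator abstracted as a callable; the function name and array layout are our own assumptions, not from the paper.

```python
from typing import Callable

import numpy as np

def q_learning_step(Q: np.ndarray,
                    bellman_hat: Callable[[int, int, np.ndarray], float],
                    lam: float) -> np.ndarray:
    """One synchronous sweep of Q(s,a) <- (1 - lam) * Q(s,a) + lam * T_hat Q(s,a)."""
    Q_new = np.empty_like(Q)
    num_states, num_actions = Q.shape
    for s in range(num_states):
        for a in range(num_actions):
            # bellman_hat(s, a, Q) returns an estimate of the risk-averse Bellman operator T^phi Q(s, a)
            Q_new[s, a] = (1.0 - lam) * Q[s, a] + lam * bellman_hat(s, a, Q)
    return Q_new
```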
2.8 Bayesian Risk-Averse Q-Learning with Streaming Data

Notice that in Sections 2.2 and 2.7 the posterior distribution $\phi$ is fixed across stages. As in many previous works, the underlying model is estimated from a fixed set of past observations before solving the problem. However, this one-time estimation of the transition model does not make use of new data that arrive later, which would help reduce the model uncertainty as well as the conservativeness caused by the risk measure, as indicated by Theorem 2.6. This motivates us to consider a data-driven framework in which we dynamically update the estimate of the model.

For this purpose, we consider a multi-stage Bayesian Q-learning algorithm. Suppose that at the beginning of stage $t$, a batch of observations in the form of three-tuples $(s_i, a_i, s'_i)$ with batch size $n(t)$ is available. These observations can be regarded as generated by some policy that is actually deployed in the real world and does not depend on the learning process. The decision maker incorporates the new data to update the posterior $\phi^t$, with which we obtain a BRMDP model $M_t$. We then carry out $m(t)$ steps of the Q-learning update, where the initial Q-function in stage $t$ is inherited from the previous stage $t-1$. This framework is shown in Figure 1.

Figure 1: Multi-stage Bayesian risk-averse Q-learning. At each stage, the agent observes new data, updates the posterior, and updates the Q-function.

3 Estimator for Bellman Operator

In Figure 1, a key step is to design a proper estimator $\widehat T^\phi$ of the Bellman operator to ensure the convergence of the Q-function. In most of the existing literature on Q-learning, convergence relies on an unbiased estimator of the Bellman operator. While such an estimator is usually easy to obtain when the operator involves only the (linear) expectation, in BRMDP an unbiased estimator of the Bellman operator is difficult, if not impossible, to obtain, because of (i) the non-linearity of the risk measures (VaR and CVaR) and (ii) the varying posterior distributions. Unbiased estimators for nonlinear functionals have been studied in [2]; however, their method cannot be directly applied here since the variance of the estimator is uncontrollable in the presence of varying posteriors. Instead, we use Monte Carlo simulation to obtain an estimator with almost surely diminishing bias. We show in Theorems 4.3 and 4.4 that the Monte Carlo estimator is sufficient to guarantee the convergence of the Q-function.

3.1 Monte Carlo Estimator for VaR and CVaR with Varying Sample Size

Define
$$f(p \mid s,a,Q) = \mathbb{E}_{s'\sim p}\big[r(s,a,s') + \gamma \max_{b\in A} Q(s',b)\big] = \sum_{s'\in S} p(s')\big[r(s,a,s') + \gamma \max_{b\in A} Q(s',b)\big]. \quad (2)$$
Given a posterior $\phi$ and a Q-function $Q$, we want to estimate $T^\phi Q(s,a) = \rho_{p\sim\phi_{s,a}}(f(p \mid s,a,Q))$, where $\rho$ is either $\mathrm{VaR}_\alpha$ or $\mathrm{CVaR}_\alpha$ and $\alpha\in(0,1)$ denotes the risk level. A Monte Carlo estimator of $\mathrm{VaR}_\alpha(f(p\mid s,a,Q))$ or $\mathrm{CVaR}_\alpha(f(p\mid s,a,Q))$ with sample size $N$ can be obtained by first drawing independent and identically distributed (i.i.d.) samples $p_1, p_2, \ldots, p_N \sim \phi_{s,a}$. Denote $X_i := f(p_i \mid s,a,Q)$, $i = 1, 2, \ldots, N$. Then
$$\widehat T^\phi_N Q(s,a) = \begin{cases} \widehat{\mathrm{VaR}}^{\phi_{s,a}}_{\alpha,N}(f(p\mid s,a,Q)) := X_{\lceil N\alpha\rceil:N} & \text{if } \rho \text{ is } \mathrm{VaR}_\alpha,\\ \widehat{\mathrm{CVaR}}^{\phi_{s,a}}_{\alpha,N}(f(p\mid s,a,Q)) := \frac{1}{\lceil N\alpha\rceil}\sum_{k=1}^{\lceil N\alpha\rceil} X_{k:N} & \text{if } \rho \text{ is } \mathrm{CVaR}_\alpha, \end{cases} \quad (3)$$
where $X_{k:N}$ is the $k$th order statistic of the $N$ samples.

Both estimators are biased with a constant sample size. Increasing the sample size reduces the bias but can be computationally inefficient. As the posterior distribution concentrates around the true model, even estimators with small sample sizes have reduced bias. We therefore use a varying sample size that adapts to the updated posterior. Initially, we set each $N_{s,a}$ to a fixed minimal sample size $\underline N$; at the beginning of each stage, we increase $N_{s,a}$ by 1 if no new observation for $(s,a)$ arrived in that stage, and decrease it by 1 (but never below $\underline N$) if the posterior on $(s,a)$ was updated with new data. The multi-stage Bayesian risk-averse Q-learning algorithms with VaR (BRQL-VaR) and CVaR (BRQL-CVaR) are presented in Algorithm 1.
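A minimal sketch of the estimator in (2)-(3), assuming the reward is stored as a 3-dimensional array r[s, a, s'] and the posterior parameter for each (s, a) as a vector over next states; these representations and function names are our own, not the paper's.

```python
import numpy as np

def f_value(p, s, a, Q, reward, gamma):
    """Equation (2): expected one-step value under a sampled transition vector p."""
    return float(np.dot(p, reward[s, a, :] + gamma * Q.max(axis=1)))

def bellman_estimate(phi_sa, s, a, Q, reward, gamma, n_samples, alpha, risk="var", rng=None):
    """Monte Carlo estimator (3) of the risk-averse Bellman operator T^phi Q(s, a)."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = rng.dirichlet(phi_sa, size=n_samples)          # p_1, ..., p_N ~ Dirichlet(phi_{s,a})
    values = np.sort([f_value(p, s, a, Q, reward, gamma) for p in samples])
    k = int(np.ceil(n_samples * alpha))                      # ceil(N * alpha)
    if risk == "var":
        return values[k - 1]                                 # VaR: the ceil(N*alpha)-th order statistic
    return float(values[:k].mean())                          # CVaR: average of the alpha lower tail
```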
4 Convergence Analysis

In this section, we give the theoretical guarantee for the proposed multi-stage Bayesian risk-averse Q-learning algorithm, which ensures the asymptotic convergence of the learned Q-function to the "optimal" Q-function. We define the random observations from the real environment and all algorithm-related random quantities on a probability space $(\Omega, \mathcal F, P)$. Recall that the real-world transition observations are assumed to be generated by some behavior policy that is not controlled by the agent. This raises the question of how to define the "optimal" Q-function. Indeed, if all streaming data are generated by some greedy policy, then the number of transition observations for some state-action pairs can remain small, in which case the Bayesian posterior does not converge and the learned policy is risk-averse yet random, since it depends on the Bayesian posterior. As in offline RL, where the policy is learned from an initially fixed data set, we define the optimal Q-function as depending on the given real-world data; the difference from pure offline RL is that the given data are streaming and random.

Definition 4.1 (Data-conditional optimal Q-function). Given an observation process $\omega = \{(s_{t,i}, a_{t,i}, s'_{t,i}) \mid t = 1, 2, \ldots,\ i = 1, 2, \ldots, n(t)\}$, let $O^\infty_{s,a,s'}(\omega) = \sum_{t=1}^{\infty}\sum_{i=1}^{n(t)} \mathbb 1\{(s_{t,i}, a_{t,i}, s'_{t,i}) = (s,a,s')\}$ and $O^\infty_{s,a}(\omega) = (O^\infty_{s,a,s'}(\omega))_{s'\in S}$ denote the numbers of transition observations under $\omega$. For each $(s,a)\in S\times A$, define the limiting posterior as
$$\phi^\omega_{s,a} = \begin{cases} \delta_{p^c_{s,a}}(\cdot) & \text{if } \|O^\infty_{s,a}(\omega)\|_\infty = \infty,\\ \mathrm{Dirichlet}\big(\phi^0_{s,a} + O^\infty_{s,a}(\omega)\big) & \text{otherwise}, \end{cases}$$
where $\delta_{p^c_{s,a}}(\cdot)$ is the Dirac measure centered at the true transition probability $p^c_{s,a} = (p^c_{s,a}(s'))_{s'\in S}$. Then $Q^{\omega,*}$ is the optimal solution to the BRMDP with "posterior" $\phi^\omega$, which satisfies $Q^{\omega,*} = T^{\phi^\omega} Q^{\omega,*}$.

The data-conditional optimal Q-function $Q^{\omega,*}$ is a random vector since the observation process is random. With this data-conditional optimality criterion, we can now prove the convergence of the multi-stage Bayesian risk-averse Q-learning algorithm. A list of notations for the different Q-functions can be found in Table 1 of the supplementary material.

Algorithm 1 Multi-stage Bayesian risk-averse Q-learning
Input: state space S, action space A, reward function r, termination stage T, Q-learning step counts {m(t)}_{t=1}^T, learning rates {λ_ℓ}_ℓ, prior distribution {φ^0_{s,a}}_{s∈S, a∈A}, minimal sample size N_min, risk measure ρ ∈ {VaR_α, CVaR_α} with risk level α.
Initialize φ ← φ^0, Q(s,a) ← 0, N_{s,a} ← N_min for all (s,a).
for t = 1 to T do
    Obtain n(t) observations (s_i, a_i, s'_i), i = 1, ..., n(t).
    for all (s,a) ∈ S × A do
        φ^t_{s,a}(s') ← φ^{t-1}_{s,a}(s') + #{i : (s_i, a_i, s'_i) = (s, a, s')} for all s' ∈ S
        if φ^t_{s,a} = φ^{t-1}_{s,a} then N_{s,a} ← N_{s,a} + 1 else N_{s,a} ← max{N_{s,a} − 1, N_min}
    end for
    for ℓ = 1 to m(t) do
        M ← Σ_{τ=1}^{t−1} m(τ), λ_{t,ℓ} ← λ_{M+ℓ}
        for all (s,a) ∈ S × A do
            Generate p_1, ..., p_{N_{s,a}} ∼ φ^t_{s,a}
            X_i ← f(p_i | s, a, Q), i = 1, ..., N_{s,a}
            Q'(s,a) ← (1 − λ_{t,ℓ}) Q(s,a) + λ_{t,ℓ} Ť^{φ^t}_{N_{s,a}} Q(s,a) using (3)
        end for
        Q(s,a) ← Q'(s,a) for all (s,a) ∈ S × A
    end for
end for
Output: Q-function {Q(s,a)}_{(s,a)}
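To make the control flow of Algorithm 1 concrete, here is a self-contained Python sketch of its main loop under our own array layout (reward stored as r[s, a, s'], flat Dirichlet prior of all ones, estimator (3) inlined). It is an illustration of the algorithm's structure under these assumptions, not the authors' implementation.

```python
import numpy as np

def brql(num_states, num_actions, reward, gamma, alpha, batches, m_steps, learning_rates,
         n_min=10, risk="var", seed=0):
    """Sketch of the main loop of Algorithm 1 (multi-stage Bayesian risk-averse Q-learning)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    phi = np.ones((num_states, num_actions, num_states))     # Dirichlet parameters; prior phi^0 = 1
    n_mc = np.full((num_states, num_actions), n_min)          # per-(s,a) Monte Carlo sample sizes
    step = 0
    for t, batch in enumerate(batches):                       # stage t: receive streaming observations
        counts = np.zeros_like(phi)
        for s, a, s_next in batch:
            counts[s, a, s_next] += 1.0
        updated = counts.sum(axis=2) > 0
        phi += counts                                         # Bayesian posterior update
        n_mc = np.where(updated, np.maximum(n_mc - 1, n_min), n_mc + 1)
        for _ in range(m_steps[t]):                           # m(t) synchronous Q-learning steps
            lam = learning_rates[step]
            step += 1
            target = reward + gamma * Q.max(axis=1)           # r(s,a,s') + gamma * max_b Q(s',b)
            Q_new = np.empty_like(Q)
            for s in range(num_states):
                for a in range(num_actions):
                    n = int(n_mc[s, a])
                    ps = rng.dirichlet(phi[s, a], size=n)     # p_1, ..., p_N ~ phi^t_{s,a}
                    vals = np.sort(ps @ target[s, a, :])
                    k = int(np.ceil(n * alpha))
                    t_hat = vals[k - 1] if risk == "var" else vals[:k].mean()  # estimator (3)
                    Q_new[s, a] = (1 - lam) * Q[s, a] + lam * t_hat
            Q = Q_new
    return Q
```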
To prove the convergence of $Q_t$, we prove (i) the convergence of $Q^{\phi^t,*}$, which depends on the convergence of the posterior distribution, and (ii) the convergence of the estimator of the Bellman operator. We summarize two important results in Proposition 4.2 and Theorem 4.3. Due to the page limit, all proofs are deferred to the supplementary material.

Proposition 4.2 (Convergence of $Q^{\phi^t,*}$). Let $Q^{\omega,*}$ be defined as in Definition 4.1 and let $\phi^t$ be the posterior distribution obtained at stage $t$. Then the optimal Q-function under the posterior $\phi^t$ converges to $Q^{\omega,*}$ almost surely, that is,
$$\lim_{t\to\infty}\|Q^{\phi^t,*} - Q^{\omega,*}\|_\infty = 0 \quad\text{almost surely},$$
where $\|\cdot\|_\infty$ is the entry-wise sup norm.

Compared with Theorem 2.6, which characterizes the difference between the BRMDP and the original MDP through a concentration bound, Proposition 4.2 ensures the strong convergence of the BRMDP model to the data-conditional optimal Q-function. The next theorem ensures that the bias of the proposed estimator of the Bellman operator with varying sample sizes converges to zero uniformly in $Q$.

Theorem 4.3 (Diminishing bias of the Bellman estimator). Denote by $(k)$ the $k$th iteration of the Q-learning update. Let $t_k$ be the stage such that $\sum_{t=1}^{t_k-1} m(t) < k \le \sum_{t=1}^{t_k} m(t)$. Let $N^{(k)}_{s,a}$ denote the sample size in iteration $k$ and $\omega_{t_k}$ denote all past observations up to stage $t_k$. Denote by $\widehat T^{(k)} Q(s,a) = \widehat T^{\phi^{t_k}}_{N^{(k)}_{s,a}} Q(s,a)$, as defined in (3). Then we have, almost surely,
$$\lim_{k\to\infty} \sup_{\|Q\|_\infty \le \frac{\bar R}{1-\gamma}} \Big|\mathbb E\big[\widehat T^{(k)} Q(s,a) - T^{\phi^{t_k}} Q(s,a) \,\big|\, \omega_{t_k}\big]\Big| = 0.$$

Theorem 4.3 provides a uniform convergence of the estimated Bellman operator, which is crucial for proving Theorem 4.4. Notably, with both streaming data and a risk measure, the proof of Theorem 4.3 is considerably more challenging than in existing work on distributionally robust Q-learning, where randomness comes only from the learning process, i.e., the random data generated by the simulator. In our setting, we have mixed randomness coming from both the observation process and the learning process, which complicates the analysis and differs from that of pure offline Q-learning. The analysis also differs from online Bayesian Q-learning, where sample complexity results are usually established assuming an optimal policy is re-solved and deployed instantly in each period. In contrast, we aim to prove the convergence of Q-learning in order to obtain the optimal policy. Together, Proposition 4.2 and Theorem 4.3 establish the convergence of Algorithm 1 in Theorem 4.4.

Theorem 4.4. Denote by $Q_t$ the Q-function given by Algorithm 1 at the end of stage $t$. Assume $T = \infty$ and that the learning rates $\{\lambda_\ell\}_{\ell=1}^\infty$ satisfy $\sum_{\ell=1}^\infty \lambda_\ell = \infty$ and $\sum_{\ell=1}^\infty \lambda_\ell^2 < \infty$. Then we have, almost surely,
$$\lim_{t\to\infty}\|Q_t - Q^{\omega,*}\|_\infty = 0.$$

Corollary 4.5. Let $Q_t$ be defined as in Theorem 4.4 and $O^\infty_{s,a}$ as in Definition 4.1. Assume $T = \infty$ and $\|O^\infty_{s,a}\|_\infty = \infty$ for all $s\in S$, $a\in A$ almost surely. Then $\lim_{t\to\infty}\|Q_t - Q^{c,*}\|_\infty = 0$ almost surely, where
$$Q^{c,*}(s,a) = \sum_{s'\in S} p^c_{s,a}(s')\big[r(s,a,s') + \gamma\max_{b\in A} Q^{c,*}(s',b)\big]$$
is the true optimal Q-function.

Corollary 4.5 follows directly from Theorem 4.4 and ensures that Algorithm 1 eventually learns the true optimal Q-function if an infinite number of observations is obtained for every state-action pair.

5 Numerical Experiments

5.1 Comparison Baselines

- BRQL-VaR: our proposed multi-stage Bayesian risk-averse Q-learning algorithm with risk measure VaR;
- BRQL-CVaR: our proposed multi-stage Bayesian risk-averse Q-learning algorithm with risk measure CVaR;
- BRQL-mean: the risk-neutral Bayesian Q-learning algorithm, i.e., the risk measure is the expectation taken with respect to the posterior distribution;
- DRQL-KL: the distributionally robust Q-learning algorithm with KL divergence (see [14]);
- DRQL-Wass: the distributionally robust Q-learning algorithm with the Wasserstein distance (see [15]).

5.2 Testing examples

Example 1: Coin Toss. Consider the following game. Each round we toss $K$ coins and observe the number of coins that show heads, where the probability of each coin showing heads is unknown. After observing the number of heads in the last toss, we may guess whether the next toss will show more heads or fewer heads. If the guess is right, we receive 1 dollar as a reward; otherwise, we pay 1 dollar. We may also choose not to guess, in which case we neither pay nor get paid. We model this game as a discounted infinite-horizon MDP. The state $s_t \in S = \{0, 1, \ldots, K\}$ denotes the number of heads in the $t$th toss. The action space is $A = \{-1, 0, 1\}$, where $a = 1$ corresponds to guessing $s_{t+1} > s_t$, $a = -1$ corresponds to guessing $s_{t+1} < s_t$, and $a = 0$ corresponds to not guessing. Hence, the reward function is $r(s,a,s') := a\,\mathbb 1\{s' > s\} - a\,\mathbb 1\{s' < s\} - |a|\,\mathbb 1\{s' = s\}$. Assume each coin shows a head with probability 0.4 and the number of coins is $K = 10$.
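A minimal simulator for Example 1, included only to make the dynamics concrete; the true head probability is of course unknown to the learner and is used here only to simulate the real environment, and the function name is ours.

```python
import numpy as np

def coin_toss_step(s, a, K=10, p_head=0.4, rng=None):
    """One transition of the Coin Toss MDP (Example 1)."""
    rng = rng if rng is not None else np.random.default_rng()
    s_next = int(rng.binomial(K, p_head))        # number of heads in the next toss
    if a == 0:
        reward = 0.0                             # no guess: no payoff
    elif s_next == s:
        reward = -1.0                            # a tie makes any guess wrong
    else:
        reward = 1.0 if (a == 1) == (s_next > s) else -1.0
    return s_next, reward
```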
Example 2: Inventory Management. Suppose a warehouse manager operates a capacitated storage system. At the beginning of period $t$, the manager observes the current inventory level $s_t$ and orders an additional amount $a_t$ of goods. An ordering cost of $c = 1$ is incurred for each unit of goods. The demand follows a truncated Poisson distribution with mean 3 and support $\{0, 1, \ldots, K\}$. Suppose the demand arrives during each period and is fulfilled at the end of the period. For each unit of fulfilled demand, we obtain a profit of $u = 5$. If a unit of demand is not fulfilled at the end of the period, the demand is lost and a penalty cost of $q = 2$ is incurred. Any goods remaining at the end of the period are carried to the next period with a holding cost of $h = 1$ per unit. The warehouse has a maximal capacity of $K$. We model this problem as a discounted infinite-horizon MDP with state space $S = \{-K, \ldots, 0, \ldots, K\}$, where $(s_t)^+ = \max(s_t, 0)$ represents the inventory level at the beginning of period $t$ and $(s_t)^- = -\min(s_t, 0)$ represents the lost demand at the end of period $t-1$. The action space is $A_s := \{0, \ldots, K - s^+\}$. The reward is
$$r(s,a,s') = -\big(c\,a + h\,(s')^+ + q\,(s')^-\big) + u\,\big(s^+ + a - (s')^+\big).$$
We consider two settings: (I) the demand in each period is uniformly distributed over $\{0, 1, \ldots, K\}$, and (II) the demand depends on the current inventory level $s$. For the second setting, we consider the case where observations are insufficient to estimate the transition probability for every state-action pair.

5.3 Experiment Setting

Coin Toss. We consider stage-wise streaming observations with batch size $n(t) = 1$ and stage-wise Q-learning with $m(t) = 1$ steps. The minimal sample size for estimating the Bellman operator is $\underline N = 10$, and the initial observed data batch size is $n(0) = 10$. We set the radius of both the KL ball and the Wasserstein ball to 0.1. The risk level for VaR and CVaR is set to 0.2 in Figure 2 and 0.4 in Figure 3.

Inventory Management. In Figure 4, we set $K = 10$, $T = 60$, $m(t) = n(t) = 5$, $n(0) = 20$, and $\underline N = 10$. The radius of the KL and Wasserstein balls is 0.05, and the risk level for VaR and CVaR is 0.2. In Figure 5, we set the capacity $K = 10$ and the size of the historical data set $n(0) = 30$. The learned policies are deployed in different environments, where the demand follows truncated Poisson distributions with means ranging from 3 to 5.
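Analogously to Example 1, here is a small simulator for the inventory dynamics of Example 2 under the cost parameters listed above; the truncation of the Poisson demand is implemented by clipping at K, which is our own assumption, and the function name is ours.

```python
import numpy as np

def inventory_step(s, a, K=10, c=1.0, u=5.0, q=2.0, h=1.0, demand_mean=3.0, rng=None):
    """One transition of the inventory MDP (Example 2). State s in {-K, ..., K}:
    s^+ is the on-hand inventory and s^- the lost demand of the previous period."""
    rng = rng if rng is not None else np.random.default_rng()
    stock = max(s, 0) + a                                   # inventory after ordering a units
    demand = min(int(rng.poisson(demand_mean)), K)          # truncated Poisson demand (clipped at K)
    s_next = stock - demand                                 # > 0: leftover stock, < 0: lost demand
    fulfilled = stock - max(s_next, 0)                      # units of demand actually served
    reward = -(c * a + h * max(s_next, 0) + q * max(-s_next, 0)) + u * fulfilled
    return s_next, reward
```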
5.4 Results

In Figures 2-4, we compare the value functions of the different algorithms as the time stage increases. The value function of each algorithm is computed by deploying the learned policy in the real environment. Each curve shows the empirical expected performance, and the band around the curve shows the 95% confidence interval. In both examples, our proposed algorithms outperform the two distributionally robust Q-learning algorithms in both expected performance and variability as the time stage increases, since our algorithms dynamically update the posterior to reduce the model uncertainty while the two DRQL algorithms learn with a fixed ambiguity set. Compared with the risk-neutral algorithm, BRQL-VaR and BRQL-CVaR achieve lower expected value functions but smaller variation, which shows the robustness of our two proposed algorithms.

Moreover, in Figure 5 we test the insufficient-data setting, where we only have a set of historical data to estimate the transition model at the initial time stage. The value function is computed by deploying the learned policy in different environments with demand following different Poisson distributions. Figure 5 indicates that the two proposed algorithms achieve higher value functions than the risk-neutral algorithm in the more adversarial settings (with Poisson parameter less than 3), showing their robustness. They also obtain lower value functions than the two DRQL algorithms in the adversarial settings and higher value functions in the other settings, indicating that the risk measure is more flexible than the worst-case criterion.

Figure 2: Coin Toss, risk level α = 0.2.
Figure 3: Coin Toss, risk level α = 0.4.
Figure 4: Inventory Management, streaming observations.
Figure 5: Inventory Management, fixed BRMDP.

6 Conclusion and Limitation

In this paper, we propose a novel multi-stage Bayesian risk-averse Q-learning algorithm to learn the optimal policy with streaming data, by reformulating the infinite-horizon MDP with an unknown transition model as an infinite-horizon BRMDP. In particular, we consider two risk measures, VaR and CVaR, for which we design a Monte Carlo estimator with varying sample sizes to approximate the Bellman operator of the BRMDP. We demonstrate the correctness of the BRMDP formulation by providing a statistical guarantee, and we prove the strong asymptotic convergence of the proposed Q-learning algorithm. The numerical results demonstrate that the proposed algorithms are efficient with streaming data and robust with limited data.

As discussed in the paper, one limitation of the current framework is that the behavior policy generating the real-world observations is assumed to be given. This is suitable for situations where it is expensive for the agent to interact with the real environment or to change the policy frequently, as doing so may incur large costs or system instability. An interesting future direction is to consider an online learning setting, where at each period the agent also needs to take actions in the real world.

ACKNOWLEDGMENT

The authors gratefully acknowledge the support of the Air Force Office of Scientific Research under Grants FA9550-19-1-0283 and FA9550-22-1-0244 and the National Science Foundation under Grant NSF-DMS2053489.

AUTHOR BIOGRAPHIES

Yuhao Wang is a Ph.D. student at the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. He received his B.S. degree from the Department of Mathematics at Nanjing University, China, in 2021. His research interests include simulation, stochastic optimization, and reinforcement learning. His email address is yuhaowang@gatech.edu and his web page is https://sites.gatech.edu/yuhaowang/.

Enlu Zhou is a Professor at the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Institute of Technology. She received the B.S. degree with highest honors in electrical engineering from Zhejiang University, China, in 2004, and received the Ph.D. degree in electrical engineering from the University of Maryland, College Park, in 2009. Her research interests include simulation optimization, stochastic optimization, and stochastic control. Her email address is enlu.zhou@isye.gatech.edu, and her web page is https://www.enluzhou.gatech.edu/.

References

[1] Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203-228, 1999.
[2] Jose H. Blanchet, Peter W. Glynn, and Yanan Pei. Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. arXiv preprint arXiv:1904.09929, 2019.
[3] Pablo Samuel Castro and Doina Precup. Using linear programming for Bayesian exploration in Markov decision processes. In IJCAI, pages 2437-2442, 2007.
[4] Richard Dearden, Nir Friedman, and David Andre. Model-based Bayesian exploration. arXiv preprint arXiv:1301.6690, 2013.
[5] Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203-213, 2010.
[6] Joseph L. Doob. Application of the theory of martingales. Le calcul des probabilités et ses applications, pages 23-27, 1949.
[7] Michael O. Duff. Monte-Carlo algorithms for the improvement of finite-state stochastic controllers: Application to Bayes-adaptive Markov decision processes. In International Workshop on Artificial Intelligence and Statistics, pages 93-97. PMLR, 2001.
[8] Michael O'Gordon Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.
[9] Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 154-161, 2003.
[10] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437-1480, 2015.
[11] Mohammad Ghavamzadeh and Yaakov Engel. Bayesian policy gradient algorithms. Advances in Neural Information Processing Systems, 19, 2006.
[12] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359-483, 2015.
[13] Yifan Lin, Yuxuan Ren, and Enlu Zhou. Bayesian risk Markov decision processes. In Advances in Neural Information Processing Systems, 2022.
[14] Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust Q-learning. In International Conference on Machine Learning, pages 13623-13643. PMLR, 2022.
[15] Ariel Neufeld and Julian Sester. Robust Q-learning algorithm for Markov decision processes under Wasserstein uncertainty. arXiv preprint arXiv:2210.00898, 2022.
[16] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26, 2013.
[17] Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582-9602. PMLR, 2022.
[18] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697-704, 2006.
[19] Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse Bayes-adaptive reinforcement learning. Advances in Neural Information Processing Systems, 34:1142-1154, 2021.
[20] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. Advances in Neural Information Processing Systems, 20, 2007.
[21] Alexander Shapiro. Interchangeability principle and dynamic equations in risk averse stochastic programming. Operations Research Letters, 45(4):377-381, 2017.
[22] Apoorva Sharma, James Harrison, Matthew Tsao, and Marco Pavone. Robust and adaptive planning under model uncertainty. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 410-418, 2019.
[23] Nian Si, Fan Zhang, Zhengyuan Zhou, and Jose Blanchet. Distributionally robust policy evaluation and learning in offline contextual bandits. In International Conference on Machine Learning, pages 8884-8894. PMLR, 2020.
[24] Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, 2000.
[25] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943-950, 2000.
[26] Richard Stuart Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984.
[27] Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian sparse sampling for on-line reward optimization. In Proceedings of the 22nd International Conference on Machine Learning, pages 956-963, 2005.
[28] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.
[29] Di Wu, Helin Zhu, and Enlu Zhou. A Bayesian risk approach to data-driven stochastic optimization: Formulations and asymptotics. SIAM Journal on Optimization, 28(2):1588-1612, 2018.
[30] Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223-3248, 2022.
[31] Enlu Zhou and Wei Xie. Simulation optimization when facing input uncertainty. In 2015 Winter Simulation Conference (WSC), pages 3714-3724. IEEE, 2015.
[32] Zhengqing Zhou, Zhengyuan Zhou, Qinxun Bai, Linhai Qiu, Jose Blanchet, and Peter Glynn. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331-3339. PMLR, 2021.

A Dirichlet Posterior on State-action Pair

A Dirichlet distribution is parameterized by a count vector $\phi = (\phi_1, \ldots, \phi_k)$ with $\phi_i \ge 1$, such that the density of the probability distribution $p = (p_1, \ldots, p_k)$ is $f(p \mid \phi) \propto \prod_{i=1}^k p_i^{\phi_i - 1}$. For each $(s,a)\in S\times A$, we impose a Dirichlet prior with parameter $\phi_{s,a} = (\phi_{s,a}(s'))_{s'\in S}$ on the unknown transition probability $p_{s,a} = (p_{s,a}(s'))_{s'\in S}$. After we observe the transition $(s,a,s')$ for $o_{s,a,s'}$ times for each $s'\in S$, the posterior distribution of $p_{s,a}$ is again a Dirichlet distribution with parameter $\phi_{s,a} + o_{s,a} = (\phi_{s,a}(s') + o_{s,a,s'})_{s'\in S}$, where $o_{s,a} = (o_{s,a,s'})_{s'\in S}$.

B Technical Proofs

Table 1: Checklist of notations for different Q-functions.
$Q^{\phi,*}$ : optimal Q-function of the BRMDP with posterior $\phi$
$Q^{\omega,*}$ : data-conditional optimal Q-function conditioned on the observation process $\omega$
$Q_t$ : Q-function at stage $t$ given by Algorithm 1
$Q^{c,*}$ : optimal Q-function of the true environment

We first introduce the following lemma, which guarantees that the maximum taken over randomized policies is equivalent to the maximum taken over deterministic policies only.

Lemma B.1 (Deterministic Bellman operator).
$$L^\phi V(s) = \max_{a\in A}\rho_{p\sim\phi_{s,a}}\big(\mathbb E_{s'\sim p}[r(s,a,s') + \gamma V(s')]\big) = \sup_{\pi\in\Pi}\mathbb E_{a\sim\pi(s)}\big[\rho_{p\sim\phi_{s,a}}\big(\mathbb E_{s'\sim p}[r(s,a,s') + \gamma V(s')]\big)\big].$$

Proof. For an arbitrary randomized policy $\pi$ and an arbitrary $V$, denote $\pi(s,a) = P(\pi(s) = a)$. We have
$$\sup_{a\in A}\rho_{p\sim\phi_{s,a}}\big(\mathbb E_{s'\sim p}[r(s,a,s') + \gamma V(s')]\big) \ge \sum_{a\in A}\pi(s,a)\,\rho_{p\sim\phi_{s,a}}\big(\mathbb E_{s'\sim p}[r(s,a,s') + \gamma V(s')]\big).$$
LϕV (s) = max a A ρp ϕs,a(Es p[r(s, a, s ) + γV (s )]) sup π Π Ea π(s)[ρp ϕs,a(Es p[r(s, a, s ) + γV (s )])]. The other direction holds naturally since a deterministic policy is also a randomized policy. Lemma B.2. (Non-expansive risk measure) Suppose the risk measure ρ satisfies Assumption 2.3, then |ρ(f1(ξ)) ρ(f2(ξ))| ||f1(ξ) f2(ξ)||ξ, , where || ||ξ, is the sup norm with probability measure P. ρξ(f1(ξ)) = ρξ(f1(ξ)+f2(ξ) f1(ξ)) ρξ(f1(ξ)+ f2(ξ) f1(ξ) ξ, ) = ρξ(f1(ξ))+ f2(ξ) f1(ξ) ξ, . Similarly, we have ρξ(f2(ξ)) ρξ(f1(ξ)) + f2(ξ) f1(ξ) ξ, . Combining together we complete the proof. Proof of Theorem 2.4 Proof. proof of 1. ε > 0, there exists a deterministic policy π, such that s S LϕV (s) ρp ϕs,π(s)(Es p[r(s, π(s), s ) + γV (s )]) + ε. LϕV (s) LϕU(s) ρp ϕs,π(s)(Es p[r(s, π(s), s ) + γV (s )]) ρp ϕs,π(s)(Es p[r(s, π(s), s ) + γU(s )]) + ε γ||Es p[V (s ) U(s )]||p, + ε γ||V U|| + ε Since ε can be chosen arbitrarily, ||LϕV LϕU|| γ||V U|| . Switching the position of V and U, we obtain the desired result. proof of 2. For the first part, since Lϕ is a contraction mapping, there exists a unique V ϕ, such that LϕV ϕ, = V ϕ, . For the second part, let π be an arbitrary randomized policy. We have V ϕ, (s) = LϕV ϕ, (s) = sup a A {ρp1 ϕs,a(Es1 p1[r(s, a, s1) + γV ϕ, (s1)])} a A π(s, a)ρp1 ϕs,a(Es1 p1[r(s, a, s1) + γV ϕ, (s1)]) a A π (s, a)ρp1 ϕs,a(Es1 p1[r(s, a, s1) + γV ϕ, (s1)]) where the third equality holds because of Lemma B.1. Furthermore, by Assumption 2.3.2, we get a A π (s, a)ρp1 ϕs,a(Es1 p1[r(s, a, s1)+γ X a1 A π (s1, a1)ρp2 ϕs1,a1(Es2 p2[r(s1, a1, s2)+γV ϕ, (s2)])]). (4) Define Lϕ π such that s S and V R|S|, Lϕ π V (s) = X a A π (s, a)ρp1 ϕs,a(Es1 p1[r(s, a, s1) + γV (s1)]). By the same induction as (4), we can derive V ϕ, (Lϕ π )k V ϕ, , k 1. Define V 0,π (s) = X a A π (s, a)ρp ϕs,a(Es1 p[r(s, a, s1)]) and V k+1,π (s) = X a A π (s, a)ρp ϕs,a(Es1 p[r(s, a, s1) + γV k,π (s1)]). By Assumption 2.3 and Lemma B.2, we have V ϕ, (s) (Lϕ π )k V ϕ, (s) V k,π (s) γk||V ϕ, || (5) |V ϕ,π (s) V k,π (s)| γk+1||V ϕ,π || γk+1 where V ϕ,π is the value function of BRMDP with posterior ϕ and policy π . Then from (5) we obtain V ϕ, (s) V ϕ,π (s) γk||V ϕ, || γk+1 1 γ R, s S. By taking k , we have V ϕ, (s) V ϕ,π (s). Since π is chosen arbitrarily, we get the desired result. Proof of Theorem 2.6 Proof. Denote by f(p|s, a, V ) =Es p[r(s, a, s ) + γV (s )] = X s S p(s )[r(s, a, s ) + γV (s )]. We first bound |f(p|s, a, V ) f(p |s, a, V )| in terms of p p for V R 1 γ . To see this, |f(p|s, a, V ) f(p |s, a, V )| = x S [p(x) p (x)][r(s, a, x) + γV (x)] |S|[ R + V ] p p Now consider p ϕs,a. Since ϕ0 s,a = 1, we have the posterior parameter ϕs,a(s ) = os,a,s + 1, where o(s, a, s ) is the number of transition from s to s taken action a in the data set. Let c M denote the empirical MDP generated with the same data which are used to estimate ϕ, with the empirical transition probability bps,a(s ) = os,a,s Os,a . Let pc s,a be the true transition distribution. Since {os,a,s }s S follows the multinomial distribution with parameter Os,a; pc s,a, we have P(|bps,a(s ) pc s,a(s )| by Chebyshev inequality. Define ps,a(s ) = Ep ϕs,a[p(s )] = ϕs,a(s ) P s S ϕs,a(s ) = os,a,s + 1 Os,a + |S|, σ2 s,a(s ) = Ep ϕs,a[(p(s ) ps,a(s ))2] = ϕs,a(s )( ϕs,a 1 ϕs,a(s ) ϕs,a 2 1( ϕs,a 1 + 1) 1 ϕs,a 1 = 1 Os,a + |S|. |p(s ) ˆps,a(s )| |p(s ) ps,a(s )| + | ps,a(s ) ˆps,a(s )| =|p(s ) ps,a(s )| + os,a,s + 1 Os,a + |S| os,a,s |p(s ) ps,a(s )| + 1 Os,a . Since |S| 1, α 1, Os,a 1, q 3 s,a 1 Os,a 0. 
Then, with probability at least 1 α |p(s ) pc s,a(s )| 3 |p(s ) ps,a(s )| 3 3 s,a | ps,a(s ) ˆps,a(s )| |ˆps,a(s ) pc s,a(s )| |p(s ) ps,a(s )| 3 3 s,a 1 Os,a |p(s ) ps,a(s )| |p(s ) ps,a(s )| 2 3s,a |S| σ2 s,a(s ) (7) where the second last inequality holds by Chebyshev inequality. Replacing a with π(s), we obtain with probability at least 1 α |S|O 1 3 s,π(s) 1 α |S| O 1 |p(s ) pc s,π(s)(s )| 3 This further implies with probability at least 1 α |S| O 1 p pc s,π(s) 3 s S Pp ϕs,π(s) |p(s ) pc s,π(s)(s )| 3 Let U = {p| p pc s,π(s) 3 q 3 s,π(s)}. Since with probability at least 1 α |S| O 1 3 , Va Rα is the α-quantile, and Pϕs,a(U) 1 α, we have inf p U f(p|s, π(s), V ) (Va Rα)p ϕs,π(s)(f(p|s, π(s), V )) sup p U f(p|s, π(s), V ). Hence with probability at least 1 α |S| O 1 |(Va Rα)p ϕs,π(s)(f(p|s, π(s), V )) f(pc s,π(s)|s, π(s), V )| sup p U f(p|s, π(s), V ) f( ps,π(s)|s, π(s), V ) 1 γ sup p U p ps,π(s) 3 s,π(s) |S| R 1 γ where the second last equality is by (6). Since this holds for all s, then we obtain with probability 1 α O 1 |(Va Rα)p ϕs,π(s)(f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V )| 3 holds for all s S. Finally, let Lϕ,π denote the Bellman operator for BRMDP with posterior ϕ and policy π, that is V ϕ,π = Lϕ,πV ϕ,π = (Va Rα)p ϕs,π(s)(f(p|s, π(s), V ϕ,π)). Similar as Theorem 2.4, Lϕ,π is a γ-contraction mapping. We have |V ϕ,π(s) V c,π(s)| =|Lϕ,πV ϕ,π(s) Lϕ,πV c,π(s) + Lϕ,πV c,π(s) V c,π(s)| Lϕ,πV ϕ,π Lϕ,πV c,π + |Lϕ,πV c,π(s) V c,π(s)| γ V ϕ,π V c,π + |Lϕ,πV c,π(s) V c,π(s)|. Since V c,π R 1 λ, we have the second term |Lϕ,πV c,π V c,π| 3 q 3 |S| R 1 γ by (8). Maximize over s on both sides, we obtain V ϕ,π V c,π λ V ϕ,π V c,π + O 1 α 3|S| R (1 γ). V ϕ,π V c,π O 1 α 3|S| R (1 γ)2 O 1 α 5|S| R (1 γ)2 . For ρ is CVa Rα, notice by (7), we have with probability at least 1 α |S|O 1 |p(s ) pc s,a(s )| 3 Hence, with probability at least 1 α |S|O 1 3 s,π(s) 1 α |S| O 1 p pc s,π(s) 3 Recall CVa Rα(X) is the conditional expectation on {X Va Rα(X)}. Let U = {p| p pc s,π(s) 3 s,π(s)} and W = {p|f(p|s, π(s), V ) (Va Rα)p ϕs,π(s)(f(p|s, π(s), V ))} for V R 1 λ. Then, with probability at least 1 α |S| O 1 |(CVa Rα)p ϕs,π(s)(f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V )| W U (f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V )d Pϕs,π(s) + 1 W U c(f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V )d Pϕs,a (f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V ) d Pϕs,π(s) + 1 (f(p|s, π(s), V )) f(ˆps,π(s)|s, π(s), V ) d Pϕs, 3 s,π(s) 2 R 1 γ + α 3 3 s,π(s) |S| R 1 γ Then, following the same proof as for Va Rα, with probability at least 1 α O 1 3 , we can bound V ϕ,π V c,π O 1 α 5|S| R (1 γ)2 . This completes the proof of the first part. For the second part, let πc be the optimal policy for Mc. Then we have for both Va Rα and CVa Rα, with probability at least 1 α O 1 3 (πc) 1 αO 1 V ϕ, V ϕ,πc V c,πc O 1 α 5|S| R (1 γ)2 . Recall π is the optimal policy for BRMDP, then V c,πc V c,π V ϕ, O 1 α 5|S| R (1 γ)2 . Combining together the proof is completed. Proof of Proposition 4.2 Proof. First notice ||Qϕt, Qω, || =||T ϕt Qϕt, T ϕωQω, || =||(T ϕt Qϕt, T ϕt Qω, ) + (T ϕt Qω, T ϕωQω, )|| γ||Qϕt, Qω, || + ||T ϕt Qω, T ϕωQω, || . Since ||Qω, || R 1 γ almost surely, we can prove the convergence of Qϕt, if we can show that almost surely for each (s, a) S A, sup ||Q|| R 1 γ |T ϕt Q(s, a) T ϕωQ(s, a)| 0 as t . (9) Fix a state-action pair (s, a), then for any observation process ω, if ||O s,a(ω)|| < , ϕt s,a = ϕω s,a after some time stage τ and (9) clearly holds for such ω. Otherwise, ||O s,a(ω)|| = . 
For those ω s, by Bayesian consistency [6], we know that for any neighborhood of pc s,a with parameter ε > 0, which is defined as Uε = n p R|S| + P x S p(x) = 1, |p(x) pc s,a(x)| ε, x S o , limt P(p Uε|ϕt s,a) = 1 almost surely. For any probability mass function p R|S| + , Recall f(p|s, a, Q) = Es p[r(s, a, s ) + γ maxb A Q(s , b)], we have p Uε, |f(p|s, a, Q) f(pc s,a|s, a, Q)| = x S [p(x) pc s,a(x)][r(s, a, x) + γ max b A Q(x, b)] ε|S|[ R + γ||Q|| ] In the following we fix a sample path ω such that ||O s,a(ω)|| = and drop the notation of ω for simplicity. We have almost surely: If ρ is Va Rα: Since P(p Uε|ϕt s,a) 1, for any risk level α > 0, P(p Uε|ϕt s,a) > 1 α for t large enough. Since Va Rα is the α-quantitle, we have Va R ϕt s,a α (f(p|s, a, Q)) inf p Uε f(p|s, a, Q) f(pc s,a|s, a, Q) ε |S| R Va R ϕt s,a α (f(p|s, a, Q)) sup p Uε f(p|s, a, Q) f(pc s,a|s, a, Q) + ε |S| R Combining the two inequalities above, we have |Va R ϕt s,a α (f(p|s, a, Q)) f(pc s,a|s, a, Q)| = |T ϕt Q(s, a) T ϕωQ(s, a)| ε |S| R Since this holds for any Q with ||Q|| R 1 γ and ε > 0, we know lim t sup ||Q|| < R 1 γ |T ϕt Q(s, a) T ϕωQ(s, a)| = 0. This completes the proof. If ρ is CVa Rα: CVa R ϕt s,a α (f(p|s, a, Q)) =Va R ϕt s,a α (f(p|s, a, Q)) + 1 αEq ϕts,a[Va R ϕt s,a α (f(p|s, a, Q)) f(q|s, a, Q)]+ =Va R ϕt s,a α (f(p|s, a, Q)) + 1 αEq ϕts,a,q Uε[Va R ϕt s,a α (f(p|s, a, Q)) f(q|s, a, Q)]+ αEq ϕts,a,q U cε [Va R ϕt s,a α (f(p|s, a, Q)) f(q|s, a, Q)]+ Since Va R ϕt s,a α (f(p|s, a, Q)) f(pc s,a|s, a, Q) and ϕt s,a(U c ε) 0, we have for large enough t, 1. |Va R ϕt s,a α (f(p|s, a, Q)) f(pc s,a|s, a, Q)| ε, 2. Eq ϕts,a,q Uε[Va R ϕt s,a α (f(p|s, a, Q)) f(q|s, a, Q)]+ 2ε |S| R 1 γ , 3. Eq ϕts,a,q U cε [Va R ϕt s,a α (f(p|s, a, Q)) f(q|s, a, Q)]+ P(p U c ε|ϕt s,a) 2 R 1 γ 2ε |S| R 1 γ . Then we obtain |CVa R ϕt s,a α (f(p|s, a, Q)) f(pc s,a|s, a, Q)| ε 1 + 4|S| R α(1 γ) Again, by arbitrary ε and the uniformness in Q, we obtain the desired result. Proof of Theorem 4.3 Proof. Proof sketch. The complete proof is technical. We first provide a proof sketch. Intuitively, the bias term converges to 0 since we either have posterior to concentrate on the true parameter or have an increasing sample size for the Monto Carlo estimator, both of which reduce the bias term to zero asymptotically. However, a major difficulty is uniform convergence (in terms of all possible values of Q and sample size N), for which existing results do not give a straightforward guarantee. We prove the results by considering the two cases. First, when infinite observations are available, i.e., ||O s,a|| = , we construct i.i.d. samples h(pi) f(pi|s, a, Q) with the same sampled pi for an arbitrary given ε, where f(pi|s, a, Q) is defined as in (2). with probability at least 1 ε, h(pi) takes a value no less than f(pi|s, a, Q) by a constant multiple of ε. with probability at least ε, h(pi) takes a value bounded by a constant. With the help of the Stirling formula and subtle calculation, we can bound the expectation of order statistics of h(pi), which further lower bounds the Bellman operator estimator. The upper bound can be obtained in a similar way. Together we can bound the bias term. Second, in the case of limited observations, we prove a uniform convergence of empirical distribution for f(pi|s, a, Q) when Q belongs to a bounded set. We carefully partition the probability space as a union of disjoint rectangular sets, to which the Glivenko-Cantelli theorem can be applied and the uniform convergence of the empirical distributions follows. 
Combining the two cases we complete the proof. Formal proof. Again, we fix a state-action pair (s, a), an observation process ω for which we drop the notation for simplicity. Let Q be any Q-function such that ||Q|| R 1 γ . We first prove for the Va R risk functional. Estimator for Va R we consider the two cases: O s,a = , then ϕt s,a is consistent at the true parameter pc s,a. ε > 0, P(p Uε|ϕt s,a) 1 almost surely for such ω, where Uε = n p R|S| + P x S p(x) = 1, |p(x) pc s,a(x)| ε, x S o . Then ε > 0, P(p Uε|ϕt s,a) 1 ε for all large t. At iteration k such that tk large enough, we ob- tain p(k) 1 , p(k) 2 , . . . , p(k) N(k) s,a ϕtk s,a. For each p(k) i , we have P(p(k) i Uε|ϕtk s,a) 1 ε. Furthermore, since ϕtk s,a is always smooth function, we can find Φ(k) i Uε such that P(p(k) i Φ(k) i |ϕtk s,a) = 1 ε. Recall f(p|s, a, Q) = X s S p(s )[r(s, a, s ) + γ max b A Q(s, a)]. h(p(k) i ) = inf p Uε f(p|s, a, Q) if p(k) i Φ(k) i inf p f(p|s, a, Q) if p(k) i Φ(k) i It is easy to see f(p(k) i |s, a, Q) h(p(k) i ). Notice the distribution of sampled p(k) i only depends on ϕtk s,a, N (k) s,a , which are further determined by ωtk. Conditioned on ωtk, h(p(k) i ), i = 1, 2, . . . , N (k) s,a are i.i.d. random variables taking two values infp Uε f(p|s, a, Q) and infp f(p|s, a, Q) with probability at least 1 ε and ε, respectively. Notice infp f(p|s, a, Q) R 1 γ . And | inf p Uε f(p|s, a, Q) Va Rϕ tk s,a α (f(p|s, a, Q))| | inf p Uε f(p|s, a, Q) f(pc s,a|s, a, Q)| + |f(pc s,a|s, a, Q) Va Rϕ tk s,a α (f(p|s, a, Q))|. By (10), we have | infp Uε f(p|s, a, Q) f(pc s,a|s, a, Q)| ε |S|||Q|| 1 γ . By (11), we have |f(pc s,a|s, a, Q) Va Rϕ tk s,a α (f(p|s, a, Q))| ε |S|||Q|| 1 γ for k large enough. Hence, we obtain | inf p Uε f(p|s, a, Q) Va Rϕ tk s,a α (f(p|s, a, Q))| 2ε|S|||Q|| where ||Q|| again can be bounded by R 1 γ . Denote by Xi = f(p(k) i |s, a, Q) and Yi = h(p(k) i ). We first bound the conditional expectation of the order statistic Y N(k) s,aα :N(k) s,a , i.e., E[Y N (k) s,aα :N(k) s,a |ωtk]. Since {Yi|ωtk} are i.i.d., we can compute the conditional mass function of Y nα :n as P(Y nα :n = inf p f(p|s, a, Q)|ωtk) = X εj(1 ε)n j. (12) We now lower bound the conditional expectation of Y nα :n by upper bound (12). First, we can show n j εj(1 ε)n j is decreasing in term of j for j > nε. To see this, take logarithm on the right hand side we have η(j) = log(n!) log((n j)!j!) + j log ε + (n j) log(1 ε). Compute the difference η(j + 1) η(j) = log(n j j + 1 ε 1 ε). Then η(j + 1) η(j) < 0 j (n + 1)ε 1. Hence for ε < α and j nα nε > (n + 1)ε 1, η(j) is decreasing. Hence we can upper bound (12) by (n + 1 nα ) n nα ε nα (1 ε)n nα . =(n + 1 nα ) n! (n nα )! nα !ε nα (1 ε)n nα (13) By Stirling Formula, n e 1 12n < 2 Then we have (13) < (n + 1 nα ) 2 e n ε nα (1 ε)n nα 2π(n nα ) n nα = (n + 1 nα ) 2ε nα (1 ε)n nα q For n 2 α(1 α), we have 1. (n + 1 nα ) 2n(1 α), 2. ε nα εnα, 3. (1 ε)n nα (1 ε)n(1 α)/(1 ε), n ) nα ( nα n )nα+1 αnα α, n )n nα (1 nα n )n(1 α) ( 1 α 2 π n(1 α) εnα(1 ε)n(1 α)/(1 ε) q 2 )n(1 α) nααnα α (1 ε)α 3 2 π n εα(1 ε)1 α = C 1 ε nβn, where C = 4 α 3 2 π is a constant and β = εα(1 ε)1 α 2 )1 α . For ε small enough, we can ensure ζ(x) = log( xβx) = 1 2 log x + x log β. Then, since log β < 0, ζ(x) attains the maximum at x = 2 log β . Then we have C 1 ε nβn = C 1 εeζ(n) C 1 εeζ( 2 log β ) 2 log β e 2 For n 2 α(1 α) , we have P(Y nα :n = inf p f(p|s, a, Q)|ωtk) P(Y1:n = inf p f(p|s, a, Q)|ωtk) 1 (1 ε) 2 α(1 α) C (ε) := max{ C 1 ε 2 log β e 2, 1 (1 ε) 2 α(1 α) }, where C (ε) 0 as ε 0. 
Then we have E[Y nα :n|ωtk] =P(Y nα :n = inf pi f(p|s, a, Q)|ωtk) (inf p f(p|s, a, Q)) + P(Y nα :n = inf p Uε f(p|s, a, Q)|ωtk) inf p Uε f(p|s, a, Q) P(Y nα :n = inf p f(p|s, a, Q)|ωtk) ( R 1 γ ) + P(Y nα :n = inf p Uε f(p|s, a, Q)|ωtk) (Va Rϕ tk s,a α (f(p|s, a, Q)) 2ε|S|||Q|| ) C (ε) ( R 1 γ ) + (1 C (ε)) (Va Rϕ tk s,a α (f(p|s, a, Q)) 2ε|S| R 1 γ ) Hence we have almost surely, E[X N(k) s,aα :N(k) s,a Va R ϕt s,a α (f(p|s, a, Q))|ωtk] E[Y N(k) s,aα :N(k) s,a Va Rϕ tk s,a α (f(p|s, a, Q))|ωtk] C (ε) ( R 1 γ ) + Va Rϕ tk s,a α (f(p|s, a, Q)) 2ε(1 C (ε))|S| R 1 γ 2(C (ε) + ε(1 C (ε))|S|) R 1 γ where C(ε) 0 as ε 0. Similarly by constructing h(p(k) i ) = sup p Uε f(p|s, a, Q) if p(k) i Ψi sup p f(p|s, a, Q) if p(k) i Ψi We can obtain E[X N(k) s,aα :N(k) s,a Va Rϕ tk s,a α (f(p|s, a, Q))|ωtk] b C(ε), almost surely for k large enough and b C(ε) 0 as ε 0. Notice T ϕtk Q(s, a) = Va Rϕ tk s,a α (f(p|s, a, Q)). Furthermore, since both C(ε) and b C(ε) do not depend on Q and by arbitrary ε, we obtain lim k sup ||Q|| R 1 γ E[b T (k)Q(s, a) T ϕtk Q(s, a)|ωtk] = 0. If O s,a < , then the posterior distribution on (s, a) remains the same as ϕω s,a for all large t and the sample size N (k) s,a tends to infinity. In this case, by Glivenko-Cantelli theorem, we know almost surely the empirical distribution conditioned on ϕω s,a (which is further determined by ω), F (k) N(k) s,a , for the sampled p(k) 1 , p(k) 2 , . . . , p(k) N (k) s,a uniformly converges to the distribution ϕω s,a as sample size N (k) s,a goes to infinity. Denote by X(k) i = f(p(k) i |s, a, Q) = P s S p(k) i (s )(r(s, a, s )+γ maxb A Q(s , b)) = P s S ds p(k) i (s ) = d p(k) i . We want to show uniform convergence of empirical distribution of {X(k) i }i=1,2,...,N (k) s,a conditioned on ϕω s,a for d D uniformly, where D contains all the possible value d can take. Denote by Gd n the empirical distribution conditioned on ωtk of {X(k) i }i=1,2,...,n in terms of d and Gd the true distribution of d p for p ϕω s,a. We want to show lim n sup d D,x |Gd n(x) Gd(x)| 0 almost surely. Notice that |d| R 1 γ and (d+ε1) ϕ = d ϕ+ε, we have Gd n(x) = Gd+ε1 n (x+ε). Hence we only need to prove for D = {d R|S| : M1 d 1}, where M = 2 R 1 γ + 1. Denote by Pn(K) = 1 n Pn i=1 1{pi K} where p1, , pn ϕω s,a are i.i.d samples. For an arbitrary ε = 1 e N , we have s=1 dsqs x, (ki 1)ε < qi kiε, i = 2, . . . , |S|) s=2 ds(ks 1)ε), (ki 1)ε < qi kiε, i = 2, . . . , |S| Since the last probability is just some probability of rectangulars, it can be expressed by finite number of cumulative distribution function, which by the Glivenko-Cantelli theorem, uniformly converge to the true distribution. Then, almosdt surely there exists nε > 0 (independent of d, ε, ki), for n nε, we have each s=2 ds(ks 1)ε), (ki 1)ε < qi kiε, i = 2, . . . , |S| s=2 ds(ks 1)ε), (ki 1)ε < qi kiε, i = 2, . . . , |S| + 1 e N |S| Hence we obtain k|S|=1 Pq ϕω s,a s=2 ds(ks 1)ε), (ki 1)ε < qi kiε, i = 2, . . . , |S| k|S|=1 Pq ϕω s,a s=1 dsqs x + s=2 dsε, (ki 1)ε < qi kiε, i = 2, . . . , |S| s=1 dsqs x + s=1 dsqs x + (|S| 1)M Since this holds for any d, x, we have sup d D,x R e N + sup d D,x R s=1 dsqs x + (|S| 1)M The last term converges to zero as e N . This is indicated by s=1 dsqs x + (|S| 1)M s=1 dsqs x + (|S| 1)M s=1 Pq ϕω s,a ds + (|S| 1)M s=1 Pq ϕω s,a ds + (|S| 1)M Since each 0 qs 1 and its marginal distribution ϕω s,a(qs) is continuous, it is uniformly continuous in 0 q 1. Hence each Pq ϕω s,a x ds < qs x ds + (|S| 1)M converge uniformly to 0 as e N . 
Hence by arbitrary e N, we have lim sup n sup d D,x R Similarly, we can obtain the other side as lim inf n inf d D,x R sup d D,x |Gd n(x) Gd(x)| 0 almost surely. We obtain the uniform convergence for empirical distribution in d, or equivalently, Q, conditioned on the observation process ω. Hence the quantile estimator X nα :n Va R ϕω s,a α (Q) uniformly in Q almost surely. Furthermore, since Xi is bounded, ||Q|| R 1 γ E[ b T (k)Q(s, a) T ϕtk Q(s, a)|ωtk] = 0 almost surely . Estimator for CVa R Given sample p(k) 1 , p(k) 2 , . . . , p(k) N(k) s,a and Xi = f(p(k) i |s, a, Q) = P s S p(k) i (s )[r(s, a, s ) + γ maxb A Q(s , b)]. The Monto Carlo estimator for CVa Rϕ tk s,a α (f(p|s, a, Q)) is 1 N (k) s,aα P N(k) s,aα j=1 Xj:N(k) s,a . Again we consider the two cases: O s.a = . Then we know almost surely ϕt s,a is consistent at pc s,a. Then by the proof for bias of Va R estimator, 1 > ε > 0, E[X N(k) s,aε :N(k) s,a |ω] Va Rϕ tk s,a ε (f(p|s, a, Q)) as k uniformly in Q almost surely. Then N (k) s,a α N (k) s,aα X j=1 Xj:N(k) s,a |ωtk N (k) s,a α N (k) s,aε X j=1 Xj:N(k) s,a |ωtk N (k) s,a α N(k) s,aα X j= N(k) s,aε Xj:N(k) s,a |ωtk N (k) s,a ε N (k) s,a α R 1 γ + N (k) s,a α N (k) s,a ε N (k) s,a α E[X N (k) s,aε :N(k) s,a |ωtk] Since E[X N(k) s,aε :N(k) s,a |ϕtk s,a] Va Rϕ tk s,a ε (f(p|s, a, Q)), Va Rϕ tk s,a ε (f(p|s, a, Q)) f(pc s,a|s, a, Q), CVa R ϕt s,a α (f(p|s, a, Q)) f(pc s,a|s, a, Q) uniformly for R 1 γ Q R 1 γ as k and ε is chosen arbitrarily, we have lim infk {E 1 N(k) s,aα P N (k) s,aα j=1 Xj:N(k) s,a |ωtk CVa Rϕ tk s,a α (f(p|s, a, Q))} 0 uniformly in Q almost surely. Similarly we can also prove lim supk {E h 1 nα P nα j=1 Xj:n|ωtk i CVa Rϕ tk s,a α (f(p|s, a, Q))} 0 uniformly in Q almost surely. Combining the two inequalities together, we obtain lim k sup ||Q|| R 1 γ E[b T (k)Q(s, a) T ϕtk Q(s, a)|ωtk] = 0 almost surely . For O s,a( ) < . By proof of Va R estimator we know the empirical distribution of f(p(k) i |s, a, Q) conditioned on ω, converge uniformly to ϕω s,a. Hence the CVa R estimator also has the desired result. Lemma B.3. [24] Consider a stochastic process (αt, t, g) , t 0, where αt, t, g : X R satisfy the equations t+1(x) = (1 αt(x)) t(x) + αt(x)g(x), x X, t = 0, 1, 2, . . . Let Pt be a sequence of increasing σ-fields such that α0 and 0 are P0-measurable and αt, t and Ft 1 are Pt-measurable. Let || ||W denote some weighted maximum norm. Assume that the following hold: 1. the set X is finite. 2. 0 αt(x) 1, P t αt(x) = , P t α2 t(x) < w.p.1. 3. E {g( ) | Pt} W κ t W + ct, where κ [0, 1) and ct converges to zero w.p.1. 4. Var {g(x) | Pt} K (1 + t W )2, where K is some constant. Then, t converges to zero with probability one (w.p.1). Proof for Theorem 4.4. Proof. Following the same notation in Theorem 4.3. Denote by Q(k) the Q-function in iteration k. Define bias(k)(s, a) = E[b T (k)Q(k)(s, a) T ϕtk Q(k)(s, a)|ωtk]. Then by Theorem 4.3 and observing ||Q(k)|| < R 1 γ almost surely, we have almost surely, lim k bias(k)(s, a) = 0 s S, a A. Denote by (k) = Q(k) Qω, . (k) satisfies (k+1)(s, a) = (1 λk) (k) + λk F(k), where F(k) = b T (k)Q(k) Qω, . Let H(k) be the σ-field generated by {{ (κ)}k κ=1, {λ(κ)}k κ=1, {F(κ)}k 1 κ=1}. 
We have |E[F(k)(s, a)|H(k)]| =|E[E[F(k)(s, a)|ωtk, H(k)]|H(k)]| =|E[T (k)(Q(k))(s, a) Qϕtk , (s, a) + Qϕtk , (s, a) Qω, (s, a) + bias(k)(s, a)|H(k)]| =|E[b T (k)(Q(k) Qϕtk , )|H(k)] + E[Qϕtk , (s, a) Qω, (s, a)|H(k)] + E[bias(k)(s, a)|H(k)]| γE[max s S | max b A Q(k)(s , b) max b A Qϕtk , (s , b)||H(k)] + E[||Qϕtk , Qω, || |H(k)] + E[|bias(k)(s, a)||H(k)] γE[||Q(k) Qϕtk , || |H(k)] + E[||Qϕtk , Qω, || |H(k)] + E[|bias(k)(s, a)||H(k)] γE[||Q(k) Qω, || |H(k)] + (1 + γ)E[||Qϕtk , Qω, || |H(k)] + E[|bias(k)(s, a)||H(k)] γ|| (k)|| + (1 + γ)E[||Qϕtk , Qω, || |H(k)] + E[|bias(k)(s, a)||H(k)]. where || || denote the entry-wise sup norm. The second equality holds since the Monto Carlo sampling procedure in iteration k only depends on the posterior ϕtk and sample size N (k) s,a , which are completely determined by the past observations ωtk. By Proposition 4.2 and Theorem 4.3, we know both ||Qϕtk , Qω, || and |bias(k)(s, a)| converge to 0 almost surely. Since they are both bounded, by the bounded convergence theorem the last two terms in the last inequalities converge to zero almost surely. Furthermore, since |F(k)(s, a)| R + || (k)|| , we have Var(F(k)(s, a)|H(k)) ( R + || (k)|| )2 ( R2 + 1)(1 + || (k)|| )2. By Lemma B.3, we obtain || (k)|| converges to 0 almost surely, which completes the proof.