# Robust Reinforcement Learning with General Utility

Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang
Department of Computer Science, Institute of Health Computing, University of Maryland College Park, College Park, MD 20742, USA
{zc286,ywen1,zhu123,heng}@umd.edu

The Reinforcement Learning (RL) problem with general utility is a powerful decision-making framework that covers standard RL with cumulative cost, exploration problems, and demonstration learning. Existing works on RL with general utility do not consider robustness under environmental perturbation, which is important for adapting an RL system to a real-world environment that differs from the training environment. To train a robust policy, we propose a robust RL framework with general utility, which subsumes many existing RL frameworks including RL, robust RL, RL with general utility, constrained RL, robust constrained RL, pure exploration, robust entropy regularized RL, etc. We then focus on popular convex utility functions, with which our proposed learning framework becomes a challenging nonconvex-nonconcave minimax optimization problem, and design a two-phase stochastic policy gradient type algorithm and obtain its sample complexity result for gradient convergence. Furthermore, for convex utility on a widely used polyhedral ambiguity set, we design an algorithm and obtain its convergence rate to a global optimal solution.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

1 Introduction

Reinforcement learning (RL) is an important decision-making framework [41] aiming to find the optimal policy that minimizes the accumulated cost, which is a linear utility function of the occupancy measure. Recent works have extended standard RL to more general utility functions to account for a variety of practical needs, including risk-sensitive applications [22, 8, 52], exploration maximization [18, 54, 51, 13, 6], and safety constraints [54, 51, 13]. There are provably convergent algorithms to solve RL with general utility [54, 55, 6]. However, these works study RL with general utility in a fixed environment, which may fail in many applications where the policy is trained in a simulation environment but implemented in a different real-world environment [37, 56]. To make the policy robust to such environmental change, robust RL has been proposed to find the optimal robust policy under the worst possible environment [36, 20, 48, 45, 14]. However, to our knowledge, all existing robust RL works are restricted to linear utility functions. Therefore, we aim to answer the following research question:

Q: Can we train a robust policy for RL with general utility and obtain convergence results?

1.1 Our Contributions

We affirmatively answer this question by proposing robust RL with general utility, the first learning framework that obtains a robust policy for general utility in the worst possible environment. It is formulated as a minimax optimization problem min_{θ∈Θ} max_{ξ∈Ξ} f(λ_{θ,ξ}), where f is the utility function and λ_{θ,ξ} is the occupancy measure under the policy parameter θ ∈ Θ and the environmental transition kernel parameter ξ ∈ Ξ. Our robust RL with general utility is a combination of its two important special cases, namely, RL with general utility [54] (formulated as min_{θ∈Θ} f(λ_{θ,ξ}) where the environmental parameter ξ is fixed) and robust RL [36], where f is restricted to a linear utility function.
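To make the formulation concrete, the following minimal sketch evaluates the worst-case objective max_{ξ∈Ξ} f(λ_{θ,ξ}) for two representative utilities: a linear cumulative-cost utility (standard robust RL) and a convex negative-entropy utility (pure exploration). Here `occupancy_measure(theta, xi)` is a hypothetical oracle for the occupancy measure defined in Section 2, and the finite list `candidate_kernels` stands in for the ambiguity set Ξ; both names are illustrative and not part of the paper's algorithms.

```python
import numpy as np

def robust_utility(theta, candidate_kernels, occupancy_measure, utility):
    """Worst-case utility max over candidate kernels xi of f(lambda_{theta, xi})."""
    return max(utility(occupancy_measure(theta, xi)) for xi in candidate_kernels)

def linear_utility(cost):
    """Linear utility <c, lambda>: with this choice the framework reduces to robust RL."""
    return lambda lam: float(np.sum(cost * lam))

def negative_entropy_utility(lam):
    """Convex utility sum_s lambda(s) log lambda(s); minimizing it maximizes state-visitation entropy."""
    lam_s = lam.sum(axis=1)  # state occupancy measure lambda(s)
    return float(np.sum(lam_s * np.log(np.maximum(lam_s, 1e-12))))
```

Passing a single nominal kernel as `candidate_kernels` recovers (non-robust) RL with general utility, as in Example 1 of Section 2.1.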
This new learning framework also covers many other existing RL frameworks including constrained RL [2] and robust constrained RL [43] with safety critical applications such as healthcare and unmanned drones, entropy regularized RL [10] and its robust extension [32] which help agents learn from human demonstration, pure exploration [18] with application to explore an environment with sparse reward signals and its robust extension, etc. These examples use convex utility functions f, which is the focus of this paper. See Section 2.1 for details of these examples. Then, we focus on designing provably convergent algorithms for our new proposed learning framework with the widely used convex utility function f. In this case, our objective minθ maxξ f(λθ,ξ) is still a highly challenging nonconvex-nonconcave minimax optimization problem. Hence, we have to utilize the structure and properties of λθ,ξ to design algorithms and obtain convergence results. To elaborate, we design a projected stochastic gradient descent ascent algorithm with two phases. Interestingly, the first phase targeted at the objective function f obtains a stationary point of a different envelope function. Hence, we add a second phase targeted at a corrected objective ef to converge to a near-stationary solution of the original objective f. The convergence analysis is non-trivial with two novel techniques. First, we have proved a projected gradient dominance property (Proposition 4) that is much stronger than the existing one on convex utility, with less assumptions, no bias term and applicability to more general parameterized policy. Second, in the convergence analysis of the second phase, we obtain convergence to a global Nash equilibrium (thus a stationary point) of ef by Proposition 4, which is close to a stationary point of f by proving that ξ ef(λθ,ξ) ξf(λθ,ξ). Furthermore, with convex utility function f and the widely used s-rectangular polyhedral ambiguity set Ξ (including the popular L1 and L ambiguity sets), we design an alternative algorithm which converges to a global optimal solution of this nonconvex-nonconcave optimization problem at a sublinear rate O(1/K) (Theorem 3). This is much more challenging than global convergence for convex RL (that is, RL with convex utility function and fixed ξ) [54, 51, 6] and for robust RL with linear utility satisfying Bellman equation [36, 20, 45, 15, 25], so we need novel algorithm design and novel techniques. First, we prove that arg maxξf(λθ,ξ) can be obtained in the finite set of vertices V (Ξ) (Proposition 6). This is intuitive if f(λθ, ) is a convex function but in many applications, only f(λ) is convex. To solve this challenge, we prove a novel local invertibility property of λθ, (Proposition 5) by checking Bellman equation of λθ,ξ state by state in two cases. Then we prove Proposition 6 using a novel state-by-state extension from an optimal non-vertex ξ to an optimal vertex ξ. Second, the major difficulty to design our algorithm is to find a descent direction of Γ(θ) := maxξ f(λθ,ξ). We select the near-optimal vertices ξ Ξk V (Ξ) that may affect the optimization progress Γ(θk+1) Γ(θk), and find the descent direction with provably large descent for all the corresponding functions {f(λθk,ξ)}ξ Ξk (Proposition 7) via convex optimization. Third, by Proposition 6, the global convergence measure k := Γ(θk) minθ Γ(θ) at each iteration k either is O(1/k)-close to optimal (Γ(θk) O(1/k)) or enjoys large descent (Eq. 
(26)), so we prove the convergence in 3 cases: O(1/K)-optimal final θK, iterate Eq. (26) from θ0 or from a O(1/k)-optimal intermediate θk. 1.2 Related Works RL with General Utility. Standard RL aims to optimize over the accumulated reward/cost [21, 41]. Some early operation research works focus on other non-linear objectives such as variance-penalized MDPs [12], risk-sensitive objectives [22, 8, 52], entropy exploration [18], constrained RL [2, 1, 35] and learning from demonstration [39, 3]. [54] proposes RL with general utilities to cover the above applications and applies variational policy gradient method that provably converges to the global optimal solution for convex utility. [55] proposes a variance reduced policy gradient algorithm which requires e O(ϵ 3) samples to achieve an ϵ-stationary policy for general utility and e O(ϵ 2) samples to achieve an ϵ-global optimal policy for convex utility and overparameterized policy. [51] provides a meta-algorithm to solve the convex MDP problem as a min-max game between a policy player and a cost player who produces rewards that the policy player must maximize. They further show that any method-solving problems under the standard RL settings can be used to solve the more general convex MDP problem. [27] obtains policy gradient theorem for RL with general utilities. [6] proposes a simpler single-loop parameter-free normalized policy gradient algorithm with recursive momentum variance reduction. This algorithm requires e O(ϵ 3) samples to achieve ϵ-stationary policy in general and e O(ϵ 2) samples to achieve ϵ-global optimal policy for convex utility. For large finite state action spaces, it requires e O(ϵ 4) samples to achieve ϵ-stationary policy via linear function approximation of the occupancy measure. [53] proposes decentralized multi-agent RL with general utilities. [13] shows that convex RL is a subclass of multi-agent mean-field games. Robust RL. Robust RL is designed to learn a policy that is robust to perturbation of environmental factors. Usually robust RL is NP-hard [45], but becomes tractable for ambiguity sets that is (s, a)- rectangular [36, 20, 45, 44, 29, 56] or s-rectangular [45, 42, 23, 26]. Methods to solve robust RL include value iteration [36, 20, 45, 15, 25], policy iteration [20, 4, 24] and policy gradient [29, 44, 56, 42, 26, 17, 28]. 2 Robust Reinforcement Learning with General Utility Notations. The space of probability distribution on a space X is denoted as X . If X is finite, we denote its cardinality as |X|. p denotes p-norm of vectors and = 2 by default. Reinforcement Learning with General Utility. Reinforcement Learning (RL) with general utility is an emerging learning framework [54, 55, 6], specified by a tuple S, A, pξ, f, ρ, γ , with finite state space S, finite action space A, transition kernel pξ ( S)S A parameterized by ξ Ξ (Ξ RdΞ is convex and compact), discount factor γ (0, 1), general utility function f : S A R and the distribution ρ S of the initial state s0. At time t, given the environmental state st, the agent takes action at πθ( |st) based on a policy πθ ( A)S parameterized by θ Θ (Θ RdΘ is convex). Then the environment transitions to state st+1 pξ( |st, at). The occupancy measure λθ,ξ S A at (s, a) S A is defined below. λθ,ξ(s, a) def = (1 γ) t=0 γt Pπθ,pξ(st = s, at = a|s0 ρ), (1) where Pπθ,pξ denotes the probability measure of the Markov chain {st, at}t 0 induced by policy πθ, transition kernel pξ and the initial distribution ρ. 
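Since every object below is built from the occupancy measure (1), a minimal computational sketch may be useful. Assuming a small tabular MDP with explicit arrays `policy[s, a] = πθ(a|s)`, `kernel[s, a, s'] = pξ(s'|s, a)`, and initial distribution `rho`, λθ,ξ can be computed exactly by solving the Bellman flow equation (Lemma 2 in Appendix B) as a linear system; the array names are illustrative.

```python
import numpy as np

def occupancy_measure(policy, kernel, rho, gamma):
    """Exact discounted occupancy measure lambda_{theta, xi} of Eq. (1) for a tabular MDP.

    policy : (S, A) array, policy[s, a] = pi_theta(a | s)
    kernel : (S, A, S) array, kernel[s, a, s'] = p_xi(s' | s, a)
    rho    : (S,) initial state distribution
    gamma  : discount factor in (0, 1)
    """
    S, _ = policy.shape
    # State-to-state transition matrix under the policy: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a).
    P_pi = np.einsum("sa,sap->sp", policy, kernel)
    # The state occupancy d solves d = (1 - gamma) * rho + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * rho)
    # Eq. (33): lambda(s, a) = d(s) * pi_theta(a | s).
    return d[:, None] * policy
```

The paper's algorithms instead estimate λθ,ξ from truncated trajectories (Eq. (40) in Appendix C); the exact solve above is only meant for small tabular instances such as the experiment in Appendix A.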
The aim of the agent is to find the optimal policy πθ that solves minθ Θ f(λθ,ξ) given fixed transition kernel pξ. Here, f(λθ,ξ) can be seen as the overall cost of selecting policy πθ in the environment pξ, and there are many examples of the utility function f covering a variety of applications (See Section 2.1). However, existing works on RL with general utility assume a fixed environmental transition kernel pξ, which may fail in many applications where the policy is deployed in a real-world environment different from the simulation environment for training. To obtain a policy that is robust to such environmental change, we propose a new learning framework called robust RL with General Utility as follows. Our Proposed Robust RL with General Utility. The goal of our proposed robust RL with general utility is to find an optimal robust policy under the worst possible environmental parameter ξ from an ambiguity set Ξ, as formulated by the following minimax optimization problem with general utility function f. min θ Θ max ξ Ξ f(λθ,ξ), (2) In practice, the distance between the real-world environment (for deployment) and simulation environment (for training) is assumed to be bounded. Therefore, Ξ is usually set as a neighborhood around the nominal kernel bξ estimated from the simulation environment, i.e. Ξ = {ξ RdΘ : d(ξ, bξ) d0} with distance measure d and the distance upper bound d0 0. 2.1 Examples of Our Robust RL with General Utility Example 1: RL with General Utility. When Ξ = {bξ} for a fixed environmental parameter bξ, our proposed robust RL with general utility (2) reduces to (non-robust) RL with general utility minθ Θ f(λθ,bξ), as introduced above. Example 2: Robust Constrained RL and Its Special Cases. Robust constrained RL [38, 43, 40] is an emerging learning framework where an agent should obey safety conditions in all possible real-world environments, which is important in safety critical applications such as healthcare and unmanned aerial vehicle [43]. For math formulation, denote c(0), c(1), . . . , c(K) as cost functions S A R. At time t, the agent receives performance-related cost c(0)(st, at) and safety-related costs {c(k)(st, at)}K k=1. Define value functions V (k) θ,ξ and robust value functions V (k) θ as follows. V (k) θ,ξ def = c(k), λθ,ξ = X s,a c(k)(s, a)λθ,ξ(s, a), V (k) θ def = max ξ Ξ V (k) θ,ξ , k = 0, 1, . . . , K. (3) Then robust constrained RL is formulated as the following constrained policy optimization problem. min θ Θ V (0) θ , s.t. V (k) θ τk for all k = 1, . . . , K, (4) where τk R is the safety threshold, and V (k) θ τk means that the safety constraints V (k) θ,ξ τk holds for any environmental parameter ξ Ξ. Proposition 1. The robust constrained RL problem (4) is a special case of our proposed robust RL with general utility (2) using the following convex utility function f. f(λ) = c(0), λ , if c(k), λ τk for all k = 1, . . . , K + , otherwise . (5) After removing the safety constraints, robust constrained RL reduces to an important special case called robust RL (formulated as minθ Θ maxξ Ξ c(0), λθ,ξ with linear utility function f(λ) = c(0), λ ) [36]. Furthermore, when Ξ = {bξ} for fixed bξ, robust constrained RL and robust RL reduce to constrained RL [2] and RL [41] respectively. All these examples are important special cases of our proposed robust RL with general utility based on Proposition 1. Example 3: Robust Entropy Regularized RL and Its Special Cases. 
Robust entropy regularized RL is also an important RL framework with application to imitation learning and inverse reinforcement learning which help agents learn from human experts demonstration [32, 33], and is formulated as the following minimax optimization problem. min θ Θ max ξ Ξ λθ,ξ(s, a)c(s, a) µ X λθ,ξ(s)H[πθ( |s)] , (6) where c is a cost function, λθ,ξ(s) = P a λθ,ξ(s, a) is the state occupancy measure, and H[πθ( |s)] = P a πθ(a|s) log πθ(a|s) is the entropy regularizer (with coefficient µ 0) which encourages the agent to explore more states and actions and helps to prevent early convergence to sub-optimal policies. Proposition 2. The robust entropy regularized RL problem (6) is a special case of our proposed robust RL with general utility (2) using the following convex utility function f. s,a λ(s, a) h c(s, a) + µ log λ(s, a) P When µ = 0, robust entropy regularized RL (6) reduces to robust RL [36]. When c 0 but µ > 0, robust entropy regularized RL reduces to robust pure exploration. Furthermore, when Ξ = {bξ}, robust entropy regularized RL, robust RL and robust pure exploration reduce to entropy regularized RL [10], RL [41] and pure exploration [18] respectively. All these examples are important special cases of our proposed robust RL with general utility based on Proposition 2. 2.2 Gradients for Our Robust RL with General Utility Theorem 1. The gradients of the objective function (2) for our proposed robust RL with general utility can be computed as follows. θf(λθ,ξ) = Eπθ,pξ t=0 γt f(λθ,ξ) λθ,ξ(st, at) h=0 θ log πθ(ah|sh) ξf(λθ,ξ) = Eπθ,pξ t=0 γt f(λθ,ξ) λθ,ξ(st, at) h=0 ξ log pξ(sh+1|sh, ah) We make the following standard assumptions which are also used in RL with general utility [55, 6]. Assumption 1. There exist constants lπθ, Lπθ, lpξ, Lpξ > 0 such that for all s, s S, a A, θ, θ Θ and ξ, ξ Ξ, we have θ log πθ(a|s) ℓπθ, θ log πθ (a|s) θ log πθ(a|s) Lπθ θ θ , ξ log pξ(s |s, a) ℓpξ, ξ log pξ (s |s, a) ξ log pξ(s |s, a) Lpξ ξ ξ . Assumption 2. There exist constants lλ, Lλ > 0 such that for all λ, λ S A, λf(λ) lλ and λf(λ ) λf(λ) Lλ λ λ . Proposition 3. Under Assumptions 1 and 2, the gradients (8) and (9) satisfy the following bounds for any θ, θ Θ and ξ, ξ Ξ. θf(λθ,ξ) ℓθ := ℓπθ (1 γ)2 , ξf(λθ,ξ) ℓξ := ℓpξ (1 γ)2 , (10) θf(λθ ,ξ ) θf(λθ,ξ) Lθ,θ θ θ + Lθ,ξ ξ ξ , (11) ξf(λθ ,ξ ) ξf(λθ,ξ) Lξ,θ θ θ + Lξ,ξ ξ ξ , (12) where Lθ,θ := ℓ2 πθ |S||A|) (1 γ)3 + Lπθ ℓλ (1 γ)2 , Lθ,ξ := γℓπθ ℓpξ |S| (1 γ)3 (Lλ + 2ℓλ p |S||A|), Lξ,θ := |S||A|) (1 γ)3 , Lξ,ξ := γℓ2 pξ (1 γ)3 + ℓλ(Lpξ +ℓ2 pξ |S|) In practice, the exact gradients (8) and (9) are unavailable and can only be estimated via stochastic samples. We refer the details to Appendix C as those largely follow [6]. Define the following projected gradients with stepsizes b, a > 0, which have been used to measure convergence of algorithms to stationary points of optimization [30, 5, 47] and RL problems [49, 46, 34]. G(θ) b (θ, ξ):= 1 b θ projΘ θ b θf(λθ,ξ) , G(ξ) a (θ, ξ):= 1 a projΞ ξ+a ξf(λθ,ξ) ξ (13) 3 Gradient Convergence for Convex Utility Assumption 3. The utility function f(λ) is convex. Robust RL with convex utility functions f subsumes many important special cases, including robust constrained RL, robust entropy regularized RL, constrained RL, robust RL, RL, pure exploration, etc., as shown in Examples 2 and 3 in Section 2.1. 
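As a concrete instance of Assumption 3 and of the stationarity measures just defined, the sketch below implements the entropy-regularized utility of Proposition 2 (Eq. (7)) and the projected gradients of Eq. (13). The projection operators `proj_Theta` and `proj_Xi` are placeholders for whatever projections the chosen parameterizations of Θ and Ξ require.

```python
import numpy as np

def entropy_regularized_utility(lam, cost, mu):
    """Convex utility of Eq. (7): sum_{s,a} lambda(s,a) * [c(s,a) + mu * log(lambda(s,a) / lambda(s))]."""
    lam_s = lam.sum(axis=1, keepdims=True)                      # state occupancy lambda(s)
    ratio = np.where(lam > 0, lam / np.maximum(lam_s, 1e-12), 1.0)
    return float(np.sum(lam * (cost + mu * np.log(ratio))))

def projected_gradients(theta, xi, grad_theta, grad_xi, b, a, proj_Theta, proj_Xi):
    """Stationarity measures of Eq. (13) for the minimax problem (2)."""
    G_theta = (theta - proj_Theta(theta - b * grad_theta)) / b  # projected gradient in theta (descent)
    G_xi = (proj_Xi(xi + a * grad_xi) - xi) / a                 # projected gradient in xi (ascent)
    return G_theta, G_xi
```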
Partially inspired by the gradient descent ascent (GDA) algorithm [31] for nonconvex-concave minimax optimization, we design the projected stochastic GDA algorithm (Algorithm 1) with two phases to solve robust RL with convex utility. The first phase (called original phase) can be seen as projected stochastic GDA algorithm on the original objective function f. Specifically, in the k-th the outer loop with fixed ξk, the inner loop applies T projected stochastic gradient descent steps (14) to obtain θk which converges to the global solution of Φ(ξk) := minθ Θ f(λθ,ξk) as f is convex. Then, we update ξk using the projected stochastic gradient ascent step (15). However, the output eξ of the first phase only converges to a stationary point of the following the envelope function eΦ 1. eΦ(ξ) := max ξ Ξ Φ(ξ ) Lξ,ξ ξ ξ 2 . (18) To converge to a stationary point of f, we add the second phase (called corrected phase) which applies projected stochastic GDA to the following corrected objective. min θ Θ max ξ Ξ ef(θ, ξ) := f(λθ,ξ) Lξ,ξ ξ eξ 2. (19) 1The convergence rate of eΦ(eξ) is proved in [31] when f(λθ,ξ) is a convex function of θ, and will be proved in Appendix N.2 for our Theorem 2 when f(λ) is convex. Algorithm 1 Projected Stochastic Gradient Descent Ascent Algorithm For Convex Utility 1: Hyperparameters: K, T, K , T , α, β, a, b, Lξ,ξ, {m(k) λ , H(k) λ , m(k) θ , H(k) θ }4 k=1. 2: Initialize: ξ0 Ξ, θ0, θK Θ. # Begin original phase to solve the original optimization problem (2) 3: for Iterations k = 0, 1, . . . , K 1 do 4: Initialize θk,0 θ0. 5: for Inner steps t = 0, 1, . . . , T 1 do 6: Obtain g(θ) k,t θf(λθk,t,ξk) by Algorithm 3 with hyperparameters m(1) λ , H(1) λ , m(1) θ , H(1) θ . 7: Apply the following projected stochastic policy gradient descent step. θk,t+1 = projΘ θk,t αg(θ) k,t . (14) 8: end for 9: Assign θk θk,T . 10: Obtain g(ξ) k ξf(λθk,ξk) by Algorithm 3 with hyperparameters m(2) λ , H(2) λ , m(2) ξ , H(2) ξ . 11: Apply the following projected stochastic gradient descent step. ξk+1 = projΞ ξk + βg(ξ) k . (15) 12: end for 13: Obtain eξ from {ξk}K 1 k=0 uniformly at random. # Begin corrected phase to solve the corrected optimization problem (19) 14: for Iterations k = K, K + 1, . . . , K + K 1 do 15: Initialize ξk,0 eξ. 16: for Inner steps t = 0, 1, . . . , T 1 do 17: Obtain g(ξ) k,t ξf(λθk,ξk,t) by Algorithm 3 with hyperparameters m(3) λ , H(3) λ , m(3) ξ , H(3) ξ . 18: Apply the following projected stochastic gradient ascent step. ξk,t+1 = projΞ ξk,t + a g(ξ) k,t 2Lξ,ξ(ξk,t eξ) . (16) 19: end for 20: Assign ξk ξk,T . 21: Obtain g(θ) k θf(λθk,ξk) by Algorithm 3 with hyperparameters m(4) λ , H(4) λ , m(4) θ , H(4) θ . 22: Apply the following projected stochastic gradient descent step. θk+1 = projΘ θk bg(θ) k . (17) 23: end for 24: Output: (θek, ξek) where ek is obtained from {K, K + 1, . . . , K + K 1} uniformly at random. The convergence analysis of Algorithm 1 is challenging largely because f(λθ,ξ) is only a convex function of λθ,ξ not of θ. To tackle this challenge for non-robust convex RL with fixed ξ, [54] assumed that a global Lipschitz continuous inverse mapping from λθ,ξ to θ exists. [55, 6] relaxed this assumption to the following assumption of local inverse mapping, which covers the popular direct policy parameterization πθ(a|s) = θs,a [6] and softmax policy parameterization πθ(a|s) = exp(θs,a) P a exp(θs,a ) (see Proposition 8 for the proof). Assumption 4 (Local Invertibility of λ ,ξ). 
There exists constants ℓλ 1 > 0 and δ (0, 1) such that for any fixed θ Θ and ξ Ξ, the occupancy measure (1) satisfies: 1. There exists sets Uθ,ξ Θ and Vλθ,ξ S A that contain θ and λθ,ξ respectively, such that λθ,ξ : Uθ,ξ Vλθ,ξ is a bijection. Its inverse denoted as λ 1 θ,ξ is ℓλ 1-Lipscthiz. 2. There exists at least one optimal policy θ (ξ) arg minθ Θ f(λθ ,ξ) such that for any δ [0, δ], (1 δ)λθ,ξ + δλθ (ξ),ξ Vλθ,ξ. Proposition 4 (Projected Gradient Dominance for Convex Utility). Under Assumptions 1-4, the utility function f satisfies the following gradient dominance property for any θ Θ and ξ Ξ. f(λθ,ξ) min θ Θ f(λθ ,ξ) 2ℓλ 1 βLθ,θ + 1 + βℓθ G(θ) β (θ, ξ) . (20) Remark: Proposition 4 indicates that the function f(λ ,ξ) is projected gradient dominant for convex utility function f, which is important in the convergence analysis of Algorithm 1. Our Proposition 4 is stronger than Lemma F.7 of [6], a similar gradient dominance property for convex RL which requires assumption of positive definite Fisher information matrix, involves bias in the error term and focuses on unconstrained optimization with softmax parameterized policy (a special of our general parameterized policy with constrained variable θ Θ). Technical Novelty. In our proof, to tackle the constraint θ Θ which is more challenging than the unconstrained case Θ = R|S||A| in [55, 6], we apply Assumption 4 to θ := θ βG(θ) β (θ, ξ) not to the obvious choice θ, which yields θδ Θ for any δ [0, δ] such that λθδ,ξ = (1 δ)λθ ,ξ +δλθ (ξ),ξ Vλθ,ξ. Then θf(λθ ,ξ) (θδ θ ) (i) θf(λθ ,ξ) θf(λθ,ξ)+G(θ) β (θ, ξ) (θδ θ ) (ii) O[δ G(θ) β (θ, ξ) ], where (i) uses the projection property (θδ θ ) [G(θ) β (θ, ξ) θf(λθ,ξ)] 0 and (ii) uses θδ θ O(δ). The above bound implies Eq. (20) since f is convex and ℓθ-Lipscthiz. Assumption 5. Ξ is convex and compact with diameter DΞ := maxξ,ξ Ξ ξ ξ > 0. Assumption 5 holds for the commonly used direct kernel parameterization pξ(s |s, a) = ξ(s, a, s ) (for all s, s S and a A)[42, 28, 26, 17] and Ξ defined a compact neighborhood around a nominal transition kernel parameter bξ. We show the gradient convergence result of Algorithm 1 by the following theorem and demonstrate the gradient convergence by the experiments in Appendix A. Theorem 2 (Gradient Convergence for Convex Utility). Suppose Assumptions 1-5 hold. For any precision 0 < ϵ 48Lξ,ξ 2ℓλ 1 Lθ,θ + 4e L + ℓθ , we can always find proper hyperparameter values of Algorithm 1 (see Eqs. (127)-(150) in Appendix N.6 for these hyperparamter values) such that the algorithm output (θek, ξek) is an ϵ-close to a stationary point, that is, E[ G(θ) b (θek, ξek) 2] ϵ2 and E[ G(ξ) a (θek, ξek) 2] ϵ2 with projected gradients G(θ) b and G(ξ) a defined in Eq. (13). The number of required stochastic samples is O log2[(1 γ) 1ϵ 1] (1 γ)25ϵ10 . Proof Sketch of Theorem 2 and Technical Novelty. Inspired by Appendix D of [31], eξ from the first phase satisfies E eΦ(eξ) 2 0 (see Appendix N.2). Then, ξk := ξk,T from the inner update (16) of the second phase converges to the unique maximizer (denoted as ξ k) of the Lξ,ξconcave function ef(θk, ) as T + (see Appendix N.3). This means the update step (17) is approximately projected gradient descent for minθ Θ eΨ(θ), which yields the convergence rate of E G(θ) b (θek, ξek) 2 (see Appendix N.4). However, the biggest challenge is to obtain the convergence rate of E G(ξ) a (θek, ξek) 2 (see Appendix N.5), which corresponds to ξf while the second corrected phase aims at the corrected objective ef. 
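For reference while following this argument, the corrected objective (19) differs from f only through the quadratic term, so ∇ξ f̃(θ, ξ) = ∇ξ f(λθ,ξ) − 2Lξ,ξ(ξ − ξ̃), which is exactly the shift used by the corrected-phase ascent step (16). A minimal sketch of that step follows; `grad_xi_est` stands for the stochastic gradient oracle of Algorithm 3 and `proj_Xi` for the projection onto Ξ (both are placeholders).

```python
def corrected_ascent_step(xi, xi_tilde, theta, a, L_xixi, grad_xi_est, proj_Xi):
    """One inner update (16) of the corrected phase of Algorithm 1: projected stochastic
    gradient ascent on f_tilde(theta, xi) = f(lambda_{theta, xi}) - L_xixi * ||xi - xi_tilde||^2."""
    g = grad_xi_est(theta, xi)                        # estimate of grad_xi f(lambda_{theta, xi})
    g_corrected = g - 2.0 * L_xixi * (xi - xi_tilde)  # gradient of the corrected objective (19)
    return proj_Xi(xi + a * g_corrected)
```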
To show that ξ ef(θk, ξk) ξf(θk, ξk), note that ξ ef(θk, ξk) ξf(θk, ξk) = 2Lξ,ξ(ξk eξ) and that eΦ(eξ) = 2Lξ,ξ[ξ (eξ) eξ] 0 (already proved) where ξ (eξ) is the unique maximizer of Φ(ξ ) Lξ,ξ ξ eξ 2, a strongly concave function of ξ in Eq. (18). Hence, it suffices to show ξk ξ (eξ). Note that (θk, ξk) is an approximate Nash equilibrium of ef, i.e., ξk ξ k := arg maxξ Ξ ef(θk, ξ) (proved above) and θk arg minθ Θ ef(θ, ξk) (derived below). E[ ef(θk, ξk) min θ Θ ef(θ , ξk)] = E[f(λθk,ξk) min θ Θ f(λθ ,ξk)] (i) O(E G(θ) β (θ, ξ) ) O(ϵ), where (i) uses Proposition 4. Hence, based on the property of Nash equilibrium, we have ξk arg maxξψ(ξ) = ξ (eξ) where ψ(ξ) := minθ Θ ef(θ, ξ) = Φ(ξ) Lξ,ξ ξ eξ 2. 4 Global Convergence on Polyhedral Ambiguity Set This section aims to obtain a global optimal policy θ that minimizes the robust utility Γ(θ) def = maxξ Ξ f(λθ,ξ). This maximization is challenging for convex utility f. In contrast, global conver- gence results have been obtained without such challenge in some important special cases, including convex RL with fixed ξ [54, 51, 6] and robust RL where linear utility f is amenable to both minθ Θ and maxξ Ξ [36, 45, 42, 23, 26]. Fortunately, we will show that by using the popular s-rectangular polyhedral ambiguity set Ξ, arg maxξ Ξf(λθ,ξ) always exists among the finitely many vertices of Ξ. 4.1 S-rectangular Polyhedral Ambiguity Set In this subsection, we will introduce the popular s-rectangular polyhedral ambiguity set, and derive its important propositions for designing globally converged algorithm. Fhe global convergence is generally NP-hard, even for the important special case called robust RL with linear utility, [45]. A common practice to make the problem tractable is to use direct kernel parameterization pξ(s |s, a) = ξ(s, a, s ) [42, 28, 26, 17] and assume the ambiguity set Ξ to satisfy some certain rectangularity conditions, such as s-rectangularity defined below [45, 42, 23, 26]. Assumption 6. We use direct kernel parameterization and assume that Ξ is s-rectangular, i.e., Ξ = s SΞs := {ξ ( S)S A : ξ(s, , ) Ξs, s S}, a Cartesian product of Ξs ( S)A. Proposition 5 (Local Invertibility of λθ, ). Suppose Assumption 6 holds and Ξ is a convex set. For any θ Θ, ξ0, ξ1 Ξ and δ [0, 1], define the following kernel parameters ξδ ( S)S A. ξδ(s, a, s ) = arbitrary as long as ξδ(s, a, ) S, if λθ,ξ0(s)=λθ,ξ1(s)=0 δλθ,ξ1(s)ξ1(s, a, s )+(1 δ)λθ,ξ0(s)ξ0(s, a, s ) δλθ,ξ1(s)+(1 δ)λθ,ξ0(s) , otherwise , (21) where λθ,ξ(s) := P a A λθ,ξ(s, a) for any s S, θ Θ and ξ Ξ. Then ξδ Ξ and its corresponding occupancy measure is λθ,ξδ = δλθ,ξ1 + (1 δ)λθ,ξ0. Remark: Proposition 5 indicates that the mapping from ξ to λθ,ξ is locally invertible for srectangular set Ξ, which is important to solve the aforementioned challenge that convex utility is not amenable for maxξ Ξ f(λθ,ξ). This role is similar to that played by the local invertibility assumption (Assumption 4) for policy θ. To our knowledge, Proposition 5 has never been obtained in the existing literature. Assumption 7. Under Assumption 6, for every s S, Ξs is a polyhedron spanned by a finite set of vertices V (Ξs):={ξ(s) m }Ms m=1 Ξs, i.e., Ξs = PMs m=1 νmξ(s) m : νm 0, PMs m=1 νm = 1 . Polyhedral ambiguity set defined by Assumption 7 includes the widely used s-rectangular L1 and L ambiguity sets, defined as Ξ = {ξ ( S)S A : ξ(s, :, :) bξ(s, :, :) p αs, s S} for p {1, } respectively [7, 19, 16], where bξ Ξ is the nominal transition kernel usually obtained via empirical estimation. 
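For illustration, the sketch below builds the nominal kernel by empirical estimation and checks membership in the s-rectangular L1/L∞ ambiguity set just described, reading the norm ‖ξ(s, :, :) − ξ̂(s, :, :)‖p as a norm of the flattened A×|S| block; that reading and the function names are assumptions made for this sketch, not definitions from the paper.

```python
import numpy as np

def empirical_nominal_kernel(transitions, S, A):
    """Nominal kernel xi_hat(s, a, s') estimated from a list of observed (s, a, s') transitions."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Fall back to a uniform next-state distribution for unvisited (s, a) pairs.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / S)

def in_ambiguity_set(xi, xi_nominal, alpha, p):
    """Check xi in Xi = {xi : ||xi(s,:,:) - xi_hat(s,:,:)||_p <= alpha[s] for all s}, p in {1, inf},
    with every row xi(s, a, .) required to be a probability distribution over next states."""
    valid_rows = np.all(xi >= 0) and np.allclose(xi.sum(axis=2), 1.0)
    within_radius = all(
        np.linalg.norm((xi[s] - xi_nominal[s]).ravel(), ord=p) <= alpha[s] + 1e-12
        for s in range(xi.shape[0])
    )
    return valid_rows and within_radius
```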
On polyhedral ambiguity set, the optimal kernels arg maxξ Ξf(λθ,ξ) can always be obtained at the vertices of Ξ, as shown below. Proposition 6. Under Assumptions 3, 6 and 7, for any θ Θ, we have maxξ Ξ f(λθ,ξ) = maxξ V (Ξ) f(λθ,ξ), where V (Ξ) = s SV (Ξs) is the vertex set. Technical Novelty. Suppose a non-vertex kernel ξ arg maxξ Ξf(λθ,ξ)/V (Ξ) is optimal. Since f is convex, if λθ,ξ is a convex combination of λθ,ξ 1 and λθ,ξ(ϵ) for some ξ1, ξ0 Ξ (corresponding to ξ 1, ξ(ϵ) respectively in the proof in Appendix I), then ξ1, ξ0 are also optimal. Ideally, if ξ1 V (Ξ) or ξ0 V (Ξ), the proof is done. However, this is not guaranteed since in Proposition 5 and Assumption 6, the convex combination coefficients differ among the states s S. To solve this challenge, it suffices to find such optimal ξ1 that differs from ξ at only one state s such that the non-vertex ξ (s) / V (Ξs) is replaced with vertex ξ1(s) V (Ξs). Then we can conduct such change from non-vertex to vertex for only one state s at a time until the kernel becomes vertex at every state, while keeping the optimality all the way. To find such ξ1(s), note that on polyhedral set Ξs, there always exist ξ1(s) V (Ξs) and ξ0(s) Ξs such that the non-vertex point ξ (s) is a convex combination of ξ1(s) and ξ0(s), while ξ (s ) = ξ1(s ) = ξ0(s ) for any s = s. Hence, there exists δ [0, 1] such that ξ = ξδ defined by Proposition 5, which implies that λθ,ξ is a convex combination of λθ,ξ 1 and λθ,ξ(ϵ). 4.2 Globally Converged Algorithm Algorithm 2 Globally Converged Algorithm for Convex Utility on Polyhedral Ambiguity Set 1: Hyperparameters: K, {σk, ϵk, βk}K 1 k=0 . 2: Initialize: θ0 Θ. 3: for Iterations k = 0, 1, . . . , K 1 do 4: Calculate λθk,ξ, f(λθk,ξ) and θf(λθk,ξ) for all ξ V (Ξ). 5: Select near-optimal vertices Ξk :={ξ V (Ξ) : f(λθk,ξ) maxξ V (Ξ) f(λθk,ξ ) σk}. 6: Find d k B1 := {d RdΘ : d 1} such that Ak(d k) mind B1 Ak(d) + ϵk. (Ak is defined in Eq. (23). One way to solve mind B1 Ak(d) is to apply projected subgradient method (28) and obtain d k arg maxd {dk,t:0 t T }Ak(d).) 7: Let dk := d k/ d k and obtain θk+1 by Eq. (22). 8: end for 9: Output: (θK, ξK) where ξK arg maxξ V (Ξ)f(λθK,ξ). The original objective (2) is equivalent to the minimization problem minθ Θ Γ(θ), where Γ(θ) := maxξ V (Ξ) f(λθ,ξ) with finite vertex set V (Ξ) based on Proposition 6. A natural choice to solve this minimization problem is the following policy update rule (for simplicity we consider the unconstrained policy space Θ = RdΘ as in [55, 6]). θk+1 = θk βkdk, (22) where βk > 0 is the stepsize and dk is a unit descent direction of Γ(θk). Subgradient descent method seems an obvious choice for dk which aligns with the direction of a subgradient θf(λθk,ξk) where ξk arg maxξ V (Ξ)f(λθk,ξ). However, the convergence analysis of subgradient descent method [11] requires the convexity of f(λ ,ξk) which does not hold in our setting, and the function value is not monotonically decreasing. To solve these challenges, we design Algorithm 2 which selects near-optimal vertices Ξk := {ξ V (Ξ) : f(λθk,ξ) maxξ V (Ξ) f(λθk,ξ ) σk} with a certain threshold σk > 0 and obtains dk by solving the convex optimization problem mind B1 Ak(d) up to precision ϵk > 0, where Ak(d) below denotes effective descent of Γ(θk) along the direction d. Ak(d) := max ξ Ξk θf(λθk,ξ) d , d B1 := {d RdΘ : d 1}. 
(23) Here we only care about the near-optimal vertices in Ξk V (Ξ) because for any worse vertices ξ V (Ξ)/Ξk, f(θk, ξ) < maxξ V (Ξ) f(λθk,ξ ) σk implies f(θk+1, ξ) < maxξ V (Ξ) f(λθk+1,ξ ) for appropriate σk > 0. This means such worse ξ can not affect the optimization progress Γ(θk) Γ(θk+1). Hence, by solving mind B1 Ak(d), we can obtain a direction dk in which all the potentially effective function values {f(λθk,ξ)}ξ Ξk have uniformly large amount of descent f(λθk,ξ) dk. To analyze the global convergence of Algorithm 2, we want to guarantee sufficient descent Γ(θk) Γ(θk+1) whenever θk is not close to optimal. It suffices to slightly alter Assumption 4 as follows. Assumption 8. A variant of Assumption 4 holds which replaces the non-robust optimal policy θ (ξ) with a robust optimal policy θ arg minθ Θ Γ(θ) and shrinks the range from ξ Ξ to ξ V (Ξ). Remark: Assumption 8 is no stronger than Assumption 4 and also covers the popular direct policy parameterization. Also, Assumption 8 guarantees that from any policy θ Θ, there exists a partial curve {θδ : δ [0, δ]} towards a robust optimal policy θ such that λθδ,ξ = (1 δ)λθ,ξ + δλθ ,ξ, so we can utilize convexity of f and obtain the following important sufficient descent property. Proposition 7 (Sufficient Descent on Polyhedral Ambiguity Set). Under Assumptions 1-3 and 8, at any θ Θ := RdΘ, there exists a unit descent direction d ( d = 1) such that f(λθ,ξ) f(λθ ,ξ) 2ℓλ 1 θf(λθ,ξ) d +, ξ Ξ (24) where θ arg minθ Θ Γ(θ) is given by Assumption 8 and x+ := max(x, 0) for any x R. Remark: d in Proposition 7 is a good descent direction since whenever the function value gap f(λθ,ξ) f(λθ ,ξ) > 0, it is dominated by the gradient descent amount θf(λθ,ξ) d > 0. Unlike existing gradient dominance properties for robust RL [42, 26, 17], f(λθ,ξ) f(λθ ,ξ) 0 is possible so we use [ ]+ to cover all cases. This brings challenge and thus novel techniques to obtain the first global convergence result of our robust RL with general convex utility as follows. Theorem 3 (Global Convergence for Convex Utility on Polyhedral Ambiguity Set). Implement Algorithm 2 with βk = 2 2ℓλ k+2 , σk = 4 2ℓθℓλ k+2 and any ϵk > 0. Then under Assumptions 1-3, 6-8, the algorithm output θK has the following global convergence rate. Γ(θK) min θ Θ Γ(θ ) 2ℓλ 1 max 1 k K ϵk + 4ℓλ 1 K + 1(ℓλ 1Lθ,θ + 2 Remark: The convergence rate O(1/K) matches the state-of-the-art of policy gradient type methods for robust RL [26], while the error term ϵk results from solving the convex optimization problem mind B1 Ak(d) in line 6 of Algorithm 2. Technical Novelty. Applying Proposition 7 to Algorithm 2 with σk = 2βkℓθ, we have k := Γ(θk) min θ Θ Γ(θ ) 2ℓλ 1[ϵk Ak(d k)] + + 2βkℓθ. (26) To overcome the main difficulty caused by [ ]+ above, we analyze each k-th iteration in 2 cases Ak(d k) 0 and Ak(d k) < 0. If Ak(d k) 0, then k 2ℓλ 1ϵk + 2βkℓθ and thus k+1 2ℓλ 1ϵk + 3βkℓθ; If Ak(d k) < 0, then in Eq. (26) we replace Ak(d k) with Ak(dk) Ak(d k) < 0 and remove [ ]+. This along with Γ(θk+1) Γ(θk) βk Ak(dk) + Lθ,θ 2 β2 k (by smoothness) implies k+1 k k + 2 k + O h ϵk k + 2 + 1 (k + 2)2 Then we obtain the rate (25) in 3 cases: If Ak(d k) < 0 for all k = 0, 1, . . . , K 1, iterate Eq. (27) from 0; If AK 1(d K 1) 0, K 2ℓλ 1ϵK + σk; If AK 1(d K 1) 0 while Ak(dk) < 0 for all k = K , . . . , K 1, iterate Eq. (27) from K 2ℓλ 1ϵK 1 + 3βK 1ℓθ. Algorithm 2 involves convex optimization problems mind B1 Ak(d), which can be solved via the following projected subgradient method for t = 0, 1, . . . , T 1. 
dk,t+1 proj B1[dk,t α θf(λθk,ξk,t)], where ξk,t arg maxξ Ξk θf(λθk,ξ) dk,t. (28) The best direction d k arg maxd {dk,t:0 t T }Ak(d) from the above subgradient method achieves ϵk accuracy within T = O(ϵ 2 k ) steps [11], which yields the following complexity result. Corollary 1. Under the conditions of Theorem 3, for any ϵ > 0, implement Algorithm 2 with K = 8ℓλ 1ϵ 1(ℓλ 1Lθ,θ + 2 2ℓθ) iterations and T = 36ℓ2 λ 1ℓ2 θϵ 2 subgradient descent updates (28) with stepsize α = ϵ 3ℓλ 1ℓ2 θ to obtain d k. Then the output θK achieves Γ(θK) minθ Θ Γ(θ ) ϵ. Finally, we can prove that all these Assumptions 1-8 required by our convergence results (Theorems 2 and 3) can be satisfied by the following examples. Proposition 8. Assumptions 1-8 are all satisfied if we use the following choices: Softmax policy parameterization πθ(a|s) = exp(θs,a) P a exp(θs,a ), where θ Θ = [ R, R]|S| |A| for some constant R > 0 to prevent πθ(a|s) from approaching 0. Direct kernel parameterization pξ(s |s, a) = ξs,a,s with s-rectangular L1 or L ambiguity sets defined as Ξ = {ξ ( S)S A : ξ(s, :, :) bξ(s, :, :) p αs, s S} for p {1, } respectively, where the fixed nominal kernel bξ satisfies bξ(s, a, s ) > αs, s, a, s to prevent pξ(s |s, a) from approaching 0. The utility function f(λ) defined in Eq. (7) for robust entropy regularized RL and its special cases, within the range λ Λ = {λθ,ξ : θ Θ, ξ Ξ} for the domains Θ and Ξ selected above. 5 Conclusion In this work, we propose robust RL with general utility, the first learning framework that obtains a robust policy for RL with general utility. We propose a stochastic policy gradient type algorithm for convex utilities and obtains its sample complexity result for gradient convergence. Furthermore, for convex utility on polyhedral ambiguity set, we propose an alternative policy gradient type algorithm and obtain its global convergence rate. Note that this globally converged algorithm requires enumeration among many vertices, and thus it is an important future direction to reduce enumeration by utilizing structural properties. In addition, to extend the results to large or continuous state-action space is also an interesting direction. Acknowledgments This work was partially supported by NSF IIS 2347592, 2347604, 2348159, 2348169, DBI 2405416, CCF 2348306, CNS 2347617. [1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International conference on machine learning, pages 22 31, 2017. [2] Eitan Altman. Constrained Markov decision processes. Routledge, 2021. [3] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469 483, 2009. [4] Kishan Panaganti Badrinath and Dileep Kalathil. Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In International Conference on Machine Learning, pages 511 520, 2021. [5] Maxim Viktorovich Balashov. The gradient projection algorithm for a proximally smooth set and a function with lipschitz continuous gradient. Sbornik: Mathematics, 211(4):481, 2020. [6] Anas Barakat, Ilyas Fatkhullin, and Niao He. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space. In International Conference on Machine Learning, 2023. [7] Bahram Behzadian, Marek Petrik, and Chin Pang Ho. Fast algorithms for l -constrained srectangular robust mdps. 
In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2021. [8] Vivek S Borkar and Sean P Meyn. Risk-sensitive optimal control for markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192 209, 2002. [9] Winfried Bruns and Joseph Gubeladze. Polytopes, rings, and K-theory. Springer Science & Business Media, 2009. [10] Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563 2578, 2022. [11] Yuxin Chen. Subgradient methods. https://yuxinchen2020.github.io/ele522 _optimization/lectures/subgradient_methods.pdf, 2020. [12] Jerzy A Filar, Lodewijk CM Kallenberg, and Huey-Miin Lee. Variance-penalized markov decision processes. Mathematics of Operations Research, 14(1):147 161, 1989. [13] Matthieu Geist, Julien Pérolat, Mathieu Laurière, Romuald Elie, Sarah Perrin, Oliver Bachem, Rémi Munos, and Olivier Pietquin. Concave utility reinforcement learning: The mean-field game viewpoint. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 489 497, 2022. [14] Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), volume 29, 2016. [15] Julien Grand-Clément and Christian Kroer. Scalable first-order methods for robust mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12086 12094, 2021. [16] Julien Grand-Clément and Marek Petrik. Reducing blackwell and average optimality to discounted mdps via the blackwell discount factor. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2023. [17] Etash Kumar Guha and Jason D Lee. Solving robust mdps through no-regret dynamics. Ar Xiv:2305.19035, 2023. [18] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681 2691, 2019. [19] Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration for l1-robust markov decision processes. Journal of Machine Learning Research, 22(275):1 46, 2021. [20] Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280, 2005. [21] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237 285, 1996. [22] Lodewijk CM Kallenberg. Survey of linear programming for standard and nonstandard markovian control problems. part i: Theory. Zeitschrift für Operations Research, 40:1 42, 1994. [23] Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Levy, and Shie Mannor. Policy gradient for rectangular robust markov decision processes. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2023. [24] Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. Efficient policy iteration for robust markov decision processes via regularization. Ar Xiv:2205.14327, 2022. [25] Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. An efficient solution to srectangular robust markov decision processes. Ar Xiv:2301.13642, 2023. [26] Navdeep Kumar, Ilnura Usmanova, Kfir Yehuda Levy, and Shie Mannor. Towards faster global convergence of robust policy gradient methods. 
In Sixteenth European Workshop on Reinforcement Learning, 2023. [27] Navdeep Kumar, Kaixin Wang, Kfir Levy, and Shie Mannor. Policy gradient for reinforcement learning with general utilities. Ar Xiv:2210.00991, 2023. [28] Mengmeng Li, Tobias Sutter, and Daniel Kuhn. Policy gradient algorithms for robust mdps with non-rectangular uncertainty sets. Ar Xiv:2305.19004, 2023. [29] Yan Li, Guanghui Lan, and Tuo Zhao. First-order policy optimization for robust markov decision process. Ar Xiv:2209.10579, 2023. [30] Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), pages 5569 5579, 2018. [31] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083 6093, 2020. [32] Tien Mai and Patrick Jaillet. Robust entropy-regularized markov decision processes. Ar Xiv:2112.15364, 2021. [33] Daniel J Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Yuanyuan Shi, Jackie Kay, Todd Hester, Timothy Mann, and Martin Riedmiller. Robust reinforcement learning for continuous control with model misspecification. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. [34] Weichao Mao, Lin Yang, Kaiqing Zhang, and Tamer Basar. On improving model-free algorithms for decentralized multi-agent reinforcement learning. In International Conference on Machine Learning, pages 15007 15049, 2022. [35] Sobhan Miryoosefi, Kianté Brantley, Hal Daumé, Miroslav Dudík, and Robert E Schapire. Reinforcement learning with convex constraints. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), pages 14093 14102, 2019. [36] Arnab Nilim and Laurent El Ghaoui. Robust control of markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798, 2005. [37] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3803 3810. IEEE, 2018. [38] Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust constrained-mdps: Soft-constrained robust policy optimization under model uncertainty. Ar Xiv:2010.04870, 2020. [39] Stefan Schaal. Learning from demonstration. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), volume 9, 1996. [40] Zhongchang Sun, Sihong He, Fei Miao, and Shaofeng Zou. Constrained reinforcement learning under model mismatch. Ar Xiv:2405.01327, 2024. [41] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. [42] Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy gradient in robust mdps with global convergence guarantee. In Proceedings of the International Conference on Machine Learning (ICML), volume 202, pages 35763 35797, 23 29 Jul 2023. [43] Yue Wang, Fei Miao, and Shaofeng Zou. Robust constrained reinforcement learning. Ar Xiv:2209.06866, 2022. [44] Yue Wang and Shaofeng Zou. Policy gradient method for robust reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 23484 23526, 2022. [45] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. 
[46] Lin Xiao. On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1 36, 2022. [47] Tesi Xiao, Krishna Balasubramanian, and Saeed Ghadimi. A projection-free algorithm for constrained stochastic multi-level composition optimization. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2022. [48] Huan Xu and Shie Mannor. Parametric regret in uncertain markov decision processes. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 3606 3613. IEEE, 2009. [49] Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. [50] Rui Yuan, Robert M Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. In International Conference on Artificial Intelligence and Statistics, pages 3332 3380, 2022. [51] Tom Zahavy, Brendan O Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is enough for convex mdps. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), volume 34, pages 25746 25759, 2021. [52] Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel. Cautious reinforcement learning via distributional risk in the dual domain. Ar Xiv:2002.12475, 2020. [53] Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel. Multi-agent reinforcement learning with general utilities via decentralized shadow reward actor-critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9031 9039, 2022. [54] Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvári, and Mengdi Wang. Variational policy gradient method for reinforcement learning with general utilities. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), pages 4572 4583, 2020. [55] Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi Wang. On the convergence and sample efficiency of variance-reduced policy gradient method. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2021. [56] Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Natural actor-critic for robust reinforcement learning with function approximation. In Proceedings of the International Conference on Neural Information Processing Systems (Neurips), 2023. Table of Contents A Experiments 15 B Supporting Lemmas 16 C Stochastic Gradients 19 D Proof of Proposition 1 20 E Proof of Proposition 2 21 F Proof of Proposition 3 21 G Proof of Proposition 4 26 H Proof of Proposition 5 26 I Proof of Proposition 6 27 J Proof of Proposition 7 28 K Proof of Proposition 8 29 K.1 Proof of Assumptions 1, 2 and 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 29 K.2 Proof of Assumptions 5, 6 and 7 about ambiguity set Ξ . . . . . . . . . . . . . . 30 K.3 Proof of Assumptions 4 and 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 L Proof of Proposition 9 33 M Proof of Theorem 1 35 N Proof of Theorem 2 36 N.1 Convergence Rate of Inner Update Step (14) of the First Original Phase . . . . . 36 N.2 Convergence Rate of E[ eΦ(eξ) 2] from the First Original Phase . . . . . . . . 37 N.3 Convergence of the Inner Update Step (16) of the Second Corrected Phase . . . . 39 N.4 Convergence Rate of E[ G(θ) b (θek, ξek) 2] . . . . . . . . . . 
. . . . . . . . . . . 40 N.5 Convergence Rate of E[ G(ξ) a (θek, ξek) 2] . . . . . . . . . . . . . . . . . . . . . 42 N.6 Substituting Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 O Proof of Theorem 3 46 O.1 Analyze the k-th Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 O.2 Obtain the Convergence Rate (25) . . . . . . . . . . . . . . . . . . . . . . . . . 47 P Proof of Corollary 1 48 A Experiments In this section, we present simulation results of Algorithm 1 for convex utility. Simulation Setting. We choose S = {1, 2, , S} with S = 10 states and A = {1, 2, , A} with A = 5 actions. The discount factor is γ = 0.95 and we select uniform distribution as the 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Sample Complexity 1e7 b ( k, k) 2 + G( ) a ( k, k) 2 Figure 1: Numerical Experimental Result (the green vertical line denotes the transition from Phase I to Phase II of Algorithm 1). initial state distribution ρ. To optimize the objective function (2), we apply direct parameterization to policy parameter θs,a = π(a|s) Θ = ( A)S and transition kernel parameter ξs,a,s = p(s |s, a) ( S)S A. In order to preserve ξ(:, :, s ) S, We select nominal kernel ξ( , , s ) as |10+εs | P s |10+εs |, where εs i.i.d N(0, 1) for each s S. Then we select sufficiently small radius r = 0.01 < mins,a,s ξs,a,s and use the L2 ambiguity set Ξ := {ξ : ξ(s, :, :) ξ(s, :, :) r} (for transition kernel) such that all ξ Ξ have all positive entries. As for the general utility function f, we use the following convex entropy function with application to exploration (Example 2.2 of [54]). min θ Θ max ξ Ξ f(λθ,ξ) := X s λθ,ξ(s) log λθ,ξ(s) (29) where λθ,ξ(s) := P a A λθ,ξ(s, a) denotes the state visitation measure for any s S, θ Θ and ξ Ξ. Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from finetuning but not from Theorem 2: K = 200, T = 25, K = 300, T = 25, α = 0.002, β = 0.001, a = 0.002, b = 0.002, Lξ,ξ = 20, m(1) λ = 15, H(1) λ = 100, m(1) θ = 15, H(1) θ = 100, m(2) λ = 15, H(2) λ = 100, m(2) ξ = 15, H(2) ξ = 100, m(3) λ = 10, H(3) λ = 100, m(3) ξ = 10, H(3) ξ = 100, m(4) λ = 10, H(4) λ = 100, m(4) θ = 10, H(4) θ = 100. Environment. The experiment is implemented on Python 3.8 on AMD EPYC-7313 CPU with 3.00GHz, which costs about 1.5 hours in total. Results. The numerical result of Algorithm 1 is shown in Figure 1. Here the y-axis is the norm of the true projected gradient q G(θ) b (θk, ξk) 2 + G(ξ) a (θk, ξk) 2 at each outer iteration k of both phases of Algorithm 1 (separated by the green vertical dashed line), and the x-axis is the sample complexity (i.e., the total number of generated samples up to iteration k). Figure 1 shows that the projected gradient decays and converges to a small value, which matches Theorem 2. B Supporting Lemmas Lemma 1. Under Assumption 1, for any s, s S, a A, θ, θ Θ, ξ, ξ Ξ, we have |πθ (a|s) πθ(a|s)| ℓπθ θ θ , |pξ (s |s, a) pξ(s |s, a)| ℓpξ ξ ξ , (30) ξpξ (s |s, a) ξpξ(s |s, a) Lpξpξ (s |s, a) + ℓ2 pξ ξ ξ . (31) Proof. Based on Assumption 1, the following inequalities holds for all s, s S, a A, θ Θ, ξ Ξ, which by Lagrange mean value theorem directly proves Eq. (30) θπθ(a|s) θ log πθ(a|s) ℓπθ, ξpξ(s |s, a) ξ log pξ(s |s, a) ℓpξ. Then we prove Eq. (31) as follows. 
ξpξ (s |s, a) ξpξ(s |s, a) = pξ (s |s, a) ξ log pξ (s |s, a) pξ(s |s, a) ξ log pξ(s |s, a) pξ (s |s, a) ξ log pξ (s |s, a) ξ log pξ(s |s, a) + |pξ (s |s, a) pξ(s |s, a)| ξ log pξ(s |s, a) (i) pξ (s |s, a)Lpξ ξ ξ + ℓpξ ℓpξ ξ ξ Lpξpξ (s |s, a) + ℓ2 pξ ξ ξ , where (i) uses Eq. (30) and Assumption 1. Lemma 2. For any θ Θ and ξ Ξ, the occupancy measure λθ,ξ defined by Eq. (1) is the unique solution to the following Bellman equation of λ R|S| |A|. λ(s , a ) = h (1 γ)ρ(s ) + γ X s,a λ(s, a)pξ(s |s, a) i πθ(a |s ), s S, a A. (32) Therefore, the state occupancy measure λθ,ξ(s) := P a A λθ,ξ(s, a) satisfies λθ,ξ(s, a) = λθ,ξ(s)πθ(a|s). (33) Proof. First, we can prove that λθ,ξ satisfies Eq. (32) as follows. λθ,ξ(s , a ) = (1 γ) t=0 γt Pπθ,pξ(st = s , at = a |s0 ρ) = πθ(a |s )(1 γ) t=0 γt Pπθ,pξ(st = s |s0 ρ) = πθ(a |s )(1 γ) h Pπθ,pξ(s0 = s |s0 ρ) + γ t=0 γt Pπθ,pξ(st+1 = s |s0 ρ) i = πθ(a |s )(1 γ) h ρ(s ) + γ s,a Pπθ,pξ(st = s, at = a|s0 ρ)pξ(s |s, a) i = πθ(a |s ) h (1 γ)ρ(s ) + γ X s,a pξ(s |s, a) t=0 γt Pπθ,pξ(st = s, at = a|s0 ρ) i = h (1 γ)ρ(s ) + γ X s,a λθ,ξ(s, a)pξ(s |s, a) i πθ(a |s ). Next, we prove the uniqueness. Suppose λ1, λ2 R|S| |A| satisfies Eq. (32). Then we have X s ,a |λ2(s , a ) λ1(s , a )| = X s ,a γπθ(a |s ) X s,a [λ2(s, a) λ1(s, a)]pξ(s |s, a) s,a |λ2(s, a) λ1(s, a)|pξ(s |s, a) s,a |λ2(s, a) λ1(s, a)|, which implies that (1 γ) P s,a |λ2(s, a) λ1(s, a)| 0 and thus λ2 = λ1, i.e., the solution to Eq. (32) is unique. Finally, we will prove Eq. (33). Note that λθ,ξ(s) = X a A λθ,ξ(s, a) (i) = (1 γ)ρ(s ) + γ X s,a λ(s, a)pξ(s |s, a), where (i) uses Eq. (32). Then Eq. (33) can be proved by substituting the above equality into Eq. (32). Lemma 3. Under Assumption 1, the occupancy measure (1) satisfies the following Lipschitz properties for any θ, θ Θ and ξ, ξ Ξ. λθ ,ξ λθ,ξ γ pξ pξ + πθ πθ |S| ξ ξ +ℓπθ p |A| θ θ 1 γ . (34) Proof. For any θ, θ Θ and ξ, ξ Ξ, we have s ,a |λθ ,ξ (s , a ) λθ,ξ(s , a )|2 γπθ (a |s ) X λθ ,ξ (s, a)pξ (s |s, a) λθ,ξ(s, a)pξ(s |s, a) + (1 γ)ρ(s ) + γ X s,a λθ,ξ(s, a)pξ(s |s, a) [πθ (a |s ) πθ(a |s )] 2i1/2 γπθ (a |s ) X s,a λθ ,ξ (s, a)[pξ (s |s, a) pξ(s |s, a)] 2 γπθ (a |s ) X s,a pξ(s |s, a)[λθ ,ξ (s, a) λθ,ξ(s, a)] 2 (1 γ)ρ(s ) + γ X s,a λθ,ξ(s, a)pξ(s |s, a) [πθ (a |s ) πθ(a |s )] 2 (iii) γ s X s (ℓpξ ξ ξ )2 s,a pξ(s |s, a)[λθ ,ξ (s, a) λθ,ξ(s, a)] 2 + s X a (ℓπθ θ θ )2 (iv) γℓpξ p |S| ξ ξ + γ s X s,a pξ(s |s, a)|λθ ,ξ (s, a) λθ,ξ(s, a)|2 + ℓπθ p |S| ξ ξ + γ λθ ,ξ λθ,ξ + ℓπθ p where (i) uses Eq. (32), (ii) uses triangular inequality, (iii) uses Lemma 1, P a πθ (a |s )2 1 and P s [(1 γ)ρ(s ) + γ P s,a λθ,ξ(s, a)pξ(s |s, a)] = 1 and (iv) uses Jensen s inequality. Then Eq. (34) can be proved by rearranging the above inequality. Lemma 4. The distance between any pair of probability vectors x, y X on finite space X has the upper bound that x y Proof. Denote d = |X| and xj, yj as the j-th entry of x, y respectively. Then j=1 x2 j + y2 j 2xjyj j=1 xj + yj = 2. Lemma 5. Under Assumptions 1-2, the projected gradients in (13) have the following properties. G(θ) β (θ, ξ) θf(λθ,ξ) ℓθ, G(ξ) β (θ, ξ) ξf(λθ,ξ) ℓξ, (35) G(θ) β (θ , ξ ) G(θ) β (θ, ξ) 1 β + Lθ,θ θ θ + Lθ,ξ ξ ξ , (36) G(ξ) α (θ , ξ ) G(ξ) α (θ, ξ) Lξ,θ θ θ + 1 α + Lξ,ξ ξ ξ . (37) Proof. The proof for G(θ) β (θ, ξ) in Eq. (35) simply follows from the contraction property of projection as follows. G(θ) β (θ, ξ) := 1 θ projΘ θ β θf(λθ,ξ) 1 θ θ β θf(λθ,ξ) = θf(λθ,ξ) . Then, θf(λθ,ξ) ℓθ by Proposition 3. The proof logic for G(ξ) β (θ, ξ) is the same. Next, we prove Eq. (36) as follows and the proof of Eq. 
(37) follows the same logic. G(θ) β (θ , ξ ) G(θ) β (θ, ξ) = 1 β projΘ[θ β θf(λθ ,ξ )] projΘ[θ β θf(λθ,ξ)] β θ θ + θf(λθ,ξ) θf(λθ ,ξ ) β + Lθ,θ θ θ + Lθ,ξ ξ ξ , (38) where (i) uses Proposition 3. Lemma 6. Suppose X Rd is a closed convex set. For any x Rd and x X, we have [x proj X (x)] [x proj X (x)] 0 (39) Proof. For any δ (0, 1], xδ := δx + (1 δ)proj X (x) belongs to the convex set X since x , proj X (x) X. Then based on the definition of projection we have 0 xδ x 2 proj X (x) x 2 xδ proj X (x) 2 2[xδ proj X (x)] [x proj X (x)] =δ2 x proj X (x) 2 2δ[x proj X (x)] [x proj X (x)]. The above inequality can be rearranged as follows [x proj X (x)] [x proj X (x)] δ 2 x proj X (x) 2, which proves Eq. (39) as δ +0. C Stochastic Gradients To get the stochastic estimation of the gradients (8) and (9), we first estimate the occupancy measure (1) as follows. bλ(τ (λ); s, a) :=1 γ h=0 γh1{s(λ) i,h = s, a(λ) i,h = a}, (40) where 1{ } is an indicator function and τ (λ) := {τ (λ) i }mλ i=1 contains mλ independent trajectories τ (λ) i := {s(λ) i,h , a(λ) i,h }Hλ 1 h=0 (i = 1, . . . , mλ) of length Hλ generated from the policy πθ and transition kernel pξ. Then the estimated cost function is bc := λf[bλ(τ (λ))]. Algorithm 3 Obtain Stochastic Gradients at (θ, ξ) 1: Input: z := (θ, ξ) Z := Θ Z. 2: Hyperparameters: mλ, Hλ, mθ, Hθ, mξ, Hξ. 3: Generate independent trajectories τ (λ) i := {s(λ) i,h , a(λ) i,h }Hλ 1 h=0 (i = 1, . . . , mλ) from πθ, pξ. 4: Obtain bλ(τ (λ); s, a) for every s, a S A by Eq. (40) with τ (λ) := {τ (λ) i }mλ i=1. 5: Obtain bc := λf[bλ(τ (λ))]. 6: Generate independent trajectories τ (θ) i := {s(θ) i,h, a(θ) i,h}Hθ 1 h=0 (i = 1, . . . , mθ) from πθ, pξ. 7: Obtain g(θ)(τ (θ), θ, ξ, bc) by Eq. (41) with τ (θ) := {τ (θ) i }mθ i=1. 8: Generate independent trajectories τ (ξ) i := {s(ξ) i,h, a(ξ) i,h}Hξ 1 h=0 (i = 1, . . . , mξ) from πθ, pξ. 9: Obtain g(ξ)(τ (ξ), θ, ξ, bc) by Eq. (42) with τ (ξ) := {τ (ξ) i }mξ i=1. 10: Output: g(θ)(τ (θ), θ, ξ, bc) θf(λθ,ξ), g(ξ)(τ (ξ), θ, ξ, bc) ξf(λθ,ξ). The stochastic gradients (8) and (9) can be approximated respectively by the following stochastic sample averaged values known as GPOMDP [50]. g(θ)(τ (θ), θ, ξ, bc) = 1 mθ t=0 γtbc(s(θ) i,t , a(θ) i,t ) h=0 θ log πθ(a(θ) i,h | s(θ) i,h) g(ξ)(τ (ξ), θ, ξ, bc) = 1 mλ t=0 γtbc(s(ξ) i,t , a(ξ) i,t ) h=0 ξ log pξ(s(ξ) i,h+1 | s(ξ) i,h, a(ξ) i,h) where τ (θ) := {τ (θ) i }mθ i=1 and τ (ξ) := {τ (ξ) i }mξ i=1 contain mθ independent trajectories τ (θ) i := {s(θ) i,h, a(θ) i,h}Hθ 1 h=0 (i = 1, . . . , mθ) and mξ independent trajectories τ (ξ) i := {s(ξ) i,h, a(ξ) i,h}Hξ 1 h=0 {s(ξ) i,Hξ} (i = 1, . . . , mξ) respectively, both generated from the policy πθ and transition kernel pξ. We summarize the procedure of obtaining the stochastic gradients (8) and (9) in Algorithm 3. These stochastic gradients approximate the true gradients with the following error bounds. Proposition 9. Under Assumptions 1 and 2, the stochastic gradients (41) and (42) have the following error bounds. Eπθ,pξ g(θ)(τ (θ), θ, ξ, bc) θf(λθ,ξ) 2 3ℓ2 πθ (1 γ)4 h L2 λ|S||A| 1 mλ + γ2Hλ + ℓ2 λ mθ + ℓ2 λ[1 + Hθ(1 γ)]2γ2Hθ i , (43) Eπθ,pξ g(ξ)(τ (ξ), θ, ξ, bc) ξf(λθ,ξ) 2 3ℓ2 pξ (1 γ)4 h L2 λ|S||A| 1 mλ + γ2Hλ + ℓ2 λ mξ + ℓ2 λ[1 + Hξ(1 γ)]2γ2Hξ i . (44) D Proof of Proposition 1 As follows, we slightly rewrite the utility function f defined in Eq. (5), by replacing λ with λθ,ξ. f(λθ,ξ) = c(0), λθ,ξ , if c(k), λθ,ξ τk for all k = 1, . . . , K + , otherwise . 
Therefore, for any θ Θ, we have max ξ Ξ f(λθ,ξ) = ( max ξ Ξ c(0), λθ,ξ , if c(k), λθ,ξ τk for all ξ Ξ and k = 1, . . . , K + , otherwise . Recalling the definition of the robust value function (3), i.e., V (k) θ def = maxξ Ξ c(k), λθ,ξ , the equation above can be rewritten as follows. max ξ Ξ f(λθ,ξ) = ( V (0) θ , if V (k) θ τk for all k = 1, . . . , K + , otherwise . Therefore, θ Θ minimizes maxξ Ξ f(λθ,ξ) if and only if θ solves the constrained robust RL problem (4), as repeated below. min θ Θ V (0) θ , s.t. V (k) θ τk for all k = 1, . . . , K. Finally, we will prove that f(λ) defined in Eq. (5) is a convex function. Note that Ak = {λ S A : c(k), λ τk} is a convex set, so A = K k=1Ak is also a convex set. Then for any λ1, λ0 S A and α [0, 1], we aim to prove that f[αλ1 + (1 α)λ0] αf(λ1) + (1 α)f(λ0). (45) If either λ1 / A or λ0 / A, then Eq. (45) obviously holds as the right side equals + . Otherwise, if λ1, λ0 A, then δλ1 + (1 δ)λ0 A as A is a convex set, and thus Eq. (45) holds with equality as proved below. f[δλ1 + (1 δ)λ0] = c(0), δλ1 + (1 δ)λ0 =δ c(0), λ1 + (1 δ) c(0), λ0 =δf(λ1) + (1 δ)f(λ0). E Proof of Proposition 2 The utility function f in Eq. (7) satisfies f(λθ,ξ) = X s,a λθ,ξ(s, a) h c(s, a) + µ log λθ,ξ(s, a) P a λθ,ξ(s, a ) λθ,ξ(s, a)c(s, a) + µ X s,a λθ,ξ(s)πθ(a|s) log πθ(a|s) λθ,ξ(s, a)c(s, a) µ X λθ,ξ(s)H[πθ( |s)] . where (i) uses λθ,ξ(s) = P a λθ,ξ(s, a) and Eq. (33) that λθ,ξ(s, a) = λθ,ξ(s)πθ(a|s), and (ii) denotes the entropy function that H[πθ( |s)] = P a πθ(a|s) log πθ(a|s). The above function is exactly the minimax objective function (6) of the robust entropy regularized RL. Finally, we will prove that f(λ) defined in Eq. (7) is a convex function. For any λ0, λ1 S A and δ [0, 1], denote λδ = δλ1 + (1 δ)λ0, λδ(s) = P a λδ(s, a) and policy πδ(a|s) = λδ(s,a) λδ(s) . Then, the convexity of f can be proved as follows. δf(λ1) + (1 δ)f(λ0) f(λδ) h δλ1(s, a) log λ1(s, a) λ1(s) + (1 δ)λ0(s, a) log λ0(s, a) λ0(s) λδ(s, a) log λδ(s, a) h δλ1(s, a) log π1(a|s) + (1 δ)λ0(s, a) log π0(a|s) [δλ1(s, a) + (1 δ)λ0(s, a)] log πδ(a|s) i h δλ1(s)π1(a|s) log π1(a|s) πδ(a|s) + (1 δ)λ0(s)π0(a|s) log π0(a|s) h δλ1(s)KL[π1( |s) πδ( |s)] + (1 δ)λ0(s)KL[π0( |s) πδ( |s)] i 0. F Proof of Proposition 3 The first formula of Eq. (10) can be proved as follows and the second formula can be proved in the same way. θf(λθ,ξ) (i) Eπθ,pξ t=0 γt|c(st, at)| h=0 θ log πθ(ah|sh) (ii) Eπθ,pξ t=0 γt(t + 1) = ℓπθ (1 γ)2 , where (i) uses Eq. (41) and (ii) uses c(st, at) [0, 1] and Assumption 1. Define the following V function. Vθ,ξ(c) := Eπθ,pξ h X t=0 γtc(st, at) s0 ρ i . (46) For any fixed cost function c : S A R, the gradient θVθ,ξ(c) can be rewritten as follows. θVθ,ξ(c) (i) =Eπθ,pξ t=0 γtc(st, at) h=0 θ log πθ(ah|sh) h=0 γh θ log πθ(ah|sh) t=h γt hc(st, at) s,a Pπθ,pξ(sh = s, ah = a|s0 ρ) θ log πθ(a|s) t=h γt hc(st, at) sh = s, ah = a h=0 γh Pπθ,pξ(sh = s, ah = a|s0 ρ) θ log πθ(a|s) t=0 γtc(st, at) s0 = s, a0 = a (ii) = 1 1 γ s,a λθ,ξ(s, a) θ log πθ(a|s)Qθ,ξ(s, a; c), (47) where (i) uses Eq. (5) of [6] and (ii) uses the occupancy measure (1) and defines the following Q function. Qθ,ξ(s, a; c) := Eπθ,pξ t=0 γtc(st, at) s0 = s, a0 = a The above Q function has the following upper bound |Qθ,ξ(s, a; c)| cmax where cmax := maxs,a |c(s, a)| and also satisfies the following Bellman equation. Qθ,ξ(s, a; c) = c(s, a) + γ X s ,a pξ(s |s, a)πθ(a |s )Qθ,ξ(s , a ; c). 
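As a concrete check of this Bellman equation (a minimal sketch, assuming the transition kernel pξ, the policy πθ, and the cost c are available as explicit arrays; these array inputs are assumptions made only for illustration), Qθ,ξ(·, ·; c) can be computed by solving the induced linear system over state-action pairs.

```python
import numpy as np

def q_function(p, pi, c, gamma):
    """Solve Q(s,a) = c(s,a) + gamma * sum_{s',a'} p(s'|s,a) pi(a'|s') Q(s',a')
    for a fixed cost c. Shapes: p is (S, A, S), pi is (S, A), c is (S, A)."""
    S, A = c.shape
    # State-action transition matrix: P[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s').
    P = (p[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    # (I - gamma * P) Q = c, which has a unique solution since gamma < 1.
    Q = np.linalg.solve(np.eye(S * A) - gamma * P, c.reshape(S * A))
    return Q.reshape(S, A)
```

Since every row of the induced state-action transition matrix sums to one, this representation also recovers the bound |Qθ,ξ(s, a; c)| ≤ cmax/(1 − γ) stated above.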
(50) Therefore, for any θ, θ Θ, ξ, ξ Ξ and fixed cost function c, we have max s,a |Qθ ,ξ (s, a; c) Qθ,ξ(s, a; c)| s ,a |pξ (s |s, a) pξ(s |s, a)|πθ (a |s )|Qθ ,ξ (s , a ; c)| + γ max s,a s ,a pξ(s |s, a)|πθ (a |s ) πθ(a |s )||Qθ ,ξ (s , a ; c)| + γ max s,a s ,a pξ(s |s, a)πθ(a |s )|Qθ ,ξ (s , a ; c) Qθ,ξ(s , a ; c)| 1 γ max s,a pξ ( |s, a) pξ( |s, a) 1 + max s πθ ( |s) πθ( |s) 1 + γ max s,a |Qθ ,ξ (s, a; c) Qθ,ξ(s, a; c)|, where (i) uses Eq. (49). Rearranging the above inequality yields that max s,a |Qθ ,ξ (s, a; c) Qθ,ξ(s, a; c)| (1 γ)2 max s,a pξ ( |s, a) pξ( |s, a) 1 + max s πθ ( |s) πθ( |s) 1 (1 γ)2 ℓpξ|S| ξ ξ + ℓπθ|A| θ θ . (51) where (i) uses Lemma 1. For any θ Θ, ξ Ξ and fixed cost functions c, c , we have max s,a |Qθ,ξ(s, a; c) Qθ,ξ(s, a; c )| (i) Eπθ,pξ t=0 γt|c (st, at) c(st, at)| s0 = s, a0 = a t=0 γt c c = c c where (i) uses Eq. (48). Therefore, Eq. (11) can be proved as follows. θf(λθ ,ξ ) θf(λθ,ξ) = θVθ ,ξ [ λf(λθ ,ξ )] θVθ,ξ[ λf(λθ,ξ)] λθ ,ξ (s, a) λθ,ξ(s, a) θ log πθ (a|s) Qθ ,ξ [s, a; λf(λθ ,ξ )] s,a λθ,ξ(s, a) θ log πθ (a|s) θ log πθ(a|s) Qθ ,ξ [s, a; λf(λθ ,ξ )] s,a λθ,ξ(s, a) θ log πθ(a|s) Qθ ,ξ [s, a; λf(λθ ,ξ )] Qθ,ξ[s, a; λf(λθ ,ξ )] s,a λθ,ξ(s, a) θ log πθ(a|s) Qθ,ξ[s, a; λf(λθ ,ξ )] Qθ,ξ[s, a; λf(λθ,ξ)] (ii) ℓπθℓλ (1 γ)2 X λθ ,ξ (s, a) λθ,ξ(s, a) + Lπθℓλ (1 γ)2 X s,a λθ,ξ(s, a) θ θ s,a λθ,ξ(s, a) γℓλ (1 γ)2 ℓpξ|S| ξ ξ + ℓπθ|A| θ θ + ℓπθ (1 γ)2 X s,a λθ,ξ(s, a) λf(λθ ,ξ ) λf(λθ,ξ) (iii) ℓπθℓλ p |S||A| (1 γ)2 λθ ,ξ λθ,ξ + Lπθℓλ (1 γ)2 θ θ (1 γ)3 ℓpξ|S| ξ ξ + ℓπθ|A| θ θ + ℓπθLλ (1 γ)2 λθ ,ξ λθ,ξ (iv) ℓπθ(Lλ + ℓλ p |S||A|) (1 γ)2 γℓpξ p |S| ξ ξ +ℓπθ p |A| θ θ 1 γ + Lπθℓλ (1 γ)2 θ θ (1 γ)3 ℓpξ|S| ξ ξ + ℓπθ|A| θ θ (v) γℓπθℓpξ p |S| (1 γ)3 (Lλ+2ℓλ p |S||A|) ξ ξ + ℓ2 πθ p |A|(Lλ+ℓλ p |S||A|) (1 γ)3 + Lπθℓλ where (i) uses the gradient (47), (ii) uses Assumptions 1-2 and Eqs. (49), (51) and (52) with cmax = λf(λθ ,ξ ) replaced by its upper bound ℓλ, (iii) uses Assumption 2, (iv) uses Lemma 3 and (v) uses Assumption 1. The proof of Eq. (12) follows the same logic. To elaborate, ξVθ,ξ(c) can be derived as follows in a similar way to the derivation of Eq. (47) ξVθ,ξ(c) :=Eπθ,pξ t=0 γtc(st, at) h=0 ξ log pξ(sh+1|sh, ah) s,a,s Pπθ,pξ(sh = s, ah = a, sh+1 = s |s0 ρ) ξ log pξ(s |s, a) t=h γt hc(st, at) sh = s, ah = a, sh+1 = s ! s,a,s Pπθ,pξ(sh = s, ah = a|s0 ρ)pξ(s |s, a) ξ log pξ(s |s, a) t=1 γtc(st, at) s0 = s, a0 = a, s1 = s ! (i) = 1 1 γ s,a,s λθ,ξ(s, a) ξpξ(s |s, a)[c(s, a) + γVθ,ξ(s ; c)], (53) where (i) uses the occupancy measure (1) and defines the following V function. Vθ,ξ(s ; c) := Eπθ,pξ h X t=0 γtc(st, at) s0 = s i . (54) The above V function has the following upper bound |Vθ,ξ(s, a; c)| cmax where cmax := maxs,a |c(s, a)| and also satisfies the following Bellman equation. Vθ,ξ(s; c) = X a πθ(a|s) h c(s, a) + γ X s pξ(s |s, a)Vθ,ξ(s ; c) i . (56) As a result, max s |Vθ ,ξ (s; c) Vθ,ξ(s; c)| πθ (a|s) πθ(a|s) h c(s, a) + γ X s pξ (s |s, a) Vθ ,ξ (s ; c) i a πθ(a|s) X pξ (s |s, a) pξ(s |s, a) Vθ ,ξ (s ; c) a πθ(a|s) X s pξ(s |s, a) Vθ ,ξ (s ; c) Vθ,ξ(s ; c) i 1 γ ℓπθ|A| θ θ + γℓpξ|S| ξ ξ + γ max s |Vθ ,ξ (s; c) Vθ,ξ(s; c)|, where (i) uses Eq. (55) and Lemma 1. Rearranging the above inequality yields that max s |Vθ ,ξ (s; c) Vθ,ξ(s; c)| cmax (1 γ)2 ℓπθ|A| θ θ + γℓpξ|S| ξ ξ . (57) Similar to Eq. (52), we have max s |Vθ,ξ(s; c ) Vθ,ξ(s; c)| c c Therefore, we can prove Eq. (12) as follows. 
ξf(λθ ,ξ ) ξf(λθ,ξ) = ξVθ ,ξ [ λf(λθ ,ξ )] ξVθ,ξ[ λf(λθ,ξ)] λθ ,ξ (s, a) λθ,ξ(s, a) ξpξ (s |s, a) λf(λθ ,ξ )(s, a) + γVθ ,ξ [s ; λf(λθ ,ξ )] s,a,s λθ,ξ(s, a) ξpξ (s |s, a) ξpξ(s |s, a) λf(λθ ,ξ )(s, a) + γVθ ,ξ [s ; λf(λθ ,ξ )] s,a,s λθ,ξ(s, a) ξpξ(s |s, a) Vθ ,ξ [s; λf(λθ ,ξ )] Vθ,ξ[s; λf(λθ ,ξ )] s,a,s λθ,ξ(s, a) ξpξ(s |s, a) γ Vθ,ξ[s; λf(λθ ,ξ )] Vθ,ξ[s; λf(λθ,ξ)] + λf(λθ ,ξ )(s, a) λf(λθ,ξ)(s, a) (ii) ℓpξ 1 γ s,a,s pξ (s |s, a) λθ ,ξ (s, a) λθ,ξ(s, a) s,a,s λθ,ξ(s, a) Lpξpξ (s |s, a) + ℓ2 pξ ξ ξ s,a,s λθ,ξ(s, a)pξ(s |s, a) ℓλ (1 γ)2 ℓπθ|A| θ θ + γℓpξ|S| ξ ξ s,a,s λθ,ξ(s, a)pξ (s |s, a) " γ λf(λθ ,ξ ) λf(λθ,ξ) 1 γ + λf(λθ ,ξ ) λf(λθ,ξ) (iii) ℓpξℓλ p |S||A| (1 γ)2 λθ ,ξ λθ,ξ + ℓλ Lpξ + ℓ2 pξ|S| (1 γ)3 ℓπθ|A| θ θ + γℓpξ|S| ξ ξ + ℓpξLλ (1 γ)2 λθ ,ξ λθ,ξ (iv) ℓpξ Lλ + ℓλ p (1 γ)2 γℓpξ p |S| ξ ξ +ℓπθ p |A| θ θ 1 γ + ℓλ Lpξ + ℓ2 pξ|S| (1 γ)3 ℓπθ|A| θ θ + γℓpξ|S| ξ ξ |A| Lλ + ℓλ p |S| Lλ + 2ℓλ p (1 γ)3 + ℓλ Lpξ + ℓ2 pξ|S| where (i) uses the gradient (53) and denotes λf(λθ ,ξ )(s, a) = f(λθ ,ξ ) λ(s,a) as the (s, a)-th element of λf(λθ ,ξ ), (ii) uses ξpξ (s |s, a) = pξ (s |s, a) ξ log pξ (s |s, a), Assumptions 1-2 and Eqs. (31), (55), (57) and (58) with cmax = λf(λθ ,ξ ) replaced by its upper bound ℓλ, (iii) uses Assumptions 2 and (iv) uses Eq. (34). G Proof of Proposition 4 Implement one projected gradient step from θ and obtain θ = projΘ θ β θf(λθ,ξ) = θ βG(θ) β (θ, ξ) where the projected gradient G(θ) β (θ, ξ) is defined by Eq. (13). Based on Assumption 4, for any δ [0, δ], there exists θδ Θ such that λθδ,ξ = (1 δ)λθ ,ξ + δλθ (ξ),ξ Vλθ,ξ. Based on Assumption 4, for any δ [0, δ], there exists θδ Θ such that λθδ,ξ = (1 δ)λθ ,ξ + δλθ (ξ),ξ Vλθ ,ξ. Then we have, θδ θ (i) ℓλ 1 λθδ,ξ λθ,ξ = δℓλ 1 λθ (ξ),ξ λθ,ξ (ii) 2δℓλ 1, (59) where (i) uses the Lθ,θ-smoothness of f(λ ,ξ) based on Proposition 3, (ii) uses Lemma 4. By applying Lemma 6 to X = Θ, x = θ β θf(λθ,ξ), x = θδ Θ (so proj X (x) = θ = θ βG(θ) β (θ, ξ)), we obtain that (θδ θ ) [G(θ) β (θ, ξ) θf(λθ,ξ)] 0. (60) Then on one hand, f(λθδ,ξ) has the following lower bound. f(λθδ,ξ) (i) f(λθ ,ξ) + θf(λθ ,ξ) (θδ θ ) Lθ,θ (ii) f(λθ ,ξ) + θf(λθ ,ξ) θf(λθ,ξ) + G(θ) β (θ, ξ) (θδ θ ) Lθ,θ (iii) f(λθ ,ξ) Lθ,θ θ θ + G(θ) β (θ, ξ) θδ θ Lθ,θ (iv) f(λθ ,ξ) 2δℓλ 1 βLθ,θ + 1 G(θ) β (θ, ξ) Lθ,θδ2ℓ2 λ 1, (61) where (i) and (iii) use Lθ,θ-smoothness of f(λ ,ξ) based on Proposition 3, (ii) uses Eq. (60), and (iv) uses Lemma 5, θδ θ 2δℓλ 1 (obtained in the same way as Eq. (59)) and θ θ = βG(θ) β (θ, ξ). On the other hand, f(λθδ,ξ) has the following upper bound since f is convex. f(λθδ,ξ) (1 δ)f(λθ ,ξ) + δf(λθ (ξ),ξ) = (1 δ)f(λθ ,ξ) + δ min θ Θ f(λθ ,ξ). (62) The above two inequalities (61) and (62) imply that f(λθ ,ξ) min θ Θ f(λθ ,ξ) lim sup δ +0 1 δ f(λθ ,ξ) f(λθδ,ξ) 2ℓλ 1 βLθ,θ + 1 G(θ) β (θ, ξ) . (63) Finally, we prove Eq. (20) as follows. f(λθ,ξ) (i) f(λθ ,ξ) + ℓθ θ θ (ii) min θ Θ f(λθ ,ξ) + 2ℓλ 1 βLθ,θ + 1 + βℓθ G(θ) β (θ, ξ) . where (i) uses Eq. (10) which implies that fλ ,ξ is ℓθ-Lipschitz, (ii) uses θ θ = βG(θ) β (θ, ξ) and Eq. (63). H Proof of Proposition 5 We will first prove that ξδ Ξ. Note that the s-rectangular ambiguity set Ξ can be expressed as a Cartesian product of Ξs for all s S. Hence, as Ξ is convex, Ξs is convex for all s S. Therefore, for any s S, ξδ(s, , ) Ξs since it is a convex combination of ξ0(s, , ) Ξs and ξ1(s, , ) Ξs defined by Eq. (21), so ξδ Ξ. Next, we will prove λθ,ξδ = δλθ,ξ1 + (1 δ)λθ,ξ0. Denote λδ := δλθ,ξ1 + (1 δ)λθ,ξ0, so the aim becomes to prove λθ,ξδ = λδ. Based on Lemma 2, it suffices to prove the following equation. 
λδ(s , a ) = h (1 γ)ρ(s ) + γ X s,a λδ(s, a)ξδ(s |s, a) i πθ(a |s ), s S, a A. (64) For each s S, consider the following two cases. (Case 1): λθ,ξ1(s) > 0 or λθ,ξ0(s) > 0. Note that λδ(s, a)ξδ(s, a, s ) (i) = δλθ,ξ1(s, a) + (1 δ)λθ,ξ0(s, a) δλθ,ξ1(s)ξ1(s, a, s ) + (1 δ)λθ,ξ0(s)ξ0(s, a, s ) δλθ,ξ1(s) + (1 δ)λθ,ξ0(s) (ii) = πθ(a|s) δλθ,ξ1(s) + (1 δ)λθ,ξ0(s) δλθ,ξ1(s)ξ1(s, a, s ) + (1 δ)λθ,ξ0(s)ξ0(s, a, s ) δλθ,ξ1(s) + (1 δ)λθ,ξ0(s) (iii) = δλθ,ξ1(s, a)ξ1(s, a, s ) + (1 δ)λθ,ξ0(s, a)ξ0(s, a, s ), (65) where (i) uses Eq. (21) and λδ := δλθ,ξ1 + (1 δ)λθ,ξ0, (ii) and (iii) use Eq. (33). (Case 2): λθ,ξ0(s) = λθ,ξ1(s) = 0. In this case, λδ(s) = δλθ,ξ1(s)+(1 δ)λθ,ξ0(s) = 0 and thus λδ(s, a) = λθ,ξ1(s, a) = λθ,ξ0(s, a) = 0 for any a A, so Eq. (65) also holds for any choice of ξδ(s, , ). Therefore, we can prove Eq. (64) as follows. h (1 γ)ρ(s ) + γ X s,a λδ(s, a)ξδ(s |s, a) i πθ(a |s ) (i) = h (1 γ)ρ(s ) + γ X s,a [δλθ,ξ1(s, a)ξ1(s, a, s ) + (1 δ)λθ,ξ0(s, a)ξ0(s, a, s )] i πθ(a |s ) =δ h (1 γ)ρ(s ) + γ X s,a λθ,ξ1(s, a)ξ1(s, a, s ) i πθ(a |s ) + (1 δ) h (1 γ)ρ(s ) + γ X s,a λθ,ξ0(s, a)ξ0(s, a, s ) i πθ(a |s ) (ii) = δλθ,ξ1(s, a) + (1 δ)λθ,ξ0(s, a) = λδ(s, a), where (i) uses Eq. (65) and λδ := δλθ,ξ1 + (1 δ)λθ,ξ0 and (ii) applies Lemma 2 to λθ,ξ1 and λθ,ξ0. I Proof of Proposition 6 Fix any θ Θ, and there exists at least one ξ arg maxξ Ξf(λθ,ξ ). If ξ V (Ξ) := s SV (Ξs), then this proposition directly holds. Hence, we focus on the case where ξ / V (Ξ), which means ξ (s0) / V (Ξs0) for at least one s0 S. Based on Assumption 6-7, there exists a probability vector ν := [ν1, . . . , νMs0] such that ξ (s0) = PMs0 m=1 νmξ(s0) m , where we denote ξ (s) := ξ (s, , ) Ξs for all ξ Ξ. Without loss of generality, we assume ν1 = max1 m Ms νm (Otherwise we can make this assumption hold by permutating the elements in each Ξs.). For any ϵ > 0, define ξ 1, ξ(ϵ) ( S)S A such that ξ 1(s) = ξ(ϵ)(s) = ξ (s) for any s = s0, while at s = s0 we define ξ 1(s0) = ξ(s0) 1 V (Ξs0) and ξ(ϵ)(s0) =ξ (s0) + ϵ[ξ (s0) ξ 1(s0)] = [(1 + ϵ)ν1 ϵ]ξ(s0) 1 + m=2 (1 + ϵ)νmξ(s0) m , which implies that ξ = ϵξ(ϵ) + ξ 1 1 + ϵ . (66) It is easily seen that ξ 1 Ξ by its definition. Since limϵ +0[(1 + ϵ)ν1(s) ϵ] = ν1(s) = max1 m Ms νm(s) > 0, there exists a sufficiently small ϵ > 0 such that [(1 + ϵ)ν1(s) ϵ, (1 + ϵ)ν2(s), . . . , (1+ϵ)νMs(s)] [0, 1]|Ξs| is a probability vector and thus ξ(ϵ) Ξ. Furthermore, select arbitrary δ [0, 1] if λθ,ξ(ϵ)(s0) = λθ,ξ 1 (s0) = 0 and the following δ otherwise. δ = λθ,ξ(ϵ)(s0) ϵλθ,ξ 1 (s0) + λθ,ξ(ϵ)(s0), (67) where λθ,ξ(s) := P a A λθ,ξ(s, a) is defined as the state occupancy measure for any s S, θ Θ and ξ Ξ. Then it can be directly verified that ξ satisfies the following equality. ξ (s, a, s ) = arbitrary as long as ξ (s, a, ) S, if λθ,ξ(ϵ)(s)=λθ,ξ 1 (s)=0 δλθ,ξ 1 (s)ξ 1(s, a, s )+(1 δ)λθ,ξ(ϵ)(s)ξ(ϵ)(s, a, s ) δλθ,ξ 1 (s)+(1 δ)λθ,ξ(ϵ)(s) , otherwise . Hence, based on Proposition 5, ξ satisfies λθ,ξ = δλθ,ξ 1 + (1 δ)λθ,ξ(ϵ). (68) On one hand, f(λθ,ξ 1 ) f(λθ,ξ ) and f(λθ,ξ(ϵ)) f(λθ,ξ ) since ξ ξ arg maxξ Ξf(λθ,ξ ). On the other hand, the above Eq. (68) along with convexity of f implies that f(λθ,ξ ) δf(λθ,ξ 1 ) + (1 δ)f(λθ,ξ(ϵ)). Therefore, f(λθ,ξ 1 ) = f(λθ,ξ(ϵ)) = f(λθ,ξ ) = maxξ Ξ f(λθ,ξ ). If ξ 1 V (Ξ), then the proof is done. Otherwise, note that ξ 0(s0) / V (Ξs0) while ξ 1(s0) V (Ξs0), and ξ 1(s) = ξ 0(s) for any s = s0. Hence, in the same way, we can obtain the sequence ξ 2, ξ 3, . . . 
, ξ N that satisfies the following conditions by changing non-vertex into vertex at one state each time until no non-vertex remains (i.e., until the condition 2 below holds): 1. For 1 k N 1, ξ k(sk) / V (Ξsk) while ξ k+1(sk) V (Ξsk), and ξ k(s) = ξ k+1(s) for any s = sk. 2. ξ N V (Ξ). 3. For 1 k N, f(λθ,ξ k) = maxξ Ξ f(λθ,ξ ). As a result, we find the optimal vertex ξ N V (Ξ) arg maxξ Ξf(λθ,ξ ), which concludes the proof. J Proof of Proposition 7 Based on Assumption 8, for any θ Θ and δ [0, δ], there exists θδ Θ such that λθδ,ξ = (1 δ)λθ,ξ + δλθ ,ξ. θδ θ (i) ℓλ 1 λθδ,ξ λθ,ξ = δℓλ 1 λθ ,ξ λθ,ξ (ii) 2δℓλ 1, (69) where (i) uses the Lθ,θ-smoothness of f(λ ,ξ) based on Proposition 3, (ii) uses Lemma 4. Hence, on one hand, using Lθ,θ-smoothness of f(λ ,ξ) based on Proposition 3, f(λθδ,ξ) has the following lower bound. f(λθδ,ξ) f(λθ,ξ) + θf(λθ,ξ) (θδ θ) Lθ,θ 2 θδ θ 2, (70) On the other hand, f(λθδ,ξ) has the following upper bound since f is convex. f(λθδ,ξ) (1 δ)f(λθ,ξ) + δf(λθ ,ξ) = (1 δ)f(λθ,ξ) + δ min θ Θ max ξ Ξ f(λθ ,ξ ). (71) Combining the above two inequalities, we obtain that θf(λθ,ξ) θδ θ θδ θ f(λθ,ξ) f(λθδ,ξ) (i) δ[f(λθ,ξ) minθ Θ maxξ Ξ f(λθ ,ξ )] (ii) f(λθ,ξ) minθ Θ maxξ Ξ f(λθ ,ξ ) where (i) uses Eq. (71), and (ii) uses Eq. (69) and assumes f(λθ,ξ) minθ Θ maxξ Ξ f(λθ ,ξ ) without loss of generality (otherwise, Eq. (24) trivially holds). Based on the Bolzano Weierstrass theorem, there exists a sequence δn +0 such that θδn θ θδn θ d RdΘ as n + . Hence, d = 1 and we can conclude the proof by letting δ = δn and n + in the above inequality. K Proof of Proposition 8 K.1 Proof of Assumptions 1, 2 and 3 Proposition 2 indicates that the utility function f defined by Eq. (7) is convex, which proves Assumption 3. To prove Assumptions 1 and 2, it suffices to prove that the following functions have bounded first-order and second-order derivatives for any (s, a, s ) S A S. 1. log πθ(a|s) = θs,a log P a exp(θs,a ) as a function of θ Θ = [ R, R]|S| |A|. 2. log pξ(s |s, a) = log ξ(s, a, s ) as a function of ξ Ξ = {ξ ( S)S A : ξ (s, :, : ) bξ(s, :, :) p αs, s S} where p {1, } and bξ(s, a, s ) > αs, s, a, s . 3. f(λ) = P s,a λ(s, a) c(s, a) + µ log λ(s,a) P a λ(s,a ) (repeat Eq. (7)) as a function of λ Λ = {λθ,ξ : θ Θ, ξ Ξ}. For any θ Θ = [ R, R]|S| |A|, we have πθ(a|s) = exp(θs,a) P a exp(θs,a ) πmin def = exp( R) exp( R) + (|A| 1) exp(R) > 0. (73) When p = 1 or p = , any ξ Ξ satisfies pξ(s |s, a) = ξ(s, a, s ) bξ(s, a, s ) ξ(s, :, :) bξ(s, :, :) p bξ(s, a, s ) αs ξmin def = min s,a,s [bξ(s, a, s ) αs] > 0, (74) where ξmin > 0 since it is minimum over finitely many positive numbers bξ(s, a, s ) αs. Then for any θ Θ and ξ Ξ, we have λθ,ξ(s, a) def = (1 γ) t=0 γt Pπθ,pξ(st = s, at = a|s0 ρ) γ(1 γ)Pπθ,pξ(s1 = s, a1 = a|s0 ρ) s ,a ρ(s )πθ(a |s )pξ(s|s , a )πθ(a|s) s ,a ρ(s )πθ(a |s )ξminπmin = λmin def = ξminπminγ(1 γ) > 0. (75) Finally, for any (s, a, s ), (s1, a1, s 1), (s2, a2, s 2) S A S, we obtain all the derivative bounds as follows, where 1{ } is an indicator function. log πθ(a|s) θ(s1, a1) = 1{s1 = s} 1{a1 = a} πθ(a1|s) [ 1, 1]. 2 log πθ(a|s) θ(s1, a1) θ(s2, a2) = 1{s1 = s}πθ(a1|s) log πθ(a|s) θ(s2, a2) [ 1, 1]. log pξ(s |s, a) ξ(s1, a1, s 1) = 1 ξ(s, a, s )1{(s1, a1, s 1) = (s, a, s )} [0, ξ 1 min]. 2 log pξ(s |s, a) ξ(s1, a1, s 1) ξ(s2, a2, s 2) = ξ 2(s, a, s )1{(s1, a1, s 1)=(s2, a2, s 2)=(s, a, s )} [ ξ 2 min, 0]. f(λ) λ(s1, a1) =c(s1, a1)+log λ(s1, a1) P a λ(s1, a ) +1 λ(s1, a1) P a λ(s1, a ) h cmin log(|A|λmin), cmax+1 i , where cmin = mins,a c(s, a) and cmax = maxs,a c(s, a). 
2f(λ) λ(s1, a1) λ(s2, a2) =1{s1 = s2} h1{a1 = a2} λ(s1, a1) 1 P a λ(s1, a ) 1{a1 = a2} λ(s1, a1) a λ(s1, a )]2 h 1 |A|λmin 1 |A|2λ2 min , 1 λmin + 1 |A|λmin K.2 Proof of Assumptions 5, 6 and 7 about ambiguity set Ξ It is straightforward to verify that the ambiguity set Ξ = {ξ ( S)S A : ξ(s, :, :) bξ(s, :, :) p αs, s S} (p {1, }) is convex and compact, which proves Assumption 5. Assumption 6 can be proved easily by letting Ξs = {ξs ( S)A : ξs bξ(s, :, :) p αs} (p {1, }). Then we will prove Assumption 7, that is, Ξs = {ξs ( S)A : ξs bξ(s, :, :) p αs} (p {1, }) is a polyhedron. Based on Definition 1.1 and Theorem 1.26 of [9], it is equivalent to prove that Ξs is bounded (already proved above) and is an intersection of finitely many closed half-planes (obvious based on the definitions of 1 and ). K.3 Proof of Assumptions 4 and 8 We will only prove Assumption 8, since Assumption 4 can be proved in the same way. Fix any ξ Ξ and θ Θ = [ R, R]|S| |A|. Then we select any θ = θ (θ) arg minθ Θmin θ θ where Θmin := arg minθ Θ Γ(θ ) is a compact set since Γ is a continuous function. Define the following notations. λ(δ) θ,ξ = (1 δ)λθ,ξ + δλθ ,ξ for δ [0, 1] (we select δ = 1). Policy π(δ) θ,ξ defined as π(δ) θ,ξ(a|s) = λ(δ) θ,ξ(s,a) λ(δ) θ,ξ(s) where λ(δ) θ,ξ(s) = P a λ(δ) θ,ξ(s, a ) (Note that λ(δ) θ,ξ(s) λ(δ) θ,ξ(s, a) λmin > 0, so λ(δ) θ,ξ(s) can be the denominator and π(δ) θ,ξ(a|s) > 0). θ(δ) θ,ξ R|S| A with each entry defined as follows. (θ(δ) θ,ξ)s,a = log (1 δ)λθ,ξ(s, a) + δλθ ,ξ(s, a) (1 δ)λθ,ξ(s, a) exp( θs,a) + δλθ ,ξ(s, a) exp( θ s,a) which is valid since π(δ) θ,ξ(a|s) > 0 and πθ(a|s) πmin > 0. Uθ,ξ = {θ(δ) θ,ξ : δ [0, 1]} R|S| |A|. Vθ,ξ = {λ(δ) θ,ξ : δ [0, 1]} S A. Now, it remains to prove the following two statements. (P1): θ(0) θ,ξ = θ and θ(1) θ,ξ = θ . (P2): Uθ,ξ Θ = [ R, R]|S| |A|. (P3): λ(δ) θ,ξ = λθ(δ) θ,ξ,ξ. (P4): The mapping θ(δ) θ,ξ λ(δ) θ,ξ from Uθ,ξ to Vθ,ξ is a bijection and is Lipschitz continuous in both directions. (P1) obviously follows from Eq. (76). Note that (θ(δ) θ,ξ)s,a defined by Eq. (76) is a monotone function of δ [0, 1], and (P1) implies that (θ(0) θ,ξ)s,a = θs,a [ R, R] and θ(1) θ,ξ = θ s,a [ R, R]. Therefore, (θ(δ) θ,ξ)s,a [ R, R] which proves (P2). To prove (P3), rewrite Eq. (76) as follows. (θ(δ) θ,ξ)s,a = log[λ(δ) θ,ξ(s, a)] log (1 δ)λθ,ξ(s)πθ(a|s) exp( θs,a) + δλθ ,ξ(s)πθ (a|s) exp( θ s,a) = log[λ(δ) θ,ξ(s, a)] log h (1 δ)λθ,ξ(s) P a exp(θs,a ) + δλθ ,ξ(s) P a exp(θ s,a ) = log λ(δ) θ,ξ(s)π(δ) θ,ξ(a|s) log h (1 δ)λθ,ξ(s) P a exp(θs,a ) + δλθ ,ξ(s) P a exp(θ s,a ) = log π(δ) θ,ξ(a|s) + cδ(s), (77) where we denote cδ(s) := log λ(δ) θ,ξ(s) log h (1 δ)λθ,ξ(s) P a exp(θs,a ) + δλθ ,ξ(s) P a exp(θ s,a ) i . Therefore, πθ(δ) θ,ξ(a|s) = exp (θ(δ) θ,ξ)s,a a exp (θ(δ) θ,ξ)s,a = π(δ) θ,ξ(a|s) = λ(δ) θ,ξ(s, a) λ(δ) θ,ξ(s) . (78) Note that for any θ R|S| |A|, we have λθ ,ξ(s ) = X a λθ ,ξ(s , a ) h (1 γ)ρ(s ) + γ X s,a λθ ,ξ(s, a)pξ(s |s, a) i πθ (a |s ) =(1 γ)ρ(s ) + γ X s,a λθ ,ξ(s, a)pξ(s |s, a) (79) where (i) uses Lemma 2. Then we have h (1 γ)ρ(s ) + γ X s,a λ(δ) θ,ξ(s, a)pξ(s |s, a) i πθ(δ) θ,ξ(a |s ) λ(δ) θ,ξ(s , a ) (i) =πθ(δ) θ,ξ(a |s ) h (1 γ)ρ(s ) + γ X s,a λ(δ) θ,ξ(s, a)pξ(s |s, a) λ(δ) θ,ξ(s ) i (ii) = δπθ(δ) θ,ξ(a |s ) h (1 γ)ρ(s ) + γ X s,a λθ,ξ(s, a)pξ(s |s, a) λθ,ξ(s ) i + (1 δ)πθ(δ) θ,ξ(a |s ) h (1 γ)ρ(s ) + γ X s,a λθ ,ξ(s, a)pξ(s |s, a) λθ ,ξ(s ) i where (i) uses Eq. (78), (ii) uses λ(δ) θ,ξ = (1 δ)λθ,ξ + δλθ ,ξ and (iii) uses the Eq. (79) for θ {θ , θ}. Based on Lemma 2, the equality above implies (P3). Next, we prove (P4). 
Note that the mapping from θ to λ(δ) θ,ξ = λθ(δ) θ,ξ,ξ is Lipschitz continuous based on Lemma 3. Hence, we only need to consider its reverse mapping. If λθ ,ξ = λθ,ξ, then πθ = πθ. Hence, θ Θmin := arg minθ Θ Γ(θ ) because Γ(θ ) = maxξ Ξ f(λθ ,ξ ) can be seen as a function of πθ . Therefore, θ = θ which means Uθ,ξ = {θ} and Vθ,ξ = {λθ,ξ} are singletons. In this case, (P4) trivially holds. Therefore, we focus on the case where λθ ,ξ = λθ,ξ. Before proving (P4), we will prove the following statement. (P5) There exists a constant L > 0 such that θ θ L λθ ,ξ λθ,ξ for any θ Θ = [ R, R]|S| |A|. Define θ θ R|S| |A| such that θ s,a = θ s,a + 1 |A| P a (θs,a θ s,a ). Then it can be easily verified that πθ ,ξ = πθ ,ξ, (80) X a θ s,a = X a θs,a . (81) Note that πθ(a|s) = exp(θs,a) P a exp(θs,a ) and πθ (a|s) = exp(θ s,a) P a exp(θ s,a ), so θ s,a θs,a = log πθ (a|s) log πθ(a|s) + u(s) where u(s) := log P a exp(θ s,a ) log P a exp(θs,a ) . Then combining with Eq. (81), we obtain that u(s) = 1 |A| P a [log πθ(a |s) log πθ (a |s)]. Therefore, |θ s,a θs,a| | log πθ (a|s) log πθ(a|s)| + 1 log πθ(a |s) log πθ (a |s) (i) π 1 min|πθ (a|s) πθ(a|s)| + π 1 min |A| πθ(a |s) πθ (a |s) 2π 1 min max a πθ (a |s) πθ(a |s) 2π 1 min max a λθ ,ξ(s, a ) λθ ,ξ(s) λθ,ξ(s, a ) =2π 1 min max a h λθ ,ξ(s, a ) λθ,ξ(s, a ) λθ ,ξ(s) + λθ,ξ(s, a ) λθ ,ξ(s)λθ,ξ(s)[λθ,ξ(s) λθ ,ξ(s)] (ii) 2 |A|λminπmin max a |λθ ,ξ(s, a ) λθ,ξ(s, a )| + 2 |A|2λ2 minπmin X a |λθ ,ξ(s, a ) λθ,ξ(s, a )| 4 λθ ,ξ λθ,ξ |A|λ2 minπmin , (82) where (i) uses Eqs. (73) and (80) which imply that πθ(a|s), πθ (a|s) = πθ (a|s) [πmin, 1] for θ, θ Θ, (ii) uses λθ ,ξ(s), λθ,ξ(s) |A|λmin for θ, θ Θ as a result of Eq. (75). Based on the definition of θ , we have maxa θ s,a mina θ s,a = maxa θs,a mina θs,a 2R. Therefore, for each s S, there are three cases: R mina θ s,a maxa θ s,a R, maxa θ s,a > R and mina θ s,a < R, and we can define θ R|S| |A| as follows θ s,a, R min a θ s,a max a θ s,a R θ s,a max a θ s,a + R, max a θ s,a > R θ s,a min a θ s,a R, min a θ s,a < R It can be easily verified that the θ defined above satisfies θ Θ = [ R, R]|S| |A| (since maxa θ s,a mina θ s,a 2R) and πθ = πθ = πθ (the second = comes from Eq. (80)). Therefore, θ Θmin and thus (i) θ θ θ θ + θ θ , (83) where (i) uses θ Θmin and θ arg minθ Θmin θ θ . To further obtain the upper bound of θ θ in the above inequality, we discuss the three aforementioned cases. (Case I): When R mina θ s,a maxa θ s,a R, we have θ s,a θ s,a = 0. (Case II): When maxa θ s,a > R, we have 0 < θ s,a θ s,a = max a θ s,a R (i) max a θ s,a max a θs,a θ θ , where (i) uses θ Θ = [ R, R]|S| |A|. (Case III): When mina θ s,a < R, we have 0 < θ s,a θ s,a = min a θ s,a R (i) min a θs,a min a θ s,a θ θ , where (i) uses θ Θ = [ R, R]|S| |A|. Summarizing the above three cases, we obtain that θ θ θ θ and therefore Eq. (83) implies that θ θ θ θ + θ θ 2 θ θ (i) 8 λθ ,ξ λθ,ξ |A|λ2 minπmin def = L , (84) where (i) uses Eq. (82). This proves (P5). We consider Eq. (76) as a function of δ and take its derivative as follows. (θ(δ) θ,ξ)s,a δ = λθ ,ξ(s, a) λθ,ξ(s, a) (1 δ)λθ,ξ(s, a) + δλθ ,ξ(s, a) + λθ ,ξ(s, a) exp( θ s,a) λθ,ξ(s, a) exp( θs,a) (1 δ)λθ,ξ(s, a) exp( θs,a) + δλθ ,ξ(s, a) exp( θ s,a) (i) λθ ,ξ λθ,ξ λmin + |λθ ,ξ(s, a)[exp( θ s,a) exp( θs,a)] + exp( θs,a)[λθ ,ξ(s, a) λθ,ξ(s, a)]| λmin exp( R) (ii) λθ ,ξ λθ,ξ λmin + exp(R)|θ s,a θs,a| + exp(R)|λθ ,ξ(s, a) λθ,ξ(s, a)| λmin exp( R) (iii) λθ ,ξ λθ,ξ λmin + exp(2R) L λθ ,ξ λθ,ξ + λθ ,ξ λθ,ξ 2(L + 1) exp(2R) λmin λθ ,ξ λθ,ξ (85) where (i) uses θ, θ Θ = [ R, R]|S| |A| and Eq. 
(75), (ii) uses θ, θ Θ = [ R, R]|S| |A|, (iii) uses Eq. (84). Therefore, for any δ, δ [0, 1], we have θ(δ ) θ,ξ θ(δ) θ,ξ 2(L + 1) exp(2R) λmin λθ ,ξ λθ,ξ |δ δ| =2(L + 1) exp(2R) λmin λ(δ ) θ,ξ λ(δ) θ,ξ , which proves the statement (P4) and thus proves Assumption 8. Assumption 4 can be proved in the same way simply by replacing θ with any θ (ξ) arg minθ Θ f(λθ ,ξ). L Proof of Proposition 9 The estimated occupancy measure (40) is an unbiased estimator of the following truncated occupancy measure with truncation level Hλ. λ(Hλ) θ,ξ (s, a) def = (1 γ) t=0 γt Pπθ,pξ(st = s, at = a|s0 ρ). (86) Denote bλ(τ (λ)) := bλ(τ (λ); s, a) s,a S A R|S||A|, λ(Hλ) θ,ξ := λ(Hλ) θ,ξ (s, a) s,a S A R|S||A|, λθ,ξ := λθ,ξ(s, a) s,a S A R|S||A|, Then the estimation error of occupancy measure has the following upper bound. Eπθ,pξ bλ(τ (λ)) λθ,ξ 2 (i) = Varπθ,pξ bλ(τ (λ)) + Eπθ,pξ λ(Hλ) θ,ξ λθ,ξ 2 (ii) = 1 mλ Var bλ1(τ (λ) 1 ) + X t=Hλ γt Pπθ,pξ(st = s, at = a|s0 ρ) 2 (iii) 1 mλ E bλ1(τ (λ) 1 ) 2 + h (1 γ) t=Hλ γt Pπθ,pξ(st = s, at = a|s0 ρ) i (iv) 1 mλ + γ2Hλ, (87) where (i) uses E X 2 = Var X + EX 2 for random vector X := bλ(τ (λ)) λθ,ξ, (ii) uses Eqs. (1) and (86) and uses the fact that bλ defined by Eq. (40) is average among the mλ i.i.d. individual estimators bλi(τ (λ) i ; s, a) := (1 γ) PHλ 1 h=0 γh1{s(λ) i,h = s, a(λ) i,h = a} for i = 1, . . . , mλ, (iii) uses Var X E X 2 for random vector X := bλ1(τ (λ)) and Pπθ,pξ(st = s, at = a|s0 ρ) [0, 1], and (iv) uses 0 bλ1(τ (λ) i 2 P s,a bλ1(τ (λ) i ; s, a) = 1 and P s,a Pπθ,pξ(st = s, at = a|s0 ρ) = 1. Define the cost function as c := λf(λθ,ξ). The error of estimating c by bc := λf[bλ(τ (λ))] has the following upper bounds. Eπθ,pξ bc c 2 =Eπθ,pξ λf bλ(τ (λ)) λf(λθ,ξ) 2 (i) L2 λEπθ,pξ bλ(τ (λ)) λθ,ξ 2 (ii) L2 λ 1 mλ + γ2Hλ (88) Eπθ,pξ bc c 2 |S||A|Eπθ,pξ bc c 2 L2 λ|S||A| 1 mλ + γ2Hλ (89) where (i) uses Assumption 2 and (ii) uses Eq. (87). Also, c and bc have the following norm bound based on Assumption 2. max c , bc ℓλ. (90) Note that g(θ)(τ (θ), θ, ξ, c) defined by Eq. (41) (replace bc with c) is the average of the following mθ i.i.d. individual stochastic gradients. g(θ) i (τ (θ) i , θ, ξ, c) = t=0 γtc(s(θ) i,t , a(θ) i,t ) h=0 θ log πθ(a(θ) i,h | s(θ) i,h). (91) Then it can be easily seen that g(θ) i (τ (θ) i , θ, ξ, c) defined above is an unbiased estimator of the following truncated policy gradient. θf (Hθ)(λθ,ξ) = Eπθ,pξ t=0 γtc(st, at) h=0 θ log πθ(ah|sh) Also, g(θ) i (τ (θ) i , θ, ξ, c) can be bounded as follows by using Eq. (90) and Assumption 1. g(θ) i (τ (θ) i , θ, ξ, c) t=0 γt|c(s(θ) i,t , a(θ) i,t )| h=0 θ log πθ(a(θ) i,h | s(θ) i,h) t=0 (t + 1)γtℓλℓπθ ℓλℓπθ (1 γ)2 . (93) Therefore, we can prove Eq. (43) as follows, and Eq. (44) can be proved using the same logic. Eπθ,pξ g(θ)(τ (θ), θ, ξ, bc) θf(λθ,ξ) 2 3Eπθ,pξ g(θ)(τ (θ), θ, ξ, bc) g(θ)(τ (θ), θ, ξ, c) 2 + 3Eπθ,pξ g(θ)(τ (θ), θ, ξ, c) θf (Hθ)(λθ,ξ) 2 + 3 θf (Hθ)(λθ,ξ) θf(λθ,ξ) 2 (i) 3Eπθ,pξ 1 t=0 γt bc(s(θ) i,t , a(θ) i,t ) c(s(θ) i,t , a(θ) i,t ) t X h=0 θ log πθ(a(θ) i,h | s(θ) i,h) 2 + 3Varπθ,pξ[g(θ)(τ (θ), θ, ξ, c)] + 3 Eπθ,pξ t=Hθ γtc(st, at) h=0 θ log πθ(ah|sh) 2 (ii) 3Eπθ,pξ h 1 t=0 γt bc c (t + 1)ℓπθ i2 mθ Varπθ,pξ[g(θ) 1 (τ (θ) 1 , θ, ξ, c)] + 3 + X t=Hθ γtℓλℓπθ(t + 1) 2 (iii) 3L2 λ|S||A| 1 mλ + γ2Hλ ℓπθ (1 γ)2 2 + 3 ℓλℓπθγHθ[1 + Hθ(1 γ)] 3ℓ2 πθ (1 γ)4 h L2 λ|S||A| 1 mλ + γ2Hλ + ℓ2 λ mθ + ℓ2 λ[1 + Hθ(1 γ)]2γ2Hθ i , where (i) uses Eqs. (8),(41),(92) and θf (Hθ)(λθ,ξ) = Eπθ,pξg(θ)(τ (θ), θ, ξ, c), (ii) uses Eq. 
(90), Assumption 1 and g(θ)(τ (θ), θ, ξ, c) = 1 mθ Pmθ i=1 g(θ) i (τ (θ) i , θ, ξ, c) where {g(θ) i (τ (θ) i , θ, ξ, c)}mθ i=1 are independent, (iii) uses Eqs. (89) and (93). M Proof of Theorem 1 The policy gradient (8) is proved by Eqs. (5) and (6) of [6]. We will next prove the transition gradient (9). Under the transition kernel pξ and policy πθ, the probability of obtaining a certain trajectory τt = {sh, ah}t h=0 {st+1} with initial state distribution ρ can be expressed as follows. Pπθ,pξ(τt) = ρ(s0) πθ(ah|sh)pξ(sh+1|sh, ah) . Hence, the gradient of the log of the above probability can be computed as follows. ξ log Pπθ,pξ(τt) = h=0 ξ log pξ(sh+1|sh, ah). (94) Denote c := λf(λθ,ξ). Then the transition gradient (9) can be obtained as follows. ξf(λθ,ξ) = λf(λθ,ξ) ξλθ,ξ s,a c(s, a) ξ t=0 γt Pπθ,pξ(st = s, at = a) Z γtc(st, at)Pπθ,pξ(τt)dτt Z γtc(st, at) ξPπθ,pξ(τt)dτt Z γtc(st, at) ξPπθ,pξ(τt) Pπθ,pξ(τt)dτt t=0 Eτt Pπθ,pξ γtc(st, at) ξ log Pπθ,pξ(τt) s0 ρ (i) = Eπθ,pξ t=0 γtc(st, at) h=0 ξ log pξ(sh+1|sh, ah) where (i) uses Eq. (94). N Proof of Theorem 2 Based on Proposition 9, we define the following error terms E(ξ) j and E(θ) j (i, j {1, 2, 3, 4}) to bound the estimation errors of the stochastic gradients in lines 6, 10, 17 and 21 of Algorithm 1 respectively. E(θ) j := 3ℓ2 πθ (1 γ)4 h L2 λ|S||A| 1 m(j) λ + γ2H(j) λ + ℓ2 λ m(j) θ + ℓ2 λ[1 + H(j) θ (1 γ)]2γ2H(j) θ i , (95) E(ξ) j := 3ℓ2 pξ (1 γ)4 h L2 λ|S||A| 1 m(j) λ + γ2H(j) λ + ℓ2 λ m(j) ξ + ℓ2 λ[1 + H(j) ξ (1 γ)]2γ2H(j) ξ i . (96) Then we prove Theorem 2 in the following procedures. N.1 Convergence Rate of Inner Update Step (14) of the First Original Phase Lemma 7. Suppose Assumptions 1-4 hold. Apply the projected stochastic gradient descent step (14) in Algorithm 1 to the policy optimization problem minθ Θ f(λθ,ξk) with fixed ξk Ξ. Select stepsize α = 1 2Lθ,θ and initialization θk,0 = θ0. Then the output θk := θk,T globally converges at the following rate for any δ [0, δ]. E f(λθk,ξk) min θ Θ f(λθ,ξk) ξk (1 δ)T E f(λθ0,ξk) min θ Θ f(λθ,ξk) ξk + 4Lθ,θℓ2 λ 1δ + E(θ) 1 δLθ,θ , (97) where E(θ) 1 is defined in Eq. (95). Proof. For any θk,t Θ in the update rule (14), there exists at least one optimal policy θ k,t arg minθ Θf(λθ ,ξk) such that for any δ [0, δ], there exists θ(δ) k,t Θ that satisfies λθ(δ) k,t,ξk = (1 δ)λθk,t,ξk + δλθ k,t,ξk. Since f is convex, we have f(λθ(δ) k,t,ξk) (1 δ)f(λθk,t,ξk) + δf(λθ k,t,ξk) = (1 δ)f(λθk,t,ξk) + δ min θ Θ f(λθ,ξk). (98) Then, we analyze the optimization progress of the stochastic gradient descent step (14) as follows. 
f(λθk,t+1,ξk) (i) f(λθk,t,ξk) + θf(λθk,t,ξk) (θk,t+1 θk,t) + Lθ,θ 2 θk,t+1 θk,t 2 =f(λθk,t,ξk) + θf(λθk,t,ξk) g(θ) k,t (θk,t+1 θk,t) + (g(θ) k,t) (θk,t+1 θk,t) + Lθ,θ 2 θk,t+1 θk,t 2 f(λθk,t,ξk) + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 + Lθ,θ 2 θk,t+1 θk,t 2 + (g(θ) k,t) (θk,t+1 θk,t) + Lθ,θ 2 θk,t+1 θk,t 2 f(λθk,t,ξk) + (g(θ) k,t) (θk,t+1 θk,t) + Lθ,θ θk,t+1 θk,t 2 + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 (ii) f(λθk,t,ξk) + (g(θ) k,t) (θ(δ) k,t θk,t) + Lθ,θ θ(δ) k,t θk,t 2 + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 f(λθk,t,ξk) + θf(λθk,t,ξk) (θ(δ) k,t θk,t) Lθ,θ 2 θ(δ) k,t θk,t 2 + g(θ) k,t θf(λθk,t,ξk) (θ(δ) k,t θk,t) + 3Lθ,θ 2 θ(δ) k,t θk,t 2 + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 (iii) f(λθ(δ) k,t,ξk) + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 + Lθ,θ 2 θ(δ) k,t θk,t 2 2 θ(δ) k,t θk,t 2 + 1 2Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 (iv) (1 δ)f(λθk,t,ξk) + δ min θ Θ f(λθ,ξk) + 1 Lθ,θ θf(λθk,t,ξk) g(θ) k,t 2 + 4Lθ,θℓ2 λ 1δ2, (99) where (i) and (iii) use Lθ,θ-smoothness of f(λ ,ξk) based on Proposition 3, (ii) uses the update rule (14) with stepsize α = 1 2Lθ,θ which implies that θk,t+1 arg minθ Θ (g(θ) k,t) (θ θk,t) + Lθ,θ θ θk,t 2 , (iv) uses Eq. (98) and the following inequality. θ(δ) k,t θk,t (i) ℓλ 1 λθ(δ) k,t,ξk λθk,t,ξk = ℓλ 1δ λθ k,t,ξk λθk,t,ξk (ii) where (i) uses Assumption 4 and (ii) uses Lemma 4. Rearranging Eq. (99) and taking conditional expectation, we obtain that E f(λθk,t+1,ξk) min θ Θ f(λθ,ξk) ξk (1 δ)E f(λθk,t,ξk) min θ Θ f(λθ,ξk) ξk + 1 Lθ,θ E θf(λθk,t,ξk) g(θ) k,t 2 ξk +4Lθ,θℓ2 λ 1δ2 (i) (1 δ)E f(λθk,t,ξk) min θ Θ f(λθ,ξk) ξk + 4Lθ,θℓ2 λ 1δ2 + E(θ) 1 Lθ,θ , where (i) uses Eq. (43) in Proposition 9 and E(θ) 1 defined in Eq. (95) with j = 1. Then the convergence rate (97) can be proved by iterating the above inequality as follows. N.2 Convergence Rate of E[ eΦ(eξ) 2] from the First Original Phase Lemma 8. Implement the first original phase of Algorithm 1 with stepsizes α = 1 2Lθ,θ and β = K . The inner projected stochastic gradient descent step (14) is implemented up to precision ϵ0 > 0 as follows. E f(λθk,ξk) min θ Θ f(λθ,ξk) ξk ϵ0. (100) Then, the output eξ of the first original phase has the following convergence rate. E eΦ(eξ) 2 8f 8E Φ(ξ0) Kβ + 10Lξ,ξβℓ2 ξ + 20Lξ,ξϵ0 + 20E(ξ) 2 , (101) where E(ξ) 2 is defined in Eq. (96) with j = 2. Proof. For any fixed ξ Ξ, define the optimal policy parameter θ (ξ) and the optimal utility value Φ(ξ) as follows. θ (ξ) : arg min θ Θ f(λθ,ξ) (102) Φ(ξ) := min θ Θ f(λθ,ξ) = f(λθ (ξ),ξ). (103) Since f(λθ, ) is Lξ,ξ-smooth for any θ Θ based on Proposition 3, for any (θ, ξ) Θ Ξ, f(λθ,ξ ) Lξ,ξ 2 ξ ξ 2 is a concave function of ξ . As a result, Φ(ξ ) Lξ,ξ ξ ξ 2 is a Lξ,ξ-strongly concave function of ξ and thus it has the following unique maximizer. ξ (ξ) := arg max ξ Ξ Φ(ξ ) Lξ,ξ ξ ξ 2 (104) Accordingly, we define the following Moreau envelope function (repeat Eq. (18)). eΦ(ξ) := max ξ Ξ Φ(ξ ) Lξ,ξ ξ ξ 2 = Φ[ξ (ξ)] Lξ,ξ ξ (ξ) ξ 2. (105) Based on Lemma 3.6 of [31], eΦ is Lξ,ξ-smooth with eΦ(ξ) = 2Lξ,ξ[ξ (ξ) ξ] (106) Similar to Lemma D.3 of [31], we obtain the following ascent property of the above envelope function eΦ for any k = 0, 1, . . . , K 1. 
E eΦ(ξk+1) ξk (i) E Φ[ξ (ξk)] Lξ,ξ ξ (ξk) projΞ ξk + βg(ξ) k 2 ξk (ii) E Φ[ξ (ξk)] Lξ,ξ(1 + τk) ξ (ξk) projΞ ξk + β ξf(λθk,ξk) 2 Lξ,ξ(1 + τ 1 k ) projΞ ξk + β ξf(λθk,ξk) projΞ ξk + βg(ξ) k 2 ξk (iii) E Φ[ξ (ξk)] Lξ,ξ(1 + τk) ξ (ξk) ξk 2 Lξ,ξβ2(1 + τk) ξf(λθk,ξk) 2 + 2Lξ,ξβ(1 + τk) ξ (ξk) ξk, ξf(λθk,ξk) Lξ,ξβ2(1 + τ 1 k ) g(ξ) k ξf(λθk,ξk) 2 ξk (iv) Φ[ξ (ξk)] Lξ,ξ(1 + τk) ξ (ξk) ξk 2 Lξ,ξβ2(1 + τk)ℓ2 ξ + 2Lξ,ξβ(1 + τk)E h f(λθk,ξ (ξk)) f(λθk,ξk) Lξ,ξ 2 ξ (ξk) ξk 2 ξk i Lξ,ξβ2(1 + τ 1 k )E(ξ) 2 (v) Φ[ξ (ξk)] Lξ,ξ(1 + τk) ξ (ξk) ξk 2 Lξ,ξβ2(1 + τk)ℓ2 ξ + 2Lξ,ξβ(1+τk) h Φ[ξ (ξk)] Φ(ξk) ϵ0 Lξ,ξ 2 ξ (ξk) ξk 2i Lξ,ξβ2(1 + τ 1 k )E(ξ) 2 (vi) eΦ(ξk) Lξ,ξτk ξ (ξk) ξk 2 Lξ,ξβ2(1 + τk)ℓ2 ξ + 2Lξ,ξβ(1 + τk) h Lξ,ξ 2 ξ (ξk) ξk 2 ϵ0 i Lξ,ξβ2(1 + τ 1 k )E(ξ) 2 (vii) eΦ(ξk) + 1 β(1 + τk) τk eΦ(ξk) 2 Lξ,ξβ2(1 + τk)ℓ2 ξ 2Lξ,ξβ(1 + τk)ϵ0 Lξ,ξβ2(1 + τ 1 k )E(ξ) 2 , (107) where (i) uses Eqs. (15) and (105), (ii) holds for any τk > 0 whose value is to be determined later, (iii) uses contraction property of projection and ξ (ξk) Ξ, (iv) uses Propositions 3-9 and the error term E(ξ) 2 defined in Eq. (96), (v) uses Eqs. (100) and (103), (vi) uses Eqs. (104)-(105) which imply that eΦ(ξk) = Φ[ξ (ξk)] Lξ,ξ ξ (ξk) ξk 2 Φ(ξk), (vii) uses Eq. (106). Taking unconditional expectation of Eq. (107) and telescoping it over k = 0, 1, . . . , K 1 with β = 1 2Lξ,ξ K 0, 1 2Lξ,ξ , 4, we obtain that k=0 E eΦ(ξk) 2 E eΦ(ξK) eΦ(ξ0) + 5K 4 Lξ,ξβ2ℓ2 ξ + 5K 2 Lξ,ξβϵ0 + KLξ,ξβ2 1 + 2 βLξ,ξ (i) f E Φ(ξ0) + 5K 4 Lξ,ξβ2ℓ2 ξ + 5K 2 Lξ,ξβϵ0 + 5KβE(ξ) 2 2 , where (i) uses the following range of eΦ(ξ) (defined in Eq. (105)) that holds for any ξ Ξ. eΦ(ξ) max ξ Ξ Φ(ξ ) = max ξ Ξ min θ Θ f(λθ,ξ ) min θ Θ max ξ Ξ f(λθ,ξ ) = f (108) eΦ(ξ) Φ(ξ) (109) As a result, E eΦ(eξ) 2 = 1 k=0 E eΦ(ξk) 2 8f 8E Φ(ξ0) Kβ + 10Lξ,ξβℓ2 ξ + 20Lξ,ξϵ0 + 20E(ξ) 2 . N.3 Convergence of the Inner Update Step (16) of the Second Corrected Phase Next we focus on the second corrected phase which aims to solve the following minimax optimization problem (repeat Eq. (19)). min θ Θ max ξ Ξ ef(θ, ξ) := f(λθ,ξ) Lξ,ξ ξ eξ 2, (110) where eξ is obtained from {ξk}K 1 k=0 uniformly at random in the first original phase. Based on Proposition 3, it can be easily verified that ef has the following smoothness properties and ef(θ, ) is Lξ,ξ-strongly concave. θ ef(θ , ξ ) θ ef(θ, ξ) Lθ,θ θ θ + Lθ,ξ ξ ξ , (111) ξ ef(θ , ξ ) ξ ef(θ, ξ) Lξ,θ θ θ + 3Lξ,ξ ξ ξ , (112) Next, we will see the convergence rate of the projected stochastic gradient ascent steps (16) to the following optimal variable, which is unique due to strong concavity of ef(θk, ). ξ k := arg max ξ Ξ ef(θk, ξ). (113) The optimization progress of each step of Eq. (16) for k = K, K + 1, . . . , K + K 1 can be bounded as follows. E ξk,t+1 ξ k 2 θk (i) E ξk,t + a g(ξ) k,t 2Lξ,ξ(ξk,t eξ) ξ k 2 θk (ii) (1 + ck)E ξk,t + a ξ ef(θk, ξk,t) ξ k 2 θk + (1 + c 1 k )E a g(ξ) k,t 2Lξ,ξ(ξk,t eξ) ξ ef(θk, ξk,t) 2 θk (iii) = (1 + ck)E ξk,t ξ k 2 + 2a ξ ef(θk, ξk,t) ξ ef(θk, ξ k), ξk,t ξ k + a2 ξ ef(θk, ξk,t) ξ ef(θk, ξ k) 2 θk + a2(1 + c 1 k )E g(ξ) k,t ξf(λθk,ξk,t) 2 θk (iv) (1 + ck)(1 2Lξ,ξa + 9L2 ξ,ξa2)E ξk,t ξ k 2 θk + a2(1 + c 1 k )E(ξ) 3 17E ξk,t ξ k 2 θk + 2E(ξ) 3 9L2 ξ,ξ , (114) where (i) uses Eq. (16), ξ k Ξ and contraction property of projection, (ii) holds for any ck > 0 whose value will be assigned later, (iii) uses ξ ef(θk, ξ k) = 0 and the definition of ef in Eq. (110), (iv) uses Proposition 9, the error term E(ξ) 3 defined in Eq. (96) as well as the 3Lξ,ξ-smoothness and Lξ,ξ-strongly concavity of ef(θk, ) (see Eq. 
(112)), and (v) uses a = 1 9Lξ,ξ and ck = 1 17. Iterating the unconditional expectation of Eq. (114) over t = 0, 1, . . . , T 1, we obtain that E ξk ξ k 2 θk = E ξk,T ξ k 2 θk T E ξk,0 ξ k 2 θk + 17(2E(ξ) 2 ) 9L2 ξ,ξ 16 T D2 Ξ + 34E(ξ) 3 9L2 ξ,ξ . (115) where the second denotes DΞ := supξ,ξ Ξ ξ ξ as the diameter of the compact set Ξ. N.4 Convergence Rate of E[ G(θ) b (θek, ξek) 2] Since ef(θ, ) is strongly concave, it has unique maximizer eξ (θ) and the corresponding function value eΨ(θ) defined as follows. eξ (θ) := arg max ξ Ξ ef(θ, ξ) (116) eΨ(θ) := max ξ Ξ ef(θ, ξ), (117) Furthermore, since ef(θ, ) is Lξ,ξ-strongly concave and ef has the smoothness properties (111) and (112), we can easily obtain that eξ (θ) is (Lξ,θ/Lξ,ξ)-Lipschitz and eΨ is e L := Lθ,θ + Lθ,ξLξ,θ/Lξ,ξsmooth with the following gradient, following the proof of Lemma 4.3 in [31]. 2 eΨ(θ) = 1 ef[θ, eξ (θ)] = 1f(λθ,eξ (θ)). (118) Note that for any k = K, . . . , K + K 1, the projected stochastic gradient ascent step (17) satisfies θk θk bg(θ) k projΘ θk bg(θ) k θk bg(θ) k b θk+1 θk + bg(θ) k , where (i) uses θk Θ and the definition of projection and (ii) uses the stochastic gradient descent step (17). The above inequality implies that (g(θ) k ) (θk+1 θk) 1 2b θk+1 θk 2. (119) Then, we analyze the optimization progress of the potential function (117) along the projected stochastic gradient descent step (17) as follows. (i) E h eΨ(θk) + eΨ(θk) (θk+1 θk) + e L 2 θk+1 θk 2i 2 1 ef[θ, eξ (θ)] and 1f(λθ,eξ (θ)) denote gradients with respect to only the first input argument θ. (ii) = E h eΨ(θk) + 1f(λθk,eξ (θk)) g(θ) k , θk+1 θk + (g(θ) k ) (θk+1 θk) + e L 2 θk+1 θk 2i (iii) E h eΨ(θk) + 1 g(θ) k 1f(λθk,eξ (θk)) 2 + e L 2 θk+1 θk 2 1 2b θk+1 θk 2 + e L 2 θk+1 θk 2i E h eΨ(θk) + 1 g(θ) k θf(λθk,ξk) 2 + 1 θf(λθk,ξk) 1f(λθk,eξ (θk)) 2 2b e L θk+1 θk 2i (iv) EeΨ(θk) + E(θ) 4 e L + L2 θ,ξ e L E ξk eξ (θk)) 2 1 4b E θk+1 θk 2 (v) EeΨ(θk) + E(θ) 4 e L + D2 ΞL2 θ,ξ e L T + 34L2 θ,ξE(ξ) 3 9e LL2 ξ,ξ 1 4b E θk+1 θk 2, (120) where (i) uses the e L := Lθ,θ +Lθ,ξLξ,θ/Lξ,ξ-smoothness of eΨ, (ii) uses Eq. (118), (iii) uses Cauchy Schwartz inequality and Eq. (119), (iv) uses b = 1 4e L, Propositions 3-9 and the error term E(θ) 4 defined by Eq. (95), (v) uses Eq. (115). Denote G(θ) b (θk, ξk) = 1 b θk projΘ[θk b θf(λθk,ξk)] for k = K, . . . , K + K 1. Its norm can be bounded as follows. E G(θ) b (θk, ξk) 2 b2 E θk+1 θk + b G(θ) b (θk, ξk) (θk+1 θk) 2 b2 E θk+1 θk + b G(θ) b (θk, ξk) 2 + 2 b2 E (θk+1 θk) 2 b2 E projΘ[θk bg(θ) k ] projΘ[θk b θf(λθk,ξk)] 2 h E[eΨ(θk) eΨ(θk+1)] + E(θ) 4 e L + D2 ΞL2 θ,ξ e L T + 34L2 θ,ξE(ξ) 3 9e LL2 ξ,ξ (ii) 2E g(θ) k θf(λθk,ξk) 2 h E[eΨ(θk) eΨ(θk+1)] + E(θ) 4 e L + D2 ΞL2 θ,ξ e L T + 34L2 θ,ξE(ξ) 3 9e LL2 ξ,ξ (iii) 2E(θ) 4 + 8E(θ) 4 be L + 8 b E[eΨ(θk) eΨ(θk+1)] + 8D2 ΞL2 θ,ξ be L T + 272L2 θ,ξE(ξ) 3 9be LL2 ξ,ξ , (121) where (i) uses Eqs. (17) and (120) as well as G(θ) b (θk, ξk) = 1 b θk projΘ[θk b θf(λθk,ξk)] , (ii) uses Eq. (110), and (iii) uses Proposition 9 and the error term E(θ) 4 defined by Eq. (95). By rearranging the above inequality and averaging it over k = K, K + 1, . . . , K + K 1, we obtain the convergence rate of E[ G(θ) b (θek, ξek) 2] as follows. 
E[ G(θ) b (θek, ξek) 2] = 1 k=K E G(θ) b (θk, ξk) 2 2E(θ) 4 + 8E(θ) 4 be L + 8 b K E[eΨ(θK) eΨ(θK+K )] + 8D2 ΞL2 θ,ξ be L T + 272L2 θ,ξE(ξ) 3 9be LL2 ξ,ξ (i) 34E(θ) 4 + 32e L K E ef(θK, eξ (θK)) ef(θK+K , eξ (θK)) + 32D2 ΞL2 θ,ξ 16 T + 1088L2 θ,ξE(ξ) 3 9L2 ξ,ξ (ii) = 34E(θ) 4 + 32e L K E f(λθK,eξ (θK)) f(λθK+K ,eξ (θK)) + 32D2 ΞL2 θ,ξ 16 T + 1088L2 θ,ξE(ξ) 3 9L2 ξ,ξ (iii) 34E(θ) 4 + 32e L K Γ(θK) f + 32D2 ΞL2 θ,ξ 16 T + 1088L2 θ,ξE(ξ) 3 9L2 ξ,ξ , (122) where (i) uses Eqs. (116)-(117) and selects the stepsize b = 1 4e L, (ii) uses Eq. (110), and (iii) uses f := minθ Θ maxξ Ξ f(λθ,ξ) and Γ(θ) := maxξ Ξ f(λθ,ξ). N.5 Convergence Rate of E[ G(ξ) a (θek, ξek) 2] Denote ψ(ξ) := minθ Θ ef(θ, ξ) = Φ(ξ) Lξ,ξ ξ eξ 2. Then, on one hand, ψ ξ (eξ) ψ(ξk) (i) = max ξ Ξ ψ(ξ) ψ(ξk) = max ξ Ξ min θ Θ ef(θ, ξ) min θ Θ ef(θ, ξk) max ξ Ξ ef(θk, ξ) min θ Θ ef(θ, ξk) (ii) = ef(θk, ξ k) min θ Θ ef(θ, ξk) (iii) ef(θk, ξk) + 3Lξ,ξ 2 ξk ξ k 2 min θ Θ ef(θ, ξk), (123) where (i) uses Eq. (104), (ii) uses Eq. (113), (iii) uses ξ ef(θk, ξ k) = 0 at the optimal variable ξ k defined by Eq. (113) and 3Lξ,ξ-smoothness of ef(θ, ) implied by Eq. (112). On the other hand, since ef(θ, ) is Lξ,ξ-strongly concave, ψ is Lξ,ξ-strongly concave. Hence, ψ(ξk) ψ ξ (eξ) + ψ ξ (eξ) ξ (eξ) ξk Lξ,ξ ξ (eξ) ξk 2 (i) =ψ ξ (eξ) Lξ,ξ ξ (eξ) ξk 2. (124) where (i) uses ψ ξ (eξ) = 0 at the unique optimizer ξ (eξ) = arg maxξ Ξ ψ(ξ) (see Eq. (104)). Then, we have G(ξ) a (θk, ξk) 2 a2 projΞ ξk + a ξf(λθk,ξk) ξk 2 a2 projΞ ξk + a ξ ef(θk, ξk) ξk projΞ ξ k + a ξ ef(θk, ξ k) ξ k 2 a2 projΞ ξk + a ξf(λθk,ξk) projΞ ξk + a ξ ef(θk, ξk) 2 a2 projΞ ξk + a ξ ef(θk, ξk) projΞ ξ k + a ξ ef(θk, ξ k) 2 + 4 a2 ξ k ξk 2 a2 projΞ ξk + a ξf(λθk,ξk) projΞ ξk + a ξ ef(θk, ξk) 2 a2 ξk + a ξ ef(θk, ξk) ξ k + a ξ ef(θk, ξ k) 2 + 4 a2 ξ k ξk 2 + 2 ξf(λθk,ξk) ξ ef(θk, ξk) 2 (ii) 8 ξ ef(θk, ξk) ξ ef(θk, ξ k) 2 + 8 a2 ξ k ξk 2 + 4 a2 ξ k ξk 2 + 2 2Lξ,ξ(ξk eξ) 2 (iii) 72L2 ξ,ξ+ 12 ξk ξ k 2 + 16L2 ξ,ξ ξk ξ (eξ) 2 + ξ (eξ) eξ 2 (iv) 72L2 ξ,ξ + 12 ξk ξ k 2+32Lξ,ξ ef(θk, ξk)+ 3Lξ,ξ 2 ξk ξ k 2 min θ Θ ef(θ, ξk) +4 eΦ(eξ) 2 (v) = 120L2 ξ,ξ + 12 ξk ξ k 2 + 32Lξ,ξ f(λθk,ξk) min θ Θ f(λθ,ξk) + 4 eΦ(eξ) 2, (125) where (i) uses projΞ ξ k + a ξ ef(θk, ξ k) ξ k = 0 based on Eq. (113), (ii) uses the definition of ef given by Eq. (110), (iii) uses 3Lξ,ξ-smoothness of ef(θk, ) based on Eq. (112), (iv) uses Eqs. (106), (123) and (124), (v) uses Eq. (110). Taking expectation of the above Eq. (125) and averaging it over k = K, . . . , K + K 1, we obtain the convergence rate of E[ G(ξ) a (θek, ξek) 2] as follows. E[ G(ξ) a (θek, ξek) 2] 120L2 ξ,ξ + 12 k=K E ξk ξ k 2 + 4E eΦ(eξ) 2 k=K E f(λθk,ξk) min θ Θ f(λθ,ξk) (i) 120L2 ξ,ξ + 12 T D2 Ξ + 34E(ξ) 2 9L2 ξ,ξ + 4 8f 8E Φ(ξ0) Kβ + 10Lξ,ξβℓ2 ξ + 20Lξ,ξϵ0 + 20E(ξ) 2 2ℓλ 1(b Lθ,θ + 1) + bℓθ E G(θ) b (θ, ξ) (ii) 120L2 ξ,ξ + 12 T D2 Ξ + 34E(ξ) 2 9L2 ξ,ξ i + 32f 32E Φ(ξ0) Kβ + 40Lξ,ξβℓ2 ξ + 80Lξ,ξϵ0 + 80E(ξ) 2 + 32Lξ,ξ 2ℓλ 1(b Lθ,θ + 1) + bℓθ 34E(θ) 4 + 32e L K Γ(θK) f + 32D2 ΞL2 θ,ξ 16 T + 1088L2 θ,ξE(ξ) 3 9L2 ξ,ξ where (i) uses Eqs. (20), (101) and (115), (ii) uses Eq. (122), (iii) uses ℓθ, ℓξ = O[(1 γ) 2], Lθ,ξ, Lξ,ξ, e L = O[(1 γ) 3] based on Proposition 3 and selects stepsizes a = 1 9Lξ,ξ = O[(1 γ)3], b = 1 4e L = O[(1 γ)3], β = 1 2Lξ,ξ K = O[K 1/2(1 γ)3]. N.6 Substituting Hyperparameters Denote δ = min 5760Lξ,ξLθ,θℓ2 λ 1 , 1 2 = O[(1 γ)6ϵ2]. Then we select the following hyperpa- rameter values. 
K =36ϵ 4 64Lξ,ξ f E[Φ(ξ0)] + 20ℓ2 ξ 2 = O[(1 γ) 8ϵ 4] (127) 3δ log n 1440Lξ,ξϵ 2E h Γ(θ0) min θ Θ,ξ Ξ f(λθ,ξ) io = O hlog[(1 γ) 1ϵ 1] K = 294912e LL2 ξ,ξ e L2ϵ4 Γ(θK) f 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 = O[(1 γ) 9ϵ 4] (129) T =33 log 544Lξ,ξDΞLθ,ξ e L 1ϵ 2 2ℓλ 1 Lθ,θ+4e L +ℓθ =O log[(1 γ) 1ϵ 1] (130) α = 1 2Lθ,θ (131) β = 1 2Lξ,ξ a = 1 9Lξ,ξ , (133) 4e L , (134) m(1) θ = 17280Lξ,ξℓ2 πθℓ2 λ Lθ,θδϵ2(1 γ)4 = O[(1 γ) 10ϵ 4], (135) m(1) λ =17280Lξ,ξℓ2 πθL2 λ|S||A| Lθ,θδϵ2(1 γ)4 = O[(1 γ) 10ϵ 4], (136) H(1) λ = 1 2 log(γ 1) log h17280Lξ,ξℓ2 πθL2 λ|S||A| Lθ,θδϵ2(1 γ)4 i = O hlog[(1 γ) 1ϵ 1] H(1) θ = 4 1 γ log 2 1 γ + 1 log(γ 1) log h 51840Lξ,ξℓ2 πθℓ2 λ Lθ,θδϵ2(1 γ)4 i = O hlog[(1 γ) 1ϵ 1] m(2) ξ = 148512ℓ2 λℓ2 pξ ϵ2(1 γ)4 = O[(1 γ) 4ϵ 2], (139) m(2) λ = 148512L2 λℓ2 pξ|S||A| ϵ2(1 γ)4 = O[(1 γ) 4ϵ 2], (140) H(2) λ = 1 2 log(γ 1) h148512L2 λℓ2 pξ|S||A| i = O hlog[(1 γ) 1ϵ 1] H(2) ξ = 4 1 γ log 2 1 γ + 1 log(γ 1) log h297024ℓ2 λℓ2 pξ ϵ2(1 γ)4 i = O hlog[(1 γ) 1ϵ 1] m(3) ξ = 13369344L2 θ,ξℓ2 pξℓ2 λ 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 e L2ϵ4(1 γ)4 = O[(1 γ) 10ϵ 4], (143) m(3) λ = 13369344L2 θ,ξℓ2 pξL2 λ|S||A| 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 e L2ϵ4(1 γ)4 = O[(1 γ) 10ϵ 4], (144) H(3) λ = 1 2 log(γ 1) h13369344L2 θ,ξℓ2 pξL2 λ|S||A| 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 e L2ϵ4(1 γ)4 =O hlog[(1 γ) 1ϵ 1] H(3) ξ = 4 1 γ log 2 1 γ + 1 log(γ 1) log h26738688L2 θ,ξℓ2 pξℓ2 λ 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 e L2ϵ4(1 γ)4 =O hlog[(1 γ) 1ϵ 1] m(4) θ = 3760128L2 ξ,ξℓ2 πθℓ2 λ[ 2ℓλ 1(Lθ,θ + 4e L) + ℓθ]2 e L2ϵ4(1 γ)4 = O[(1 γ) 10ϵ 4], (147) m(4) λ = 3760128L2 ξ,ξℓ2 πθL2 λ|S||A|[ 2ℓλ 1(Lθ,θ + 4e L) + ℓθ]2 e L2ϵ4(1 γ)4 = O[(1 γ) 10ϵ 4], (148) H(4) λ = 1 2 log(γ 1) h3760128L2 ξ,ξℓ2 πθL2 λ|S||A|[ 2ℓλ 1(Lθ,θ + 4e L) + ℓθ]2 e L2ϵ4(1 γ)4 =O hlog[(1 γ) 1ϵ 1] H(4) θ = 4 1 γ log 2 1 γ + 1 log(γ 1) log h7520256L2 ξ,ξℓ2 πθℓ2 λ[ 2ℓλ 1(Lθ,θ + 4e L) + ℓθ]2 e L2ϵ4(1 γ)4 =O hlog[(1 γ) 1ϵ 1] Substituting the above hyperparameter choices into Eqs. (95) and (96), we have E(θ) 1 := 3ℓ2 πθ (1 γ)4 h L2 λ|S||A| 1 m(1) λ + γ2H(1) λ + ℓ2 λ m(1) θ + ℓ2 λ[1 + H(1) θ (1 γ)]2γ2H(1) θ i 1440Lξ,ξ , (151) E(ξ) 2 := 3ℓ2 pξ (1 γ)4 h L2 λ|S||A| 1 m(2) λ + γ2H(2) λ + ℓ2 λ m(2) ξ + ℓ2 λ[1 + H(2) ξ (1 γ)]2γ2H(2) ξ i 12376, (152) E(ξ) 3 := 3ℓ2 pξ (1 γ)4 h L2 λ|S||A| 1 m(3) λ + γ2H(3) λ + ℓ2 λ m(3) ξ + ℓ2 λ[1 + H(3) ξ (1 γ)]2γ2H(3) ξ i 1114112L2 θ,ξ 2ℓλ 1 Lθ,θ + 4e L + ℓθ 2 , (153) E(θ) 4 := 3ℓ2 πθ (1 γ)4 h L2 λ|S||A| 1 m(4) λ + γ2H(4) λ + ℓ2 λ m(4) θ + ℓ2 λ[1 + H(4) θ (1 γ)]2γ2H(4) θ i 313344L2 ξ,ξ[ 2ℓλ 1(Lθ,θ + 4e L) + ℓθ]2 , (154) where we used 1 + H(k) θ (1 γ) 2γH(k) θ 2 for H(k) θ 4 1 γ log 2 1 γ (k = 1, 2, 3, 4). Lemma 7 implies that for any fixed ξ Ξ, θT obtained from the update rule (14) satisfies E f(λθT ,ξ) min θ Θ f(λθ,ξ) (i) (1 δ)T E f(λθ0,ξk) min θ Θ f(λθ,ξk) + 4Lθ,θℓ2 λ 1δ + E(θ) 1 δLθ,θ (ii) E h Γ(θ0) min θ Θ,ξ Ξ f(λθ,ξ) i exp n log(1 δ) 1 δ log h 1440Lξ,ξϵ 2E h Γ(θ0) min θ Θ,ξ Ξ f(λθ,ξ) io 1440Lξ,ξ + ϵ2 where (i) uses Lemma 7, (ii) uses Eqs. (128), (151) and δ ϵ2 5760Lξ,ξLθ,θℓ2 λ 1 , (iii) uses log(1 δ) δ for δ [0, 1/2]. Hence, the above inequality implies that ϵ0 defined by Lemma 8 satisfies ϵ0 ϵ2 480Lξ,ξ . As a result, we can prove that E G(ξ) a (θek, ξek) 2 ϵ2 and E G(θ) b (θek, ξek) 2 ϵ2 by substituting ϵ0 ϵ2 480Lξ,ξ and Eqs. (127)-(154) into the convergence rates (122) and (126). 
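To make the two-phase structure analyzed in this section concrete, the following minimal sketch (an illustration under stated assumptions, not the paper's exact Algorithm 1) implements the update pattern of Eqs. (14)-(17): phase one runs projected stochastic gradient descent-ascent on f(λθ,ξ) and draws ξ̃ uniformly from the iterates, and phase two targets the corrected objective f̃(θ, ξ) = f(λθ,ξ) − Lξ,ξ‖ξ − ξ̃‖². The callables grad_theta, grad_xi stand in for the stochastic gradients of Algorithm 3, and proj_theta, proj_xi for Euclidean projections onto Θ and Ξ; all of these names are placeholders introduced only for this sketch.

```python
import numpy as np

def two_phase_sgda(theta0, xi0, grad_theta, grad_xi, proj_theta, proj_xi,
                   alpha, beta, a, b, K, T, K2, T2, L_xixi, rng):
    # Phase 1: projected SGDA on f(lambda_{theta,xi}), Eqs. (14)-(15).
    theta, xi = theta0, xi0
    xi_iterates = []
    for _ in range(K):
        theta = theta0                       # inner loop restarts from theta0 (Lemma 7)
        for _ in range(T):
            theta = proj_theta(theta - alpha * grad_theta(theta, xi))
        xi = proj_xi(xi + beta * grad_xi(theta, xi))
        xi_iterates.append(xi)
    xi_tilde = xi_iterates[rng.integers(K)]  # uniform draw over phase-1 iterates (Lemma 8)

    # Phase 2: projected SGDA on f~(theta, xi) = f - L_xixi * ||xi - xi_tilde||^2,
    # Eqs. (16)-(17); the analysis reports a uniformly sampled phase-2 iterate.
    xi2 = xi_tilde
    for _ in range(K2):
        for _ in range(T2):
            g = grad_xi(theta, xi2) - 2.0 * L_xixi * (xi2 - xi_tilde)
            xi2 = proj_xi(xi2 + a * g)
        theta = proj_theta(theta - b * grad_theta(theta, xi2))
    return theta, xi2
```

This is only a structural sketch; in Algorithm 1 the four gradient evaluations use the batch sizes and horizons m(j), H(j) selected above, which determine the error terms E(θ)j and E(ξ)j in Eqs. (95)-(96).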
The number of samples required by Algorithm 1 is KT(m(1) λ H(1) λ + m(1) θ H(1) θ ) + K(m(2) λ H(2) λ + m(2) ξ H(2) ξ ) + K T (m(3) λ H(3) λ + m(3) ξ H(3) ξ ) + K (m(4) λ H(4) λ + m(4) θ H(4) θ ) =O[(1 γ) 8ϵ 4]O hlog[(1 γ) 1ϵ 1] i O[(1 γ) 10ϵ 4]O hlog[(1 γ) 1ϵ 1] + O[(1 γ) 8ϵ 4]O[(1 γ) 4ϵ 2]O hlog[(1 γ) 1ϵ 1] + O[(1 γ) 9ϵ 4]O log[(1 γ) 1ϵ 1] O[(1 γ) 10ϵ 4]O hlog[(1 γ) 1ϵ 1] + O[(1 γ) 9ϵ 4]O[(1 γ) 10ϵ 4]O hlog[(1 γ) 1ϵ 1] =O hlog2[(1 γ) 1ϵ 1] O Proof of Theorem 3 Note that σk = 2βkℓθ, so by the definition of Ξk := {ξ V (Ξ) : f(λθk,ξ) maxξ V (Ξ) f(λθk,ξ ) 2βkℓθ} we have f(λθk,ξ) < max ξ V (Ξ) f(λθk,ξ ) 2βkℓθ, ξ V (Ξ)/Ξk. (155) max ξ Ξ f(λθk,ξ ) f(λθk+1,ξ) (i) max ξ Ξ f(λθk,ξ ) f(λθk,ξ) 2ℓθ θk+1 θk (ii) > 2βkℓθ 2ℓθβk = 0, ξ V (Ξ)/Ξk (156) where (i) uses Eq. (10) in Proposition 3 which implies that f(λ ,ξ) is ℓθ-Lipschitz continuous, (ii) uses Eq. (155) and the update rule (22) with dk = 1. Eqs. (155) and (156) respectively imply the following two equations. Γ(θk) = max ξ Ξk f(λθk,ξ ) f(λθk,ξ k+1), (157) Γ(θk+1) = max ξ Ξk f(λθk+1,ξ ) = f(λθk+1,ξ k+1), (158) where ξ k+1 arg maxξ Ξkf(λθk+1,ξ ). Based on Proposition 7, there exists a unit descent direction edk ( edk = 1) such that 0