# Batch Policy Learning under Constraints

Hoang M. Le¹, Cameron Voloshin¹, Yisong Yue¹

¹California Institute of Technology, Pasadena, CA. Correspondence to: Hoang M. Le.

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. As part of off-policy learning, we propose a simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.

## 1. Introduction

We study the problem of policy learning under multiple constraints. Contemporary approaches to learning sequential decision making policies have largely focused on optimizing some cost objective that is easily reducible to a scalar value function. However, in many real-world domains, choosing the right cost function to optimize is often not a straightforward task. Frequently, the agent designer faces multiple competing objectives. For instance, consider the aspirational task of designing autonomous vehicle controllers: one may care about minimizing travel time while making sure the driving behavior is safe, consistent, or fuel efficient. Indeed, many such real-world applications require the primary objective function be augmented with an appropriate set of constraints (Altman, 1999).

Contemporary policy learning research has largely focused on either online reinforcement learning (RL) with a focus on exploration, or imitation learning (IL) with a focus on learning from expert demonstrations. However, many real-world settings already contain large amounts of pre-collected data generated by existing policies (e.g., existing driving behavior, power grid control policies, etc.). We thus study the complementary question: can we leverage this abundant source of (non-optimal) behavior data in order to learn sequential decision making policies with provable guarantees on both the primary objective and constraint satisfaction?

We thus propose and study the problem of batch policy learning under multiple constraints. Historically, batch RL is regarded as a subfield of approximate dynamic programming (ADP) (Lange et al., 2012), where a set of transitions sampled from the existing system is given and fixed. From an interaction perspective, one can view many online RL methods (e.g., DDPG (Lillicrap et al., 2016)) as running a growing-batch RL subroutine per round of online RL. In that sense, batch policy learning is complementary to any exploration scheme.
To the best of our knowledge, the study of constrained policy learning in the batch setting is novel. We present an algorithmic framework for learning sequential decision making policies from off-policy data. We employ multiple learning reductions to online and supervised learning, and present an analysis that relates performance in the reduced procedures to the overall performance with respect to both the primary objective and constraint satisfaction.

Constrained optimization is a well-studied problem in supervised machine learning and optimization. In fact, our approach is inspired by the work of Agarwal et al. (2018) in the context of fair classification. In contrast to supervised learning for classification, batch policy learning for sequential decision making introduces multiple additional challenges. First, setting aside the constraints, batch policy learning itself presents a layer of difficulty, and the analysis is significantly more complicated. Second, verifying whether the constraints are satisfied is no longer as straightforward as passing the training data through the learned classifier. In sequential decision making, certifying constraint satisfaction amounts to an off-policy policy evaluation problem, which is itself challenging and the subject of active research. In this paper, we develop a systematic approach to address these challenges, provide a careful error analysis, and experimentally validate our proposed algorithms. In summary, our contributions are:

- We formulate the problem of batch policy learning under multiple constraints, and present the first approach of its kind to solve this problem. The definition of constraints is general and can subsume many objectives.
- Our approach utilizes multi-level learning reductions, and we show how to instantiate it using various batch RL and online learning subroutines. We show that guarantees from the subroutines provably lift to provide end-to-end guarantees on the original constrained batch policy learning problem.
- While leveraging techniques from batch RL as a subroutine, we provide a refined theoretical analysis for general non-linear function approximation that improves upon the previously known sample complexity result (Munos & Szepesvári, 2008).
- To evaluate off-policy learning performance and constraint satisfaction, we propose a simple new technique for off-policy policy evaluation (OPE), which is used as a subroutine in our main algorithm. We show that it is competitive with other OPE methods.
- We validate our algorithm and analysis in two experimental settings. First, a simple navigation domain with a safety constraint. Second, a high-dimensional racing car domain with smooth driving and lane centering constraints.

## 2. Problem Formulation

We first introduce notation. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a bounded and closed $d$-dimensional state space. Let $\mathcal{A}$ be a finite action space. Let $c: \mathcal{X} \times \mathcal{A} \mapsto [0, \bar{C}]$ be the primary objective cost function, bounded by $\bar{C}$. Let there be $m$ constraint cost functions, $g_i: \mathcal{X} \times \mathcal{A} \mapsto [0, \bar{G}]$, each bounded by $\bar{G}$. To simplify the notation, we view the set of constraints as a vector function $g: \mathcal{X} \times \mathcal{A} \mapsto [0, \bar{G}]^m$, where $g(x, a)$ is the column vector of the individual $g_i(x, a)$. Let $p(\cdot \mid x, a)$ denote the (unknown) transition/dynamics model that maps state/action pairs to a distribution over the next state. Let $\gamma \in (0, 1)$ denote the discount factor, and let $\chi$ be the initial state distribution. We consider the discounted infinite horizon setting.
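As a concrete illustration of this notation, here is a minimal Python sketch (ours, not from the paper) of how a logged transition and a vector-valued constraint cost might be represented. The names `Transition`, `make_constraint_cost`, and `in_error_region` are illustrative assumptions, and the batch dataset itself is introduced formally in Section 2.1 below.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    """One logged sample (x, a, x', c, g_1:m) of the kind collected in a batch dataset."""
    x: np.ndarray       # state in X, a bounded and closed subset of R^d
    a: int              # action from the finite action space A
    x_next: np.ndarray  # successor state drawn from p(. | x, a)
    c: float            # primary cost c(x, a) in [0, C_bar]
    g: np.ndarray       # constraint costs (g_1(x, a), ..., g_m(x, a)), each in [0, G_bar]

def make_constraint_cost(in_error_region, g_bar=1.0):
    """Return a vector constraint cost g(x, a) with m = 2 illustrative components:
    an indicator safety cost (cf. Section 2.2) and a generic bounded per-step cost."""
    def g(x: np.ndarray, a: int) -> np.ndarray:
        g_safety = 1.0 if in_error_region(x) else 0.0
        g_other = min(float(np.linalg.norm(x)), g_bar)   # any cost bounded by G_bar works here
        return np.array([g_safety, g_other])
    return g
```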
An MDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, c, g, p, \gamma, \chi)$. A policy $\pi \in \Pi$ maps states to actions, i.e., $\pi(x) \in \mathcal{A}$. The value function $C^\pi: \mathcal{X} \mapsto \mathbb{R}$ corresponding to the primary cost function $c$ is defined in the usual way: $C^\pi(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(x_t, a_t) \mid x_0 = x\right]$, over the randomness of the policy $\pi$ and the transition dynamics $p$. We similarly define the vector-valued value function for the constraint costs, $G^\pi: \mathcal{X} \mapsto \mathbb{R}^m$, as $G^\pi(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t g(x_t, a_t) \mid x_0 = x\right]$. Define $C(\pi)$ and $G(\pi)$ as the expectation of $C^\pi(x)$ and $G^\pi(x)$, respectively, over the distribution $\chi$ of initial states.

### 2.1. Batch Policy Learning under Constraints

In batch policy learning, we have a pre-collected dataset, $D = \{(x_i, a_i, x_i', c(x_i, a_i), g_{1:m}(x_i, a_i))\}_{i=1}^n$, generated from (a set of) historical behavioral policies denoted jointly by $\pi_D$. The goal of batch policy learning under constraints is to learn a policy $\pi \in \Pi$ from $D$ that minimizes the primary objective cost while satisfying $m$ different constraints:

$$\min_{\pi \in \Pi} \; C(\pi) \quad \text{s.t.} \quad G(\pi) \le \tau, \tag{OPT}$$

where $G(\cdot) = [G_1(\cdot), \ldots, G_m(\cdot)]^\top$ and $\tau \in \mathbb{R}^m$ is a vector of known constants. We assume that (OPT) is feasible. However, the dataset $D$ might be generated from multiple policies that violate the constraints.

### 2.2. Examples of Policy Learning with Constraints

**Counterfactual & Safe Policy Learning.** In conventional online RL, the agent needs to re-learn from scratch when the cost function is modified. Our framework enables counterfactual policy learning, assuming the ability to compute the new cost objective from the same historical data. A simple example is safe policy learning (García & Fernández, 2015). Define the safety cost $g(x, a) = \varphi(x, a, c)$ as a new function of the existing cost $c$ and features associated with the current state-action pair. The goal here is to counterfactually avoid undesirable behaviors observed in the historical data. We experimentally study this safety problem in Section 5. Other examples from the literature that belong to this safety perspective include planning under chance constraints (Ono et al., 2015; Blackmore et al., 2011). The constraint here is $G(\pi) = \mathbb{E}[\mathbb{1}(x \in \mathcal{X}_{\text{error}})] = P(x \in \mathcal{X}_{\text{error}}) \le \tau$.

**Multi-objective Batch Learning.** Traditional policy learning (RL or IL) presupposes that the agent optimizes a single cost function. In reality, we may want to satisfy multiple objectives that are not easily reducible to a scalar objective function. One example is learning fast driving policies under multiple behavioral constraints such as smooth driving and lane keeping consistency (see Section 5).

### 2.3. Equivalence between Constraint Satisfaction and Regularization

Our constrained policy learning framework accommodates several existing regularized policy learning settings. Regularization typically encodes prior knowledge, and has been used extensively in the RL and IL literature to improve learning performance. Many instances of regularized policy learning can be naturally cast into (OPT):

- Entropy-regularized RL (Haarnoja et al., 2017; Ziebart, 2010) maps to the policy-dependent constraint cost $g(x) = \mathcal{H}(\pi(\cdot \mid x))$, where $\mathcal{H}$ measures conditional entropy.¹
- Conservative policy improvement (Levine & Abbeel, 2014; Schulman et al., 2015; Achiam et al., 2017) is equivalent to $G(\pi) = D_{\mathrm{KL}}(\pi, \pi_k)$, where $\pi_k$ is some well-behaving policy.
- Smooth imitation learning (Le et al., 2016) is equivalent to $G(\pi) = \min_{h \in \mathcal{H}} D(h, \pi)$, where $\mathcal{H}$ is a class of provably smooth policies and $D$ is a divergence metric.
- Regularizing RL with expert demonstrations (Hester et al., 2018) is equivalent to $G(\pi) = \mathbb{E}[\ell(\pi(x), \pi^*(x))]$, where $\pi^*$ is the expert policy.

¹ The constraint value function $G(\pi)$ can be viewed as an expectation over the discounted state visitation distribution. The lack of an explicit discount rate does not interfere with our overall approach.

We provide further equivalence derivations of the above examples in Appendix A. Of course, some problems are more naturally described using the regularization perspective, while others using constraint satisfaction. More generally, one can establish the equivalence between regularized and constrained policy learning via a simple appeal to Lagrangian duality, as shown in Proposition 2.1 below. This Lagrangian duality also has a game-theoretic interpretation (Section 5.4 of Boyd & Vandenberghe (2004)), which serves as an inspiration for developing our approach.

**Proposition 2.1.** Let $\Pi$ be a convex set of policies. Let $C: \Pi \mapsto \mathbb{R}$ and $G: \Pi \mapsto \mathbb{R}^m$ be value functions. Consider the two policy optimization tasks:

- Regularization: $\min_{\pi \in \Pi} \; C(\pi) + \lambda^\top G(\pi)$
- Constraint: $\min_{\pi \in \Pi} \; C(\pi) \;\; \text{s.t.} \;\; G(\pi) \le \tau$

Assume that Slater's condition is satisfied in the Constraint problem (i.e., $\exists \pi$ s.t. $G(\pi) < \tau$). Assume also that the constraint cannot be removed without changing the optimal solution, i.e., $\inf_{\pi \in \Pi} C(\pi) < \inf_{\pi \in \Pi: G(\pi) \le \tau} C(\pi)$. Then there exists $\lambda > 0$ such that Regularization and Constraint share the same optimal solutions, and vice versa (for a given $\lambda > 0$, there exists a corresponding $\tau$). (Proof in Appendix A.)

## 3. Proposed Approach

To make use of strong duality, we first convexify the policy class $\Pi$ by allowing stochastic combinations of policies, which effectively expands $\Pi$ into its convex hull $\mathrm{Conv}(\Pi)$. Formally, $\mathrm{Conv}(\Pi)$ contains randomized policies, which we denote $\pi = \sum_{t=1}^{T} \alpha_t \pi_t$ for $\pi_t \in \Pi$ and $\sum_{t=1}^{T} \alpha_t = 1$. Executing a mixed $\pi$ consists of first sampling one policy $\pi_t$ from $\pi_{1:T}$ according to the distribution $\alpha_{1:T}$, and then executing $\pi_t$. Note that we still have $\mathbb{E}[\pi] = \sum_{t=1}^{T} \alpha_t \mathbb{E}[\pi_t]$ for any first-moment statistic of interest (e.g., state distribution, expected cost). It is easy to see that the augmented version of (OPT) over $\mathrm{Conv}(\Pi)$ has a solution at least as good as the original (OPT). As such, to lighten the notation, we will equate $\Pi$ with its convex hull for the rest of the paper.

### 3.1. Meta-Algorithm

The Lagrangian of (OPT) is $L(\pi, \lambda) = C(\pi) + \lambda^\top (G(\pi) - \tau)$ for $\lambda \in \mathbb{R}^m_+$. Clearly (OPT) is equivalent to the min-max problem $\min_{\pi \in \Pi} \max_{\lambda \in \mathbb{R}^m_+} L(\pi, \lambda)$. We assume (OPT) is feasible and that Slater's condition holds (otherwise, we can simply increase the constraint threshold $\tau$ by a tiny amount). Slater's condition and policy class convexification ensure that strong duality holds (Boyd & Vandenberghe, 2004), and (OPT) is also equivalent to the max-min problem $\max_{\lambda \in \mathbb{R}^m_+} \min_{\pi \in \Pi} L(\pi, \lambda)$. Since $L(\pi, \lambda)$ is linear in both $\lambda$ and $\pi$ (due to the stochastic mixture²), strong duality is also a consequence of von Neumann's celebrated convex-concave minimax theorem for zero-sum games (Von Neumann & Morgenstern, 2007). From a game-theoretic perspective, the problem becomes finding the equilibrium of a two-player game between the $\pi$ player and the $\lambda$ player (Freund & Schapire, 1999).

² This places no restrictions on the individual policies: each individual policy can be non-linear, and the cost function can be non-convex.

**Algorithm 1** Meta-algorithm for Batch Constrained Learning

1. **for** each round $t$ **do**
2. $\quad \pi_t \leftarrow \text{Best-response}(\lambda_t)$
3. $\quad \widehat{\pi}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \pi_{t'}$, $\quad \widehat{\lambda}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \lambda_{t'}$
4. $\quad L_{\max} = \max_{\lambda} L(\widehat{\pi}_t, \lambda)$
5. $\quad L_{\min} = L(\text{Best-response}(\widehat{\lambda}_t), \widehat{\lambda}_t)$
6. $\quad$ **if** $L_{\max} - L_{\min} \le \omega$ **then**
7. $\quad\quad$ **Return** $\widehat{\pi}_t$
8. $\quad$ **end if**
9. $\quad \lambda_{t+1} \leftarrow \text{Online-algorithm}(\pi_1, \ldots, \pi_{t-1}, \pi_t)$
10. **end for**
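For concreteness, the following Python sketch mirrors the loop above. It is only an illustration of the control flow: `best_response`, `online_algorithm`, `lagrangian_max`, and `lagrangian_min` stand in for the abstract subroutines (batch RL for the $\pi$-player, a no-regret learner for the $\lambda$-player), and the names and the `max_rounds` safeguard are ours, not part of the algorithm.

```python
import numpy as np

def meta_algorithm(best_response, online_algorithm, lagrangian_max, lagrangian_min,
                   lam_init, omega, max_rounds=100):
    """Sketch of Algorithm 1: a repeated game between a policy player and a lambda player.

    best_response(lam)      -> policy minimizing L(pi, lam) = C(pi) + lam . (G(pi) - tau)
    online_algorithm(hist)  -> next lambda from a no-regret online learner
    lagrangian_max(pi_mix)  -> max over lambda of L(pi_mix, lambda)  (exact, as assumed in Prop. 3.1)
    lagrangian_min(lam_avg) -> L(Best-response(lam_avg), lam_avg)
    """
    policies, lambdas = [], [np.asarray(lam_init, dtype=float)]
    pi_mix = []
    for t in range(1, max_rounds + 1):
        lam_t = lambdas[-1]
        policies.append(best_response(lam_t))            # line 2

        pi_mix = list(policies)                          # uniform mixture over pi_1..pi_t (line 3)
        lam_avg = np.mean(lambdas, axis=0)               # running average of lambda_1..lambda_t (line 3)

        L_max = lagrangian_max(pi_mix)                   # line 4
        L_min = lagrangian_min(lam_avg)                  # line 5
        if L_max - L_min <= omega:                       # lines 6-8: duality gap below threshold
            return pi_mix

        lambdas.append(online_algorithm(policies))       # line 9
    return pi_mix
```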
In this repeated game, the $\pi$ player minimizes $L(\pi, \lambda)$ given the current $\lambda$, and the $\lambda$ player maximizes it given the current (mixture over) $\pi$. We first present a meta-algorithm (Algorithm 1) that uses any no-regret online learning algorithm (for $\lambda$) and any batch policy optimization procedure (for $\pi$). At each iteration, given $\lambda_t$, the $\pi$-player runs Best-response to obtain the best response:

$$\text{Best-response}(\lambda_t) = \arg\min_{\pi \in \Pi} L(\pi, \lambda_t) = \arg\min_{\pi \in \Pi} \; C(\pi) + \lambda_t^\top (G(\pi) - \tau).$$

This is equivalent to a standard batch reinforcement learning problem where we learn a policy that is optimal with respect to the cost $c + \lambda_t^\top g$. The corresponding mixed strategy is the uniform distribution over all previous $\pi_t$. In response to the $\pi$ player, the $\lambda$ player employs Online-algorithm, which can be any no-regret algorithm satisfying

$$\sum_{t} L(\pi_t, \lambda_t) \ge \max_{\lambda} \sum_{t} L(\pi_t, \lambda) - o(T).$$

Finally, the algorithm terminates when the estimated primal-dual gap is below a threshold $\omega$ (Lines 7-8). Leaving aside (for the moment) issues of generalization, Algorithm 1 is guaranteed to converge assuming: (i) Best-response gives the best single policy in the class, and (ii) $L_{\max}$ and $L_{\min}$ can be evaluated exactly.

**Proposition 3.1.** Assuming (i) and (ii) above, Algorithm 1 is guaranteed to stop, and the rate of convergence depends on the regret of Online-algorithm. (Proof in Appendix B.)

### 3.2. Specific Instantiation of Meta-Algorithm

We now focus on a specific instantiation of Algorithm 1. Algorithm 2 is our main algorithm in this paper.

**Algorithm 2** Constrained Batch Policy Learning

Input: Dataset $D = \{(x_i, a_i, x_i', c_i, g_i)\}_{i=1}^n \sim \pi_D$. Online-algorithm parameters: $\ell_1$-norm bound $B$, learning rate $\eta$.

1. Initialize $\lambda_1 = \left(\frac{B}{m+1}, \ldots, \frac{B}{m+1}\right) \in \mathbb{R}^{m+1}$
2. **for** each round $t$ **do**
3. $\quad$ Learn $\pi_t \leftarrow \text{FQI}(c + \lambda_t^\top g)$ $\quad$ // FQI with cost $c + \lambda_t^\top g$
4. $\quad$ Evaluate $\widehat{C}(\pi_t) \leftarrow \text{FQE}(\pi_t, c)$ $\quad$ // Algorithm 3 with policy $\pi_t$, cost $c$
5. $\quad$ Evaluate $\widehat{G}(\pi_t) \leftarrow \text{FQE}(\pi_t, g)$ $\quad$ // Algorithm 3 with policy $\pi_t$, cost $g$
6. $\quad$ $\widehat{\pi}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \pi_{t'}$
7. $\quad$ $\widehat{C}(\widehat{\pi}_t) \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \widehat{C}(\pi_{t'})$, $\quad \widehat{G}(\widehat{\pi}_t) \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \widehat{G}(\pi_{t'})$
8. $\quad$ $\widehat{\lambda}_t \leftarrow \frac{1}{t}\sum_{t'=1}^{t} \lambda_{t'}$
9. $\quad$ Learn $\widetilde{\pi} \leftarrow \text{FQI}(c + \widehat{\lambda}_t^\top g)$ $\quad$ // FQI with cost $c + \widehat{\lambda}_t^\top g$
10. $\quad$ Evaluate $\widehat{C}(\widetilde{\pi}) \leftarrow \text{FQE}(\widetilde{\pi}, c)$, $\quad \widehat{G}(\widetilde{\pi}) \leftarrow \text{FQE}(\widetilde{\pi}, g)$
11. $\quad$ $\widehat{L}_{\max} = \max_{\lambda, \|\lambda\|_1 = B} \; \widehat{C}(\widehat{\pi}_t) + \lambda^\top \left[ (\widehat{G}(\widehat{\pi}_t) - \tau)^\top, 0 \right]^\top$
12. $\quad$ $\widehat{L}_{\min} = \widehat{C}(\widetilde{\pi}) + \widehat{\lambda}_t^\top \left[ (\widehat{G}(\widetilde{\pi}) - \tau)^\top, 0 \right]^\top$
13. $\quad$ **if** $\widehat{L}_{\max} - \widehat{L}_{\min} \le \omega$ **then**
14. $\quad\quad$ **Return** $\widehat{\pi}_t$
15. $\quad$ **end if**
16. $\quad$ Set $z_t = \left[ (\widehat{G}(\pi_t) - \tau)^\top, 0 \right]^\top \in \mathbb{R}^{m+1}$
17. $\quad$ $\lambda_{t+1}[i] = B \, \dfrac{\lambda_t[i]\, e^{\eta z_t[i]}}{\sum_j \lambda_t[j]\, e^{\eta z_t[j]}} \;\; \forall i$ $\quad$ // $\lambda[i]$: the $i$-th coordinate
18. **end for**

**Policy Learning.** We instantiate Best-response with Fitted Q Iteration (FQI), a model-free off-policy learning approach (Ernst et al., 2005). FQI relies on a series of reductions to supervised learning. The key idea is to approximate the true action-value function $Q^*$ by a sequence $\{Q_k \in \mathcal{F}\}_{k=0}^{K}$, where $\mathcal{F}$ is a chosen function class. In Lines 3 & 9, $\text{FQI}(c + \lambda^\top g)$ is defined as follows. With $Q_0$ randomly initialized, for each $k = 1, \ldots, K$, we form a new training dataset $\widetilde{D}_k = \{((x_i, a_i), y_i)\}_{i=1}^n$ where

$$\forall i: \; y_i = (c_i + \lambda^\top g_i) + \gamma \min_{a} Q_{k-1}(x_i', a),$$

and $(x_i, a_i, x_i', c_i, g_i) \in D$ (the original dataset). A supervised regression procedure is then called to solve for

$$Q_k = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i, a_i) - y_i \right)^2.$$

FQI returns the policy $\pi_K = \arg\min_{a} Q_K(\cdot, a)$.
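A minimal Python/NumPy sketch of this fitted-Q reduction is shown below, written against any regressor with a scikit-learn-style `fit`/`predict` interface. It is an illustration under our own simplifying assumptions (integer action indices, a flattened state-action featurization via `feats`, zero initialization in place of random initialization of $Q_0$), not the authors' implementation; replacing the min over actions in the target with $Q_{k-1}(x', \pi(x'))$ turns the same loop into the FQE procedure of Algorithm 3 below.

```python
import numpy as np

def fitted_q_iteration(regressor_factory, X, A, X_next, cost, n_actions, gamma, K):
    """Sketch of FQI(c + lambda^T g): `cost` is the already-scalarized per-sample cost.

    regressor_factory() -> fresh regressor with .fit(features, targets) and .predict(features).
    X, A, X_next, cost  -> arrays of shape (n, d), (n,), (n, d), (n,) from the batch dataset D.
    """
    n = X.shape[0]

    def feats(states, actions):
        # One simple featurization: concatenate the state with a one-hot action encoding.
        onehot = np.eye(n_actions)[actions]
        return np.hstack([states, onehot])

    Q = None  # Q_0: predictions default to zero below (a simplification of "randomly initialized")
    for _ in range(K):
        if Q is None:
            next_min = np.zeros(n)
        else:
            # min_a Q_{k-1}(x', a), computed by scoring every action at every next state.
            all_q = np.stack([Q.predict(feats(X_next, np.full(n, a, dtype=int)))
                              for a in range(n_actions)], axis=1)
            next_min = all_q.min(axis=1)
        y = cost + gamma * next_min          # regression targets for iteration k
        Q = regressor_factory()
        Q.fit(feats(X, A), y)                # supervised projection back onto the class F

    def policy(x):
        # Greedy policy pi_K(x) = argmin_a Q_K(x, a).
        scores = [Q.predict(feats(x[None, :], np.array([a])))[0] for a in range(n_actions)]
        return int(np.argmin(scores))

    return Q, policy
```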
FQI has been shown to work well in several empirical domains: spoken dialogue systems (Pietquin et al., 2011), physical robotic soccer (Riedmiller et al., 2009), cart-pole swing-up (Riedmiller, 2005), and clinical treatment (Prasad et al., 2017). Another possible model-free subroutine is Least-Squares Policy Iteration (LSPI) (Lagoudakis & Parr, 2003). One can also consider model-based alternatives (Ormoneit & Sen, 2002).

**Off-policy Policy Evaluation.** A crucial difference between constrained policy learning and existing work on constrained supervised learning is the technical challenge of evaluating the objective and constraints. First, estimating $\widehat{L}(\pi, \lambda)$ (Lines 11-12) requires estimating $\widehat{C}(\pi)$ and $\widehat{G}(\pi)$. Second, any gradient-based approach to Online-algorithm requires passing in $\widehat{G}(\pi) - \tau$ as part of the gradient estimate (Line 15). This is the off-policy policy evaluation (OPE) problem: we need to evaluate $\widehat{C}(\pi)$ and $\widehat{G}(\pi)$ having access only to data $D \sim \pi_D$.

There are three main contemporary approaches to OPE: (i) importance sampling (IS) (Precup et al., 2000; 2001), which is unbiased but often has high variance; (ii) regression-based direct methods (DM), which are typically model-based (Thomas & Brunskill, 2016) and can be biased but have much lower variance than IS; and (iii) doubly robust techniques (Jiang & Li, 2016; Dudík et al., 2011), which combine IS and DM.

We propose a simple model-free technique using function approximation, called Fitted Q Evaluation (FQE). FQE is based on an iterative reduction scheme similar to FQI, but for the problem of off-policy evaluation. Algorithm 3 lays out the steps. The key difference from FQI is that the min operator is replaced by $Q_{k-1}(x_i', \pi(x_i'))$ (Line 3 of Algorithm 3). Each $x_i'$ comes from the original $D$. Since we know $\pi(x_i')$, each $\widetilde{D}_k$ is well-defined. Note that FQE can be plugged in as the direct method if one wishes to augment the policy evaluation with a doubly robust technique.

**Algorithm 3** Fitted Q Evaluation: FQE$(\pi, c)$

Input: Dataset $D = \{(x_i, a_i, x_i', c_i)\}_{i=1}^n \sim \pi_D$. Function class $\mathcal{F}$. Policy $\pi$ to be evaluated.

1. Initialize $Q_0 \in \mathcal{F}$ randomly
2. **for** $k = 1, 2, \ldots, K$ **do**
3. $\quad$ Compute targets $y_i = c_i + \gamma Q_{k-1}(x_i', \pi(x_i')) \;\; \forall i$
4. $\quad$ Build training set $\widetilde{D}_k = \{((x_i, a_i), y_i)\}_{i=1}^n$
5. $\quad$ Solve a supervised learning problem: $Q_k = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (f(x_i, a_i) - y_i)^2$
6. **end for**

Output: $\widehat{C}^\pi(x) = Q_K(x, \pi(x)) \;\; \forall x$

**Online Learning Subroutine.** As $L(\pi_t, \lambda)$ is linear in $\lambda$, many online convex optimization approaches can be used for Online-algorithm. Perhaps the simplest choice is Online Gradient Descent (OGD) (Zinkevich, 2003); we include an instantiation using OGD in Appendix G. For our main Algorithm 2, similar to Agarwal et al. (2018), we use Exponentiated Gradient (EG) (Kivinen & Warmuth, 1997), which has a regret bound of $O(\sqrt{\log(m)\,T})$ instead of the $O(\sqrt{mT})$ of OGD. One can view EG as a variant of Online Mirror Descent (Nemirovsky & Yudin, 1983) with a softmax link function, or of Follow-the-Regularized-Leader with entropy regularization (Shalev-Shwartz et al., 2012).

Gradient-based algorithms generally require a bounded $\lambda$. We thus force $\|\lambda\|_1 \le B$ using a hyperparameter $B$. Solving (OPT) exactly requires $B = \infty$; we will analyze Algorithm 2 with respect to finite $B$. With some abuse of notation, we augment $\lambda$ into an $(m+1)$-dimensional vector by appending $B - \|\lambda\|_1$, and augment the constraint cost vector $g$ by appending $0$ (Lines 11, 12 & 15 of Algorithm 2).³

³ The $(m+1)$-th coordinate of $g$ is thus always satisfied. This augmentation is only necessary when executing EG.
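The $\lambda$-update of Lines 16-17 of Algorithm 2 is a standard exponentiated-gradient step over the scaled simplex $\{\lambda \ge 0 : \|\lambda\|_1 = B\}$. The NumPy sketch below is ours; in particular, the positive sign in the exponent reflects the convention that the $\lambda$-player ascends on constraint violations, as in the listing above.

```python
import numpy as np

def eg_lambda_update(lam, G_hat, tau, B, eta):
    """One Exponentiated Gradient step for the lambda player (cf. Lines 16-17 of Algorithm 2).

    lam   : current lambda, shape (m + 1,), nonnegative with sum equal to B
            (the last coordinate is the slack B - ||lambda_{1:m}||_1).
    G_hat : FQE estimates of the m constraint values for the current policy, shape (m,).
    tau   : constraint thresholds, shape (m,).
    """
    z = np.append(G_hat - tau, 0.0)           # z_t = [(G_hat - tau)^T, 0]^T: violation direction
    w = lam * np.exp(eta * z)                 # multiplicative weights: grow lambda on violated constraints
    return B * w / w.sum()                    # renormalize back onto the scaled simplex

# Example: m = 2 constraints, the first one violated.
lam = np.full(3, 2.0)                         # B = 6 split evenly, as in the initialization (line 1)
print(eg_lambda_update(lam, G_hat=np.array([0.3, 0.05]), tau=np.array([0.1, 0.1]), B=6.0, eta=0.5))
```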
## 4. Theoretical Analysis

### 4.1. Convergence Guarantee

The convergence rate of Algorithm 2 depends on the radius $B$ of the dual variables $\lambda$, the maximal constraint value $\bar{G}$, and the number of constraints $m$. In particular, we can show $O(\frac{B^2}{\omega^2})$ convergence for primal-dual gap $\omega$.

**Theorem 4.1** (Convergence of Algorithm 2). After $T$ iterations, the empirical duality gap is bounded by

$$\widehat{L}_{\max} - \widehat{L}_{\min} \le \frac{2B \log(m+1)}{\eta T} + 2\eta \bar{G}^2 B.$$

Consequently, to achieve a primal-dual gap of $\omega$, setting $\eta = \frac{\omega}{4 \bar{G}^2 B}$ will ensure that Algorithm 2 converges after $\frac{16 B^2 \bar{G}^2 \log(m+1)}{\omega^2}$ iterations (each of the two terms above is then at most $\omega/2$). (Proof in Appendix B.)

The convergence analysis of our main Algorithm 2 is an extension of the proof of Proposition 3.1, leveraging the no-regret property of the EG procedure (Shalev-Shwartz et al., 2012).

### 4.2. Generalization Guarantee of FQE and FQI

In this section, we provide sample complexity analyses for FQE and FQI as standalone procedures for off-policy evaluation and off-policy learning. We use the notion of pseudo-dimension as a capacity measure for the non-linear function class $\mathcal{F}$ (Friedman et al., 2001). The pseudo-dimension $\dim_{\mathcal{F}}$, which naturally extends VC dimension to the regression setting, is defined as the VC dimension of the function class induced by the sub-level sets of functions in $\mathcal{F}$: $\dim_{\mathcal{F}} = \text{VC-dim}(\{(x, y) \mapsto \mathrm{sign}(f(x) - y) : f \in \mathcal{F}\})$. Pseudo-dimension is finite for a large class of function approximators. For example, Bartlett et al. (2017) bounded the pseudo-dimension of piecewise-linear deep neural networks (e.g., with ReLU activations) as $O(WL \log W)$, where $W$ is the number of weights and $L$ is the number of layers.

Both FQI and FQE rely on reductions to supervised learning to update the value functions. In both cases, the learned policy and the evaluation policy induce a state-action distribution different from the data-generating distribution $\mu$. We use the notion of worst-case concentration coefficient, proposed by Munos (2003), to measure the degree of distribution shift. The following standard assumption from the analysis of related ADP algorithms limits the severity of distribution shift over future time steps:

**Assumption 1** (Concentrability coefficient of future state-action distribution). (Munos, 2003; 2007; Munos & Szepesvári, 2008; Antos et al., 2008a;b; Lazaric et al., 2010; 2012; Farahmand et al., 2009; Maillard et al., 2010) Let $P^\pi$ be the operator acting on $f: \mathcal{X} \times \mathcal{A} \mapsto \mathbb{R}$ such that $(P^\pi f)(x, a) = \int_{\mathcal{X}} f(x', \pi(x')) \, p(dx' \mid x, a)$. Given the data-generating distribution $\mu$ and initial state distribution $\chi$, for $m \ge 0$ and an arbitrary sequence of stationary policies $\{\pi_m\}_{m \ge 1}$, define the concentration coefficient

$$\beta_\mu(m) = \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d\left(\chi P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}\right)}{d\mu} \right\|_\infty.$$

We assume $\beta_\mu = (1 - \gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} \beta_\mu(m) < \infty$.

This assumption is valid for a fairly large class of MDPs (Munos, 2007). For instance, $\beta_\mu$ is finite for any finite MDP, or any infinite state-space MDP with bounded transition density.⁴ Having a finite concentration coefficient is equivalent to the top-Lyapunov exponent $\Gamma \le 0$ (Bougerol & Picard, 1992), which means the underlying stochastic system is stable. We show below a simple sufficient condition for Assumption 1 (albeit stronger than necessary).

⁴ This assumption ensures sufficient data diversity, even when the executing policy is deterministic. An example of how learning can fail without this assumption is based on the combination lock MDP (Koenig & Simmons, 1996). In this deterministic MDP, $\beta_\mu$ can grow arbitrarily large, and we need an exponential number of samples for both FQE and FQI. See Appendix D.

**Example 4.1.** Consider an MDP such that, for any non-stationary distribution $\rho$, the marginals over states satisfy $\frac{\rho_x(x)}{\mu_x(x)} \le L$ (i.e., the transition dynamics are sufficiently stochastic), and $\exists M: \forall x, a: \mu(a \mid x) > \frac{1}{M}$ (i.e., the behavior policy is sufficiently exploratory). Then $\beta_\mu \le LM$.
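To see where the bound in Example 4.1 comes from (a short verification under the stated assumptions, ours rather than taken from the paper's appendix): for deterministic policies, the $m$-step future state-action distribution has density $\rho_x(x)\,\mathbb{1}\{a = \pi_m(x)\}$, while $\mu$ has density $\mu_x(x)\,\mu(a \mid x)$; the stochastic-policy case is analogous since $\pi_m(a \mid x) \le 1$. Hence

$$\frac{d\left(\chi P^{\pi_1} \cdots P^{\pi_m}\right)}{d\mu}(x, a) = \frac{\rho_x(x)\,\mathbb{1}\{a = \pi_m(x)\}}{\mu_x(x)\,\mu(a \mid x)} \le L \, M, \qquad \beta_\mu = (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} \beta_\mu(m) \le (1-\gamma)^2 \cdot \frac{LM}{(1-\gamma)^2} = LM,$$

using $\sum_{m \ge 1} m \gamma^{m-1} = \frac{1}{(1-\gamma)^2}$.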
Recall that for a given policy $\pi$, the Bellman (evaluation) operator is defined as $(T^\pi Q)(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} Q(x', \pi(x')) \, p(dx' \mid x, a)$. In general, $T^\pi f$ may not belong to $\mathcal{F}$ for $f \in \mathcal{F}$. For FQE (and FQI), the main operation in the algorithm is to iteratively project $T^\pi Q_{k-1}$ back onto $\mathcal{F}$ via $Q_k = \arg\min_{f \in \mathcal{F}} \|f - T^\pi Q_{k-1}\|$. The performance of both FQE and FQI thus depends on how well the function class $\mathcal{F}$ approximates the Bellman operator. We measure the ability of the function class $\mathcal{F}$ to approximate the Bellman evaluation operator via the worst-case Bellman error:

**Definition 4.1** (inherent Bellman evaluation error). Given a function class $\mathcal{F}$ and a policy $\pi$, the inherent Bellman evaluation error of $\mathcal{F}$ is defined as $d^\pi_{\mathcal{F}} = \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|f - T^\pi g\|_\pi$, where $\|\cdot\|_\pi$ is the $\ell_2$ norm weighted by the state-action distribution induced by $\pi$.

We are now ready to state the generalization bound for FQE:

**Theorem 4.2** (Generalization error of FQE). Under Assumption 1, for $\epsilon > 0$ and $\delta \in (0, 1)$, after $K$ iterations of Fitted Q Evaluation (Algorithm 3), for $n = O\!\left( \frac{\bar{C}^4}{\epsilon^2} \left( \log \frac{K}{\delta} + \dim_{\mathcal{F}} \log \frac{\bar{C}^2}{\epsilon^2} + \log \dim_{\mathcal{F}} \right) \right)$, we have with probability $1 - \delta$:

$$\left| C(\pi) - \widehat{C}(\pi) \right| \le \frac{\gamma^{1/2}}{(1-\gamma)^{3/2}} \left( \sqrt{\beta_\mu} \, \left(2 d^\pi_{\mathcal{F}} + \epsilon\right) + \frac{2 \gamma^{K/2} \bar{C}}{(1-\gamma)^{1/2}} \right).$$

This result shows a dependency on $\epsilon$ of $\widetilde{O}(\frac{1}{\epsilon^2})$, compared to $\widetilde{O}(\frac{1}{\epsilon^4})$ for other related ADP algorithms (Munos & Szepesvári, 2008; Antos et al., 2008b). The price that we pay is a multiplicative constant of 2 in front of the inherent error $d^\pi_{\mathcal{F}}$. The error from the second term on the right-hand side decays exponentially with the number of iterations $K$. The proof is in Appendix E.

We can show an analogous generalization bound for FQI. While FQI has been widely used, to the best of our knowledge a complete analysis of FQI for non-linear function approximation has not been previously reported.⁵

**Definition 4.2** (inherent Bellman optimality error; Munos & Szepesvári, 2008). Recall that the Bellman optimality operator is defined as $(TQ)(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} \min_{a' \in \mathcal{A}} Q(x', a') \, p(dx' \mid x, a)$. Given a function class $\mathcal{F}$, the inherent Bellman error is defined as $d_{\mathcal{F}} = \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|f - Tg\|_\mu$, where $\|\cdot\|_\mu$ is the $\ell_2$ norm weighted by $\mu$, the state-action distribution induced by $\pi_D$.

**Theorem 4.3** (Generalization error of FQI). Under Assumption 1, for $\epsilon > 0$ and $\delta \in (0, 1)$, after $K$ iterations of Fitted Q Iteration, for $n = O\!\left( \frac{\bar{C}^4}{\epsilon^2} \left( \log \frac{K}{\delta} + \dim_{\mathcal{F}} \log \frac{\bar{C}^2}{\epsilon^2} + \log \dim_{\mathcal{F}} \right) \right)$, we have with probability $1 - \delta$:

$$\left| C^* - C(\pi_K) \right| \le \frac{2\gamma}{(1-\gamma)^3} \left( \sqrt{\beta_\mu} \, \left(2 d_{\mathcal{F}} + \epsilon\right) + 2 \gamma^{K/2} \bar{C} \right),$$

where $\pi_K$ is the policy acting greedily with respect to the returned function $Q_K$. (Proof in Appendix F.)

⁵ FQI for continuous action spaces from (Antos et al., 2008a) is a variant of fitted policy iteration and not the version of FQI under consideration. The appendix of (Lazaric & Restelli, 2011) contains a proof of FQI specifically for linear function classes.

### 4.3. End-to-End Generalization Guarantee

We are ultimately interested in the test-time performance and constraint satisfaction of the policy returned by Algorithm 2. We now connect the previous analyses from Theorems 4.1, 4.2 & 4.3 into an end-to-end error analysis. Since Algorithm 2 uses FQI and FQE as subroutines, the inherent Bellman error terms $d_{\mathcal{F}}$ and $d^\pi_{\mathcal{F}}$ will enter our overall performance bound.
Estimating the inherent Bellman error caused by function approximation is not possible in general (Chapter 11 of Sutton & Barto (2018)). We assume the existence of a sufficiently expressive $\mathcal{F}$ that can make $d_{\mathcal{F}}$ and $d^\pi_{\mathcal{F}}$ arbitrarily small. To simplify our end-to-end analysis, consider $d_{\mathcal{F}} = 0$ and $d^\pi_{\mathcal{F}} = 0$, i.e., the function class $\mathcal{F}$ is closed under application of the Bellman operators.

**Assumption 2** (Bellman operator realizability). We consider function classes $\mathcal{F}$ sufficiently rich so that $\forall f: Tf \in \mathcal{F}$ and $T^\pi f \in \mathcal{F}$ for the policies $\pi$ returned by Algorithm 2.

With Assumptions 1 & 2, we have the following error bound:

**Theorem 4.4** (Generalization guarantee of Algorithm 2). Let $\pi^*$ be the optimal policy to (OPT). Denote $\bar{V} = \bar{C} + B\bar{G}$. Let $K$ be the number of iterations of FQE and FQI. Let $\widehat{\pi}$ be the policy returned by Algorithm 2, with termination threshold $\omega$. For $\epsilon > 0$ and $\delta \in (0, 1)$, when $n = O\!\left( \frac{\bar{V}^4}{\epsilon^2} \left( \log \frac{K(m+1)}{\delta} + \dim_{\mathcal{F}} \log \frac{\bar{V}^2}{\epsilon^2} + \log \dim_{\mathcal{F}} \right) \right)$, we have with probability at least $1 - \delta$:

$$C(\widehat{\pi}) \le C(\pi^*) + \omega + \frac{(4 + B)\,\gamma}{(1-\gamma)^3} \left( \sqrt{\beta_\mu}\,\epsilon + 2\gamma^{K/2} \bar{V} \right),$$

$$G(\widehat{\pi}) \le \tau + \frac{2\bar{V} + \omega}{B} + \frac{(4 + B)\,\gamma}{B\,(1-\gamma)^3} \left( \sqrt{\beta_\mu}\,\epsilon + 2\gamma^{K/2} \bar{V} \right).$$

The proof is in Appendix C. This result guarantees that, upon termination of Algorithm 2, the true performance on the main objective can be arbitrarily close to that of the optimal policy. At the same time, each constraint will be approximately satisfied with high probability, assuming sufficiently large $B$ and $K$, and sufficiently small $\epsilon$.

## 5. Empirical Analysis

We perform experiments on two different domains: a grid-world domain (from OpenAI's FrozenLake) under a safety constraint, and a challenging high-dimensional car racing domain (from OpenAI's CarRacing) under multiple behavior constraints. We seek to answer the following questions in our experiments: (i) whether the empirical convergence behavior of Algorithm 2 is consistent with our theory; and (ii) how the returned policy performs with respect to the main objective and constraint satisfaction. Appendix H includes a more detailed discussion of our experiments.

### 5.1. Frozen Lake

**Environment & Data Collection.** The environment is an 8x8 grid. The agent has 4 actions, N, S, E, W, at each state. The main goal is to navigate from a starting position to the goal.
Each episode terminates when the agent reaches the goal or falls into a hole. The main cost function is defined as $c = -1$ if the goal is reached, and $c = 0$ everywhere else. We simulate a non-optimal data gathering policy $\pi_D$ by adding random sub-optimal actions to the shortest-path policy from any given state to the goal. We run $\pi_D$ for 5000 trajectories to collect the behavior dataset $D$ (with the constraint cost measurement specified below).

**Counterfactual Safety Constraint.** We augment the main objective $c$ with a safety constraint cost defined as $g = 1$ if the agent steps into a hole, and $g = 0$ otherwise. We set the constraint threshold $\tau = 0.1$, roughly 75% of the accumulated constraint cost of the behavior policy $\pi_D$. The threshold can be interpreted as a counterfactually acceptable probability with which we allow the learned policy to fail.

**Results.** The empirical primal-dual gap $\widehat{L}_{\max} - \widehat{L}_{\min}$ in Figure 1 (left) quickly decreases toward the optimal gap of zero. The convergence is fast and monotonic, supporting the behavior predicted by our theory. The test-time performance in Figure 1 (middle) shows that the safety constraint is always satisfied, while the main objective cost also smoothly converges to the optimal value achieved by an online RL baseline (DQN) trained without regard to the constraint. The returned policy significantly outperforms the data gathering policy $\pi_D$ on both the main and the safety cost.

### 5.2. Car Racing

**Environment & Data Collection.** The car racing environment (OpenAI) is a high-dimensional domain where the state is a $96 \times 96 \times 3$ tensor of raw pixels. The action space $\mathcal{A} = \{\text{steering}, \text{gas}, \text{brake}\}$ is discretized into 12 actions. The goal in this episodic task is to traverse over 95% of the track, measured by a given number of tiles as a proxy for distance coverage. The agent receives a reward (negative cost) for each unique tile crossed and no reward if the agent is off-track. A small positive cost applies at every time step, with a maximum horizon of 1000 steps per episode. With these costs given by the environment, one can train an online RL agent using DDQN (Van Hasselt et al., 2016). We collect 5000 trajectories from DDQN's randomization, resulting in a dataset $D$ with 94000 transition tuples.

**Fast Driving under Behavioral Constraints.** We study the problem of minimizing the environment cost subject to two behavioral constraints: smooth driving and lane centering. The first constraint $G_0$ approximates smooth driving via $g_0(x, a) = 1$ if $a$ contains a braking action, and $0$ otherwise. The second constraint cost $g_1$ measures the distance between the agent and the center of the lane at each time step. This is a highly challenging setup since the three objectives and constraints are in direct conflict with one another; e.g., fast driving encourages the agent to cut corners and to apply frequent brakes to make turns. Outside of this work, we are not aware of previous work on policy learning with two or more constraints in high-dimensional settings.

**Baseline and Procedure.** As a naïve baseline, DDQN achieves low cost, but exhibits non-smooth driving behavior (see our supplementary videos). We set the threshold for each constraint to 75% of the DDQN benchmark. We also compare against regularized batch RL algorithms (Farahmand et al., 2009), specifically regularized LSPI. We instantiate our subroutines, FQE and FQI, with multi-layered CNNs. We augment LSPI's linear policy with non-linear features derived from a well-performing FQI model.

**Results.** The returned mixture policy from our algorithm achieves low main objective cost, comparable with an online RL policy trained without regard to the constraints. After several initial iterations violating the braking constraint, the returned policy, corresponding to the appropriate $\lambda$ trade-off, satisfies both constraints while improving the main objective. The improvement over the data gathering policy is significant for both constraints and the main objective.

Regularized policy learning is an alternative approach to (OPT) (Section 2). We provide the regularized LSPI baseline with the same set of $\lambda$ found by our algorithm for one-shot regularized learning (Figure 2, left & middle). While regularized LSPI obtains good performance on the main objective, it does not achieve acceptable constraint satisfaction. By default, regularized policy learning requires parameter-tuning heuristics. In principle, one can perform a grid search over a range of parameters to find the right combination; we include such an example for both regularized LSPI and FQI in Appendix H. However, since our objective and constraints are in conflict, the main objective and constraint satisfaction of policies returned by one-shot regularized learning are sensitive to step changes in $\lambda$.
In contrast, our approach is systematic, and avoids the curse of dimensionality of the brute-force search that comes with multiple constraints.

In practice, one may wish to deterministically extract a single policy from the returned mixture for execution. A de-randomized policy can be obtained naturally from our algorithm by selecting the best policy according to the existing FQE estimates of the individual Best-response policies.

### 5.3. Off-Policy Evaluation

The off-policy evaluation performed by FQE is critical for updating policies in our algorithm, and is ultimately responsible for certifying constraint satisfaction. While other OPE methods can also be used in place of FQE, we find that the estimates from popular methods are not sufficiently accurate in a high-dimensional setting. As a standalone comparison, we select an individual policy and compare FQE against PDIS (Precup et al., 2000), DR (Jiang & Li, 2016), and WDR (Thomas & Brunskill, 2016) with respect to the constraint cost evaluation. To compare both accuracy and data-efficiency, for each domain we randomly sample different subsets of the dataset $D$ (from 10% to 100% of transitions, 30 trials each). Figure 1 (right) and Figure 2 (right) illustrate the difference in quality. In the Frozen Lake domain, FQE performs competitively with the top baseline methods (DR and WDR), converging to the true value estimate as the data subsample grows close to 100%. In the high-dimensional car domain, FQE significantly outperforms the other methods.

*Figure 1. Frozen Lake results. (Left) Empirical duality gap of Algorithm 2 vs. the optimal gap. (Middle) Comparison of the returned policy and others w.r.t. (top) main objective value and (bottom) safety constraint value. (Right) FQE vs. other OPE methods on a standalone basis.*

*Figure 2. Car Racing results. (Left & Middle; lower is better) Comparison of our algorithm, regularized LSPI, online RL without constraints, and the behavior policy $\pi_D$ w.r.t. the main cost objective and the two constraints. (Right) FQE vs. other OPE methods on a standalone basis.*

## 6. Other Related Work

**Constrained MDP (CMDP).** Among the most important techniques for solving CMDPs are the Lagrangian approach and solving the dual LP via occupation measures (Altman, 1999). However, these approaches require a known MDP and a small state dimension so that solving the LP is tractable. More recently, the constrained policy optimization (CPO) approach of Achiam et al. (2017) learns a policy when the model is not initially known. The focus of CPO is on online safe exploration, and it is thus not directly comparable to our setting. Other approaches (Cheng et al., 2019; Dalal et al., 2018) address safe exploration by building the constraint directly into the policy.

**Multi-objective Reinforcement Learning (MORL).** Approaches to MORL have largely focused on approximating the Pareto frontier that trades off competing objectives (Van Moffaert & Nowé, 2014; Roijers et al., 2013). The underlying approach to MORL frequently relies on linear or non-linear scalarization of rewards to heuristically turn the problem into a standard RL problem. Our proposed approach represents another systematic paradigm for solving MORL, whether in batch or online settings.

## 7. Discussion and Conclusion

Our implementation complies with the steps laid out in Algorithm 2. In very large scale or high-dimensional problems, one could consider a noisy-update version of both policy learning and evaluation.
We leave the theoretical and practical exploration of this extension to future work. In our high-dimensional domain with long horizon, our proposed FQE algorithm for OPE achieves strong results. More extensive comparisons between FQE and other contemporary OPE methods deserve further study.

We have presented a systematic approach for batch policy learning under multiple constraints. Our problem formulation can accommodate general definitions of constraints, as partly illustrated by our experiments. We provide guarantees for our algorithm with respect to both the main objective and constraint satisfaction. Our empirical results show the promise of making constrained batch policy learning applicable to real-world domains, where behavior data is abundant.

## References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31, 2017.

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, 2018.

Altman, E. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

Antos, A., Szepesvári, C., and Munos, R. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9–16, 2008a.

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008b.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2017), 2017.

Blackmore, L., Ono, M., and Williams, B. C. Chance-constrained optimal path planning with obstacles. IEEE Transactions on Robotics, 27(6):1080–1094, 2011.

Bougerol, P. and Picard, N. Strict stationarity of generalized autoregressive processes. The Annals of Probability, pp. 1714–1730, 1992.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Cheng, R., Verma, A., Orosz, G., Chaudhuri, S., Yue, Y., and Burdick, J. W. Control regularization for reduced variance reinforcement learning. In International Conference on Machine Learning, 2019.

Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 1097–1104. Omnipress, 2011.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farahmand, A. M., Ghavamzadeh, M., Mannor, S., and Szepesvári, C. Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448, 2009.

Farajtabar, M., Chow, Y., and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.

Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning. Springer, 2001.

García, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
Guo, Z., Thomas, P. S., and Brunskill, E. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2492–2501, 2017.

Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

Haussler, D. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.

Henaff, M., Canziani, A., and LeCun, Y. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705, 2019.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Osband, I., et al. Deep Q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267–274, 2002.

Kivinen, J. and Warmuth, M. K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Koenig, S. and Simmons, R. G. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22(1-3):227–250, 1996.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pp. 45–73. Springer, 2012.

Lazaric, A. and Restelli, M. Transfer from multiple MDPs. In Advances in Neural Information Processing Systems, pp. 1746–1754, 2011.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. In Proceedings of the 27th International Conference on Machine Learning, pp. 615–622, 2010.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13(Oct):3041–3074, 2012.

Le, H. M., Kang, A., Yue, Y., and Carr, P. Smooth imitation learning for online sequence prediction. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 680–688. JMLR.org, 2016.

Lee, W. S., Bartlett, P. L., and Williamson, R. C. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132, 1996.

Levine, S. and Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079, 2014.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371, 2018.
Maillard, O.-A., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-sample analysis of Bellman residual minimization. In Proceedings of the 2nd Asian Conference on Machine Learning, pp. 299–314, 2010.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Montgomery, W. H. and Levine, S. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pp. 4008–4016, 2016.

Munos, R. Error bounds for approximate policy iteration. In ICML, volume 3, pp. 560–567, 2003.

Munos, R. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Oh, J., Guo, Y., Singh, S., and Lee, H. Self-imitation learning. In International Conference on Machine Learning, 2018.

Ono, M., Pavone, M., Kuwata, Y., and Balaram, J. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.

Ormoneit, D. and Sen, S. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.

Pietquin, O., Geist, M., Chandramohan, S., and Frezza-Buet, H. Sample-efficient batch reinforcement learning for dialogue management optimization. ACM Transactions on Speech and Language Processing (TSLP), 7(3):7, 2011.

Prasad, N., Cheng, L.-F., Chivers, C., Draugelis, M., and Engelhardt, B. E. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.

Precup, D., Sutton, R. S., and Singh, S. P. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann Publishers Inc., 2000.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417–424. Morgan Kaufmann Publishers Inc., 2001.

Riedmiller, M. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Riedmiller, M., Gabel, T., Hafner, R., and Lange, S. Reinforcement learning for robot soccer. Autonomous Robots, 27(1):55–73, 2009.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Swaminathan, A. and Joachims, T. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755, 2015.
Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, pp. 5. Phoenix, AZ, 2016.

Van Moffaert, K. and Nowé, A. Multi-objective reinforcement learning using sets of Pareto dominating policies. Journal of Machine Learning Research, 15(1):3483–3512, 2014.

Von Neumann, J. and Morgenstern, O. Theory of Games and Economic Behavior (commemorative edition). Princeton University Press, 2007.

Ziebart, B. D. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, CMU, 2010.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.