Batch Reinforcement Learning with Hyperparameter Gradients

Byung-Jun Lee* (1), Jongmin Lee* (1), Peter Vrancx (2), Dongho Kim (2), Kee-Eung Kim (2,3)

*Equal contribution. (1) School of Computing, KAIST, Daejeon, South Korea. (2) PROWLER.io. (3) Graduate School of AI, KAIST, Daejeon, South Korea. Correspondence to: Byung-Jun Lee, Jongmin Lee, Kee-Eung Kim. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

We consider the batch reinforcement learning problem, where the agent must learn from a fixed batch of data without further interaction with the environment. In such a scenario, we want to prevent the optimized policy from deviating too much from the data collection policy, since the estimation otherwise becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly results in a policy that merely follows the data collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, Batch Optimization of Policy and Hyperparameter (BOPAH), which performs gradient-based optimization of the hyperparameter using held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms in tabular and continuous control tasks by finding a good balance in the trade-off between adhering to the data collection policy and pursuing possible policy improvement.

1. Introduction

In many real-world applications of reinforcement learning (RL), exploratory behavior can be very costly when the agent interacts with the environment. For example, it would be unreasonable to deploy an ϵ-greedy policy for autonomous vehicles, industrial plants, or clinical treatments, let alone many other domains. A common and straightforward practice in these scenarios is to build a simulator from collected data, train the agent in the simulated environment with as much exploration as needed, and then deploy the fully optimized policy in the real environment. However, this approach requires substantial human effort, including domain expertise, to build a simulator faithful enough to warrant successful performance in the real environment.

This paper concerns such a scenario, often referred to as batch RL: optimize a policy only from a fixed batch of data, without further interaction with the environment or a high-fidelity simulator. Since the policy being optimized differs from the policy used for data collection, batch RL algorithms are mostly founded on techniques from off-policy RL, which are designed to learn when the behavior policy (the policy used to collect experience) differs from the estimation policy (the policy we aim to learn). Recent off-policy policy optimization algorithms, e.g. (Lillicrap et al., 2016; Munos et al., 2016; Haarnoja et al., 2018), achieve remarkable sample efficiency in standard benchmark tasks. However, they still assume continuous interaction with the environment, with the behavior policy improved closely together with the estimation policy. In contrast, the batch RL setting assumes the behavior policy is fixed throughout policy optimization.
This difference is considered a fundamental challenge for batch RL: as we optimize the estimation policy, it deviates further from the behavior policy, leading to severe covariate shift in the batch data. Since policy optimization typically relies on function approximators trained on data distributed according to the behavior policy, the optimization can actually degrade performance due to the inevitable generalization error of the function approximators. The challenge is thus about estimating how much we can trust the policy evaluation and safely optimizing the policy, rather than about improving sample efficiency as in standard off-policy RL scenarios, as exemplified by recent work on batch RL (Laroche et al., 2019; Fujimoto et al., 2019).

In this paper, we introduce a novel batch RL framework that uses validation data to estimate the reliability of policy updates. Model selection and hyperparameter tuning with a held-out dataset are standard practice in supervised learning but, to the best of our knowledge, have not been adapted to RL. The contribution of this paper is two-fold: first, we present a generalized KL-regularized RL framework that effectively constrains the distance of the estimation policy from the behavior policy differently per state and stabilizes the training process. Second, we present BOPAH (batch optimization of policy and hyperparameter), which optimizes the hyperparameter in the KL-regularized RL objective via the hypergradient (i.e. hyperparameter gradient) on the validation data. We present a model-based algorithm for tabular tasks as well as a model-free algorithm for continuous control tasks.

2. Related Works

The key concept in recent batch RL algorithms is to improve upon the baseline policy used for data collection. Petrik et al. (2016) consider robust policy improvement over the baseline policy in the worst-case scenario. Laroche et al. (2019) present SPIBB, which bootstraps the estimation policy with the behavior policy in state-action pairs that are not observed often enough. While both algorithms are proven safe via finite-sample bounds, the hyperparameter values that guarantee safe improvement with high probability can be too conservative to be used in practice. For continuous state and action spaces, Fujimoto et al. (2019) propose BCQ, where the actor can only perturb the behavior policy by a limited amount. Kumar et al. (2019) provide a bound on the suboptimality of an estimation policy with respect to its support, and present BEAR, which imposes an MMD constraint between the estimation policy and the behavior policy. Siegel et al. (2020) use a KL constraint and propose ABM, which applies advantage-weighting when estimating the behavior policy from the batch data, filtering out trajectories that would lead to worse performance than the current policy. These methods have been empirically shown to robustly improve over the baseline policy in continuous control tasks. However, the hyperparameters that control the risk play a crucial role in performance and must be hand-tuned in practice.

Reinforcement learning with KL regularization (Todorov, 2007; Kappen et al., 2012; Schulman et al., 2017; Fox et al., 2016; Galashov et al., 2019) or KL constraints (Schulman et al., 2015; Achiam et al., 2017; Sun et al., 2018) has been extensively studied. Theoretical analyses similar to the bounds we propose have also been presented in a model-based RL context (Sun et al., 2018; Janner et al., 2019).
State-independent KL regularization has also been employed in batch RL (Jaques et al., 2019; Wu et al., 2019), with hand-tuned hyperparameters. Our generalization of KL-regularized RL to state-dependent regularization provides a unique insight into how such regularization applies to batch RL. Agarwal et al. (2020) demonstrate that recent off-policy deep RL algorithms, trained on sufficiently large and diverse offline datasets without correcting for the distribution mismatch, can result in high-quality policies. In contrast, the critical assumption we make in this paper is that the data collection policy is determined by domain-specific requirements rather than selected freely to mitigate problems related to batch RL. In this situation, addressing the distribution mismatch is essential.

Lastly, we adopt the gradient-based hyperparameter optimization approach from supervised learning. When the gradient of the model selection criterion with respect to the hyperparameters is available, gradient-based hyperparameter optimization (Bengio, 2000; Maclaurin et al., 2015) can efficiently optimize a large number of hyperparameters, outperforming Bayesian optimization (Pedregosa, 2016). We show how hypergradient computations can be formulated in terms of policy evaluation in the RL context.

3. Preliminaries

We consider an environment modeled as a Markov decision process (MDP), $M = \langle S, A, P, R, d_0, \gamma\rangle$, where $S$ is the state space, $A$ is the action space, $P(s_{t+1}|s_t,a_t)$ is the transition probability, $r_t = R(s_t, a_t) \in \mathbb{R}$ is the immediate reward, $d_0(s) = p(s_0 = s)$ is the initial state distribution, and $\gamma \in (0,1)$ is the discount factor. We denote by $d^\pi_M(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t\, p(s_t = s \mid \pi, M)$ the discounted state marginal of policy $\pi$ in the MDP $M$. The state and action value functions of policy $\pi$ on MDP $M$ are denoted by $V^\pi_M(s)$ and $Q^\pi_M(s,a)$, respectively. We adopt the discounted return objective, i.e. $V^\pi_M(s) = \mathbb{E}_{\pi,M|s}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$ and $Q^\pi_M(s,a) = \mathbb{E}_{\pi,M|s,a}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$. We measure the performance of a policy $\pi$ on MDP $M$ by the expectation of the value function under the initial state distribution, $\rho_M(\pi) = \mathbb{E}_{s_0\sim d_0}[V^\pi_M(s_0)]$. The objective of policy optimization is to find the optimal policy $\pi^*$ that maximizes the expected discounted return, $\pi^* = \arg\max_\pi \rho_M(\pi)$. When the model $M$ is available, the value of policy $\pi$ can be computed by iteratively applying the Bellman backup operator $\mathcal{T}^\pi_M$:
$$\mathcal{T}^\pi_M Q(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[V(s')], \quad \text{where } V(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q(s,a)],$$
which is a contraction mapping and has the unique fixed point $Q^\pi_M$.

In batch RL, the agent learns from a fixed dataset of experience $D = \{(s_i, a_i, s'_i, r_i)\}_{i=1}^N$ without direct interaction with the environment. We refer to the policy $\mu$ used for data collection as the behavior policy and to the policy $\pi$ being optimized as the estimation policy, borrowing the terminology of off-policy RL. If we use only the samples in $D$ to compute the expectations in the Bellman backup operator $\mathcal{T}^\pi_M$, the resulting approximate operator $\mathcal{T}^\pi_{\hat M}$ has a unique fixed point $Q^\pi_{\hat M}$, which is the state-action value function of the maximum likelihood estimation (MLE) MDP $\hat M = \langle S, A, \hat P, R, d_0, \gamma\rangle$, where $\hat P$ is the maximum likelihood estimate of $P$ from $D$. We also use $d^\pi_{\hat M}$ to denote the discounted state marginal of a policy $\pi$ under the MLE MDP.
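To make the construction of the MLE MDP $\hat M$ concrete in the tabular case, the following minimal sketch (hypothetical helper code, not the authors' implementation) builds $\hat P$ from transition counts; the uniform fallback for unseen state-action pairs is an illustrative modeling choice, not something prescribed by the paper.

```python
import numpy as np

def build_mle_mdp(dataset, n_states, n_actions):
    """Estimate the MLE transition model P_hat from a batch of tabular transitions.

    dataset: iterable of (s, a, s_next, r) tuples with integer states/actions.
    Returns (P_hat, R_hat) where P_hat[s, a] is a distribution over next states.
    Unvisited (s, a) pairs fall back to a uniform distribution (an assumption here).
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, s_next, r in dataset:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    visits = counts.sum(axis=-1)                      # N(s, a)
    P_hat = np.where(visits[..., None] > 0,
                     counts / np.maximum(visits[..., None], 1),
                     1.0 / n_states)                  # uniform fallback for unseen (s, a)
    R_hat = reward_sum / np.maximum(visits, 1)        # empirical mean reward (the paper assumes R known)
    return P_hat, R_hat
```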
4. Generalized KL-Regularization for Batch Reinforcement Learning

In a naive approach to batch RL, we could build an MDP model $\hat M$ from the batch data and optimize the policy on the model. However, the resulting policy may fail to produce any improvement over the behavior policy, or may even perform severely worse (Petrik et al., 2016; Laroche et al., 2019). For safe RL with batch data, we first derive a policy improvement bound for model-based RL, which naturally yields a regularization function for batch RL. Our derivation starts by bounding the policy evaluation error incurred by the model error and the distribution shift of the policy.

Theorem 4.1. Let $\epsilon^\pi_M = \mathbb{E}_{s\sim d^\pi_M}\big[\mathrm{TV}^{\pi,\mu}_s\big]$, $\epsilon^\pi_{\hat M} = \mathbb{E}_{s\sim d^\pi_{\hat M}}\big[\mathrm{TV}^{\pi,\mu}_s\big]$, and $\epsilon^P_M = \mathbb{E}_{s\sim d^\mu_M,\,a\sim\mu}\big[\mathrm{TV}^{P,\hat P}_{s,a}\big]$, where $\mathrm{TV}^{p,q}_x$ denotes the total variation distance between $p(\cdot|x)$ and $q(\cdot|x)$. Suppose the reward function is bounded, $|R(s,a)| \le R_{\max}$ for all $s, a$, and is known to the agent for simplicity. For any policies $\pi$, $\mu$ and an estimated MLE MDP $\hat M$, the difference in policy evaluation is bounded by
$$|\rho_M(\pi) - \rho_{\hat M}(\pi)| \le c_1\big(\epsilon^\pi_M + \epsilon^\pi_{\hat M}\big) + c_2\,\epsilon^P_M \quad (1)$$
where $c_1 = \frac{2R_{\max}}{(1-\gamma)^2}$ and $c_2 = \frac{2\gamma R_{\max}}{(1-\gamma)^2}$.

Since the error $\epsilon^P_M$ depends only on the dataset and is not directly controllable during batch RL policy optimization, we can gather only the relevant terms and formulate the following constrained optimization, similar to (Schulman et al., 2015; Achiam et al., 2017):
$$\pi^*_\delta = \arg\max_\pi\; \rho_{\hat M}(\pi) \quad \text{s.t.}\;\; \mathbb{E}_{s\sim d^\pi_M}\big[\mathrm{KL}^{\pi,\mu}_s\big] \le \delta,\;\; \mathbb{E}_{s\sim d^\pi_{\hat M}}\big[\mathrm{KL}^{\pi,\mu}_s\big] \le \delta, \quad (2)$$
where $\mathrm{KL}^{p,q}_x$ denotes the KL-divergence between $p(\cdot|x)$ and $q(\cdot|x)$. We can then provide a policy improvement bound for $\pi^*_\delta$ in Eq. (2), derived from Theorem 4.1 together with Pinsker's inequality:

Corollary 4.1. The negative baseline regret (Petrik et al., 2016) of $\pi^*_\delta$, i.e. the performance improvement obtained by adopting $\pi^*_\delta$ instead of the baseline policy $\mu$ on the true environment $M$, is lower bounded by
$$\rho_M(\pi^*_\delta) - \rho_M(\mu) \;\ge\; \rho_{\hat M}(\pi^*_\delta) - \rho_{\hat M}(\mu) - c_1\sqrt{2\delta} - 2c_2\,\epsilon^P_M$$
with $c_1$ and $c_2$ defined in Theorem 4.1.

This lower bound has two competing factors with respect to $\delta$: increasing $\delta$ enlarges the feasible region and thus increases the objective $\rho_{\hat M}(\pi)$, but it also makes the estimate of the expected return under $\hat M$ less reliable, since the estimation error grows. Most off-policy RL and batch RL algorithms capture this trade-off with hyperparameters (Schulman et al., 2015; Petrik et al., 2016; Laroche et al., 2019).

Now, instead of using the KL-divergences as constraints in the policy optimization, we reformulate the problem as an unconstrained optimization in which the KL-divergence acts as a regularizer:
$$\tilde\rho_{\hat M}(\pi) = \rho_{\hat M}(\pi) - \alpha\Big(\mathbb{E}_{s\sim d^\pi_M}\big[\mathrm{KL}^{\pi,\mu}_s\big] + \mathbb{E}_{s\sim d^\pi_{\hat M}}\big[\mathrm{KL}^{\pi,\mu}_s\big]\Big) = \mathbb{E}_{\pi,\hat P}\bigg[\sum_{t=0}^\infty \gamma^t\Big(r_t - \alpha\Big(\tfrac{d^\pi_M(s_t)}{d^\pi_{\hat M}(s_t)} + 1\Big)\mathrm{KL}^{\pi,\mu}_{s_t}\Big)\bigg].$$
This objective is essentially a variation of KL-regularized RL (Todorov, 2007; Kappen et al., 2012; Schulman et al., 2017; Fox et al., 2016; Galashov et al., 2019) that considers the expected KL-divergence both in the true MDP and in the estimated MDP. Note that the term $\big(d^\pi_M(s_t)/d^\pi_{\hat M}(s_t) + 1\big)$ is very hard to estimate without access to the true model $M$. We thus work with the objective
$$\max_\pi\; \mathbb{E}_{\pi,\hat P}\bigg[\sum_{t=0}^\infty \gamma^t\Big(r_t - \alpha\,\beta(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\Big)\bigg] \quad (3)$$
where $\alpha$ is a state-independent hyperparameter and the $\beta(s)$ are state-dependent hyperparameters. We write $\alpha(s) = \alpha\beta(s)$ for brevity. In the later part of the paper, we automatically tune $\alpha(s)$ using a held-out validation set.

4.1. KL-Regularized Policy Iteration

In this section, we derive batch policy iteration with the generalized KL-regularization, which alternates between policy evaluation and policy improvement; this is a planning algorithm for Eq. (3) with fixed yet arbitrary hyperparameters $\alpha(s)$.
We first present the iterative policy evaluation method by defining the KL-regularized Bellman operator $\mathcal{T}^\pi_{\mathrm{KL}}$. For a fixed policy $\pi$,
$$\mathcal{T}^\pi_{\mathrm{KL}}\tilde Q(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'\sim\hat P(\cdot|s,a)}\big[\tilde V(s')\big], \quad \text{where } \tilde V(s) = -\alpha(s)\,\mathrm{KL}^{\pi,\mu}_s + \mathbb{E}_{a\sim\pi}\big[\tilde Q(s,a)\big]. \quad (4)$$

Lemma 4.1 (KL-Regularized Policy Evaluation). For a fixed $\pi$ with $\mathrm{KL}^{\pi,\mu}_s < \infty$ for all $s$, the backup operator $\mathcal{T}^\pi_{\mathrm{KL}}$ is a contraction mapping and has a unique fixed point $\tilde Q^\pi$ satisfying $\mathcal{T}^\pi_{\mathrm{KL}}\tilde Q^\pi = \tilde Q^\pi$. In other words, for any $\tilde Q_0: S\times A \to \mathbb{R}$, define $\tilde Q_{k+1} = \mathcal{T}^\pi_{\mathrm{KL}}\tilde Q_k$; then the sequence $\tilde Q_k$ converges to the KL-regularized Q-value function of $\pi$ as $k\to\infty$.

The KL-regularized value functions have the following interpretations:
$$\tilde V^\pi(s) = \mathbb{E}_{\pi|s}\bigg[\sum_{t=0}^\infty \gamma^t\big(r_t - \alpha(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\big)\bigg], \qquad \tilde Q^\pi(s,a) = \mathbb{E}_{\pi|s,a}\bigg[r_0 + \sum_{t=1}^\infty \gamma^t\big(r_t - \alpha(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\big)\bigg].$$

In the policy improvement step, we can compute an improved policy by weighting $\mu$ with the exponential of $\tilde Q^\pi$. However, when dealing with the continuous states and actions encountered in later sections, we may want to optimize the policy only within a set of tractable distributions $\Pi$ (e.g. a set of Gaussian distributions), which requires projecting the updated policy distribution onto $\Pi$. One simple way is to adopt the information projection that minimizes the KL-divergence to the target distribution, as in (Haarnoja et al., 2018):
$$\pi_{\mathrm{new}}(\cdot|s) = \arg\min_{\pi'\in\Pi}\; \mathrm{KL}\bigg(\pi'(\cdot|s)\;\bigg\|\;\frac{\mu(\cdot|s)\,\exp\big(\tilde Q^\pi(s,\cdot)/\alpha(s)\big)}{Z^\pi(s)}\bigg) \quad (5)$$
where $Z^\pi(s)$ is the normalization constant. The following lemma shows that $\pi_{\mathrm{new}}$ computed by Eq. (5) always improves the value over $\pi$.

Lemma 4.2 (KL-Regularized Policy Improvement). Given a policy $\pi\in\Pi$ and its value function $\tilde Q^\pi$, if we compute the new policy $\pi_{\mathrm{new}}$ by Eq. (5), then $\tilde Q^{\pi_{\mathrm{new}}}(s,a) \ge \tilde Q^\pi(s,a)$ for all $s, a$.

Lemmas 4.1 and 4.2 suggest a full algorithm: KL-regularized policy iteration alternates between the KL-regularized policy evaluation of Eq. (4) and the KL-regularized policy improvement of Eq. (5), and it is guaranteed to converge to the optimal policy $\pi^*$ within the set $\Pi$.

Theorem 4.2 (KL-Regularized Policy Iteration). Suppose that $|R(s,a)| \le R_{\max}$ and $\mathrm{KL}^{\pi,\mu}_s < \infty$. Starting from any $\pi_0\in\Pi$, the sequence of value functions $\tilde Q^{\pi_k}$ and improved policies $\pi_{k+1}$ converges to the optimal value function and the optimal policy $\pi^*\in\Pi$, i.e. $\lim_{k\to\infty}\tilde Q^{\pi_k}(s,a) \ge \tilde Q^{\pi}(s,a)$ for any $\pi\in\Pi$, $s\in S$, and $a\in A$.

Our theoretical results can be seen as an extension of those in entropy-regularized RL (Haarnoja et al., 2018) to KL-regularized RL, where the regularization parameter is given arbitrarily per state.
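As a concrete illustration of this procedure, here is a minimal tabular sketch (hypothetical code, assuming the MLE model $\hat P$, reward table, and behavior policy $\mu$ are given as arrays) of the KL-regularized policy iteration of Eq. (4)-(5). In the tabular case the information projection is exact, so the improvement step is simply the exponentiated-$\tilde Q$ reweighting of $\mu$.

```python
import numpy as np

def kl_regularized_policy_iteration(P_hat, R, mu, alpha, gamma=0.95,
                                    n_iters=50, eval_sweeps=200):
    """Tabular KL-regularized policy iteration (sketch of Eq. (4)-(5)).

    P_hat: (S, A, S) transition model, R: (S, A) rewards,
    mu: (S, A) behavior policy, alpha: (S,) state-dependent coefficients.
    Returns the converged policy and its KL-regularized Q-values.
    """
    S, A = R.shape
    pi = mu.copy()                                    # start from the behavior policy
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # --- KL-regularized policy evaluation: iterate the backup of Eq. (4) ---
        kl = np.sum(pi * (np.log(pi + 1e-12) - np.log(mu + 1e-12)), axis=1)  # KL(pi || mu) per state
        for _ in range(eval_sweeps):
            V = -alpha * kl + np.sum(pi * Q, axis=1)  # V_tilde(s)
            Q = R + gamma * (P_hat @ V)               # T_KL applied to Q
        # --- KL-regularized policy improvement: Eq. (5), exact in the tabular case ---
        logits = np.log(mu + 1e-12) + Q / alpha[:, None]
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, Q
```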
5. Batch Optimization of Policy and Hyperparameter (BOPAH)

In supervised learning, we commonly adopt regularization to select a model that generalizes well to unseen data. This is typically captured by a set of hyperparameters that balances approximation against generalization (i.e. overfitting vs. underfitting). Commonly, the model parameters $\theta$ are optimized using the following objective on the training data $D_{\text{train}}$:
$$\theta^*_\alpha = \arg\max_\theta\; \mathcal{L}(\theta, D_{\text{train}}) - E_\alpha(\theta) \quad (6)$$
where $\mathcal{L}$ measures how well the model fits $D_{\text{train}}$ (e.g. the log-likelihood $\mathcal{L}(\theta, D_{\text{train}}) = \log p(D_{\text{train}}|\theta)$ for a probabilistic model), and $E_\alpha$ is a regularization term that penalizes complex models to prevent overfitting (e.g. L2 regularization $E_\alpha(\theta) = \alpha\|\theta\|_2^2$), with the regularization parameter $\alpha$ balancing approximation and generalization. The regularization parameter $\alpha$ is then treated as a hyperparameter and optimized using a held-out validation dataset $D_{\text{valid}}$ that is mutually exclusive with $D_{\text{train}}$:
$$\alpha^* = \arg\max_\alpha\; \mathcal{L}(\theta^*_\alpha, D_{\text{valid}}) \quad (7)$$

Optimizing the hyperparameter is often done by simple grid or random search (Bergstra & Bengio, 2012), but these methods scale poorly with the number of hyperparameters. Instead, our approach adopts gradient-based hyperparameter optimization (Bengio, 2000; Maclaurin et al., 2015; Pedregosa, 2016), which is known to handle many hyperparameters by exploiting local information about the objective function, assuming that Eq. (7) is differentiable.

In this section, we introduce Batch Optimization of Policy and Hyperparameter (BOPAH), a method for batch reinforcement learning that aims to achieve the best possible performance improvement on the (not fully known) true environment. BOPAH adopts KL-regularization in policy optimization.

5.1. Model-based BOPAH

BOPAH starts by dividing the entire batch data $D = \{(s_i, a_i, s'_i, r_i)\}_{i=1}^N$ into two mutually exclusive sets $D_{\text{train}}$ and $D_{\text{valid}}$. The two splits yield the train (MLE) MDP $\hat M_{\text{train}}$ and the valid (MLE) MDP $\hat M_{\text{valid}}$, respectively. The policy is then optimized on the train MDP $\hat M_{\text{train}}$ with the state-dependent KL-regularization whose introduction was justified in Section 4:
$$\pi^*_\alpha = \arg\max_\pi\; \mathbb{E}_{\pi,\hat M_{\text{train}}}\bigg[\sum_{t=0}^\infty \gamma^t\big(r_t - \alpha(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\big)\bigg] \quad (8)$$
which can be solved by KL-regularized policy iteration. In Eq. (8), if the hyperparameters $\alpha(s)$ are close to zero for all $s$, the resulting policy becomes the optimal policy of the unregularized train MDP $\hat M_{\text{train}}$, which would overfit to $D_{\text{train}}$. In contrast, if the $\alpha(s)$ become too large, the policy reduces to the behavior policy $\mu$, which corresponds to underfitting. Our goal is to find the hyperparameters $\alpha(s)$ that balance these two extremes. To this end, we optimize the hyperparameters on the valid MDP $\hat M_{\text{valid}}$:
$$\alpha^* = \arg\max_\alpha\; \mathbb{E}_{\pi^*_\alpha,\hat M_{\text{valid}}}\bigg[\sum_{t=0}^\infty \gamma^t r_t\bigg] \quad (9)$$
and deploy $\pi^*_{\alpha^*}$ to the real environment. Note the similarity between our framework for batch RL, Eq. (8)-(9), and the hyperparameter optimization in supervised learning, Eq. (6)-(7).
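A minimal sketch of this two-level optimization in the tabular, single-scalar-$\alpha$ setting follows (hypothetical code; it reuses the build_mle_mdp and kl_regularized_policy_iteration sketches above and selects $\alpha$ by a simple sweep, as in the illustrative example below, rather than by the hypergradient of Section 5.2).

```python
import numpy as np

def policy_return(P, R, pi, d0, gamma=0.95):
    """Exact expected discounted return of pi on a tabular MDP (P, R, d0)."""
    S, A = R.shape
    r_pi = np.sum(pi * R, axis=1)                     # (S,)
    P_pi = np.einsum('sap,sa->sp', P, pi)             # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return d0 @ V

def model_based_bopah_scalar(train_data, valid_data, mu, d0, n_states, n_actions,
                             alphas=(1e-3, 1e-2, 3e-2, 1e-1), gamma=0.95):
    """Eq. (8)-(9) with a global scalar alpha: optimize on M_train, select on M_valid."""
    P_tr, R_tr = build_mle_mdp(train_data, n_states, n_actions)
    P_va, R_va = build_mle_mdp(valid_data, n_states, n_actions)
    best_alpha, best_pi, best_score = None, None, -np.inf
    for a in alphas:
        pi_a, _ = kl_regularized_policy_iteration(
            P_tr, R_tr, mu, alpha=np.full(n_states, a), gamma=gamma)
        score = policy_return(P_va, R_va, pi_a, d0, gamma)    # Eq. (9) on the valid MDP
        if score > best_score:
            best_alpha, best_pi, best_score = a, pi_a, score
    return best_alpha, best_pi
```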
Figure 1. Experimental results on a random MDP, where $S = \{s_1, \ldots, s_{20}\}$, $|A| = 4$, and $\alpha$ controls the degree of regularization. All results are averaged over 300 trials. Panel (a): single hyperparameter, $\alpha(s) = \alpha$ for all $s$, comparing $\rho_M(\pi^*)$, $\rho_M(\pi_{\text{Basic RL}})$, $\rho_M(\pi^*_\alpha)$, $\rho_{\hat M_{\text{train}}}(\pi^*_\alpha)$, and $\rho_{\hat M_{\text{valid}}}(\pi^*_\alpha)$ as functions of $\alpha$. Panel (b): two hyperparameters, $\alpha(s_{\text{odd}}) = \alpha_1$ and $\alpha(s_{\text{even}}) = \alpha_2$, showing performance on the train, valid, and true MDPs. In each panel, one marker denotes the value of $\alpha$ that performs best in the valid MDP and another the optimal value of $\alpha$ for the true MDP.

Illustrative Example. Figure 1 highlights our approach to batch RL using a synthetic example. We constructed a single instance of a random MDP $M$ with $|S| = 20$ and $|A| = 4$. We collected batch data $D$ consisting of 100 episodes with maximum time step 50, using the behavior policy $\mu = 0.7\pi^* + 0.3\pi_{\text{unif}}$, where $\pi^*$ is the optimal policy of $M$ and $\pi_{\text{unif}}$ is the uniform random policy. As a baseline, $\pi_{\text{Basic RL}}$ is obtained by computing the optimal policy of the MLE MDP constructed from the entire data $D$. We divide $D$ into two mutually exclusive sets $D_{\text{train}}$ and $D_{\text{valid}}$ with the same number of trajectories, and $\pi^*_\alpha$ is computed via Eq. (8) on $\hat M_{\text{train}}$, the MLE MDP built from $D_{\text{train}}$. Finally, $\rho_{\mathrm{M}}(\pi)$ denotes the (reward) performance of policy $\pi$ on MDP $\mathrm{M} \in \{M, \hat M_{\text{train}}, \hat M_{\text{valid}}\}$, i.e. $\rho_{\mathrm{M}}(\pi) = \mathbb{E}_{\pi,\mathrm{M}}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$.

Figure 1a visualizes the result when we use a global scalar hyperparameter, i.e. $\alpha(s) = \alpha$ for all $s$. As $\alpha$ increases, the performance of $\pi^*_\alpha$ in $\hat M_{\text{train}}$ monotonically decreases (cyan), since $\alpha = 0$ yields the optimal policy for $\hat M_{\text{train}}$. In contrast, the performance of $\pi^*_\alpha$ in $\hat M_{\text{valid}}$ (green) shows the expected trend, where too large or too small a value of $\alpha$ deteriorates performance. Note that the performance trend in $\hat M_{\text{valid}}$ is strongly correlated with the trend in the true model $M$. This justifies using $\hat M_{\text{valid}}$ to select $\alpha$ in order to perform well on the (unknown) true model $M$. On the other hand, note that $\pi^*_\alpha$ underperforms $\pi_{\text{Basic RL}}$ in the true model when $\alpha = 0$. This is expected, since $\pi^*_\alpha$ is obtained from $\hat M_{\text{train}}$ using less data. However, when $\alpha = \arg\max_\alpha \rho_{\hat M_{\text{valid}}}(\pi^*_\alpha)$ is used, $\pi^*_\alpha$ significantly outperforms $\pi_{\text{Basic RL}}$. This result supports the claim that batch RL can benefit from hyperparameter tuning, a common practice in supervised learning.

We further extended the experimental setting by allowing state-dependent hyperparameters $\alpha(s)$. Figure 1b demonstrates the result when using two hyperparameters instead of one: we assign $\alpha(s_{\text{odd}}) = \alpha_1$ to the states with odd index and $\alpha(s_{\text{even}}) = \alpha_2$ to the states with even index. Note that there is a strong overall correlation between $\rho_{\hat M_{\text{valid}}}(\pi^*_\alpha)$ and $\rho_M(\pi^*_\alpha)$, and that the optimal hyperparameter does not lie on the dotted yellow line corresponding to the global scalar hyperparameter. This shows that state-dependent hyperparameters can be advantageous over a global scalar hyperparameter.

5.2. Gradient-based Hyperparameter Optimization

Tuning the state-dependent hyperparameters $\alpha(s)$ via black-box optimization (e.g. grid search or random search) is intractable due to the curse of dimensionality, even for a handful of states. Thus, BOPAH adopts gradient-based hyperparameter optimization, whose analytical form is given as follows:

Theorem 5.1. Suppose that the state-dependent function $\alpha_\xi(s)$ in Eq. (8) is parameterized by $\xi$.¹ Then the hypergradient $\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi^*_\xi) = \nabla_\xi\,\mathbb{E}_{\pi^*_\xi,\hat M_{\text{valid}}}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$ is given by
$$\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi^*_\xi) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{valid}}}\bigg[\sum_{t=0}^\infty \gamma^t\,\chi_\xi(s_t)\bigg] \quad (10)$$
where
$$\pi^*_\xi = \arg\max_\pi\; \mathbb{E}_{\pi,\hat M_{\text{train}}}\bigg[\sum_{t=0}^\infty \gamma^t\big(r_t - \alpha_\xi(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\big)\bigg],$$
$$\nabla_\xi \tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{train}}|s,a}\bigg[-\sum_{t=1}^\infty \gamma^t\,\nabla_\xi\alpha_\xi(s_t)\,\mathrm{KL}^{\pi^*_\xi,\mu}_{s_t}\bigg],$$
$$\tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{train}}|s,a}\bigg[r_0 + \sum_{t=1}^\infty \gamma^t\big(r_t - \alpha_\xi(s_t)\,\mathrm{KL}^{\pi^*_\xi,\mu}_{s_t}\big)\bigg], \qquad Q^{\pi^*_\xi}_{\hat M_{\text{valid}}}(s,a) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{valid}}|s,a}\bigg[\sum_{t=0}^\infty \gamma^t r_t\bigg],$$
$$\chi_\xi(s) = \frac{1}{\alpha_\xi(s)^2}\,\mathrm{cov}_{a\sim\pi^*_\xi}\Big[\alpha_\xi(s)\,\nabla_\xi \tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a) - \nabla_\xi\alpha_\xi(s)\,\tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a),\; Q^{\pi^*_\xi}_{\hat M_{\text{valid}}}(s,a)\Big],$$
and $\mathrm{cov}_a[f(a), g(a)]_i = \mathbb{E}[f_i(a)g(a)] - \mathbb{E}[f_i(a)]\,\mathbb{E}[g(a)]$ denotes the element-wise covariance between the vector-valued $f(\cdot)$ and the scalar-valued $g(\cdot)$.

¹ For example, if $\alpha_\xi(s) = \xi$ for all $s$, so that $\nabla_\xi\alpha_\xi(s) = 1$, this reduces to the single hyperparameter $\alpha$. On the other hand, if $\alpha_\xi(s) = \xi_s$ with $\nabla_{\xi_{s_i}}\alpha_\xi(s_j) = \mathbb{1}(s_i = s_j)$, this corresponds to fully state-dependent hyperparameters $\alpha(s)$.

In the above, $\pi^*_\xi$ can be obtained by KL-regularized policy iteration, and $\nabla_\xi \tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a)$ and $\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi^*_\xi)$ can be computed by standard policy evaluation with the auxiliary reward functions $R_1(s,a) = -\nabla_\xi\alpha_\xi(s)\,\mathrm{KL}^{\pi^*_\xi,\mu}_s$ and $R_2(s,a) = \chi_\xi(s)$, respectively. Finally, we can optimize the hyperparameters via the hypergradient by iterating the following update until convergence:
$$\xi \leftarrow \xi + \eta\,\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi^*_\xi)$$
where $\xi$ is the parameter of the state-dependent function $\alpha_\xi(s)$ and $\eta$ is a learning rate.
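The following minimal tabular sketch illustrates one hypergradient-ascent step for the single-scalar-$\alpha$ case. It follows the structure of Theorem 5.1 as reconstructed above (exact policy evaluation for all auxiliary quantities, then a discounted sum of $\chi$ under the valid MDP), reuses the kl_regularized_policy_iteration sketch, and is an assumption-laden illustration rather than the authors' implementation; sign and scaling conventions follow the reconstruction.

```python
import numpy as np

def policy_kernel(P, pi):
    """State-to-state transition kernel induced by pi."""
    return np.einsum('sap,sa->sp', P, pi)

def hypergradient_scalar_alpha(P_tr, R_tr, P_va, R_va, mu, d0, xi,
                               gamma=0.95, eta=0.01):
    """One hypergradient-ascent step on a global scalar alpha = xi (tabular sketch)."""
    S, A = R_tr.shape
    alpha = np.full(S, xi)
    pi, Q_tilde = kl_regularized_policy_iteration(P_tr, R_tr, mu, alpha, gamma)

    kl = np.sum(pi * (np.log(pi + 1e-12) - np.log(mu + 1e-12)), axis=1)
    P_tr_pi, P_va_pi = policy_kernel(P_tr, pi), policy_kernel(P_va, pi)

    # grad_xi Q_tilde: policy evaluation with auxiliary reward -KL (scalar-alpha case)
    U = np.linalg.solve(np.eye(S) - gamma * P_tr_pi, -kl)      # state-value of -KL under pi
    dQ_tilde = gamma * (P_tr @ U)                              # (S, A)

    # Unregularized Q of pi on the valid MDP
    V_va = np.linalg.solve(np.eye(S) - gamma * P_va_pi, np.sum(pi * R_va, axis=1))
    Q_va = R_va + gamma * (P_va @ V_va)

    # chi(s): element-wise covariance under pi(.|s), Theorem 5.1 with alpha_xi(s) = xi
    f = xi * dQ_tilde - Q_tilde
    Ef, Eg = np.sum(pi * f, axis=1), np.sum(pi * Q_va, axis=1)
    cov = np.sum(pi * f * Q_va, axis=1) - Ef * Eg
    chi = cov / (xi ** 2)

    # Hypergradient = discounted sum of chi under the valid MDP, then ascent step
    V_chi = np.linalg.solve(np.eye(S) - gamma * P_va_pi, chi)
    hypergrad = d0 @ V_chi
    return max(xi + eta * hypergrad, 1e-6), hypergrad          # keep alpha positive
```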
This completes the description of BOPAH, which iteratively alternates between policy optimization and gradient-based hyperparameter optimization.

Remark. While the definition of $\chi_\xi(s)$ in Theorem 5.1 is hard to interpret as is, it reduces to a simple expression when $\alpha_\xi(s)$ is state-independent, which provides additional intuition about the behavior of the hypergradient. When $\alpha_\xi(s)$ is state-independent, i.e. $\alpha_\xi(s) = \xi$, the auxiliary value function $\nabla_\xi\tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a)$ becomes a simple discounted sum of KL-divergences:
$$\nabla_\xi\tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{train}}|s,a}\bigg[-\sum_{t=1}^\infty \gamma^t\,\mathrm{KL}^{\pi^*_\xi,\mu}_{s_t}\bigg]$$
which further reduces $\chi_\xi(s)$, by canceling out the regularization term of the KL-regularized value function, to
$$\chi_\xi(s) = -\frac{1}{\alpha_\xi(s)^2}\,\mathrm{cov}_{a\sim\pi^*_\xi}\Big[Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a),\; Q^{\pi^*_\xi}_{\hat M_{\text{valid}}}(s,a)\Big] \quad (12)$$
where $Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a) = \mathbb{E}_{\pi^*_\xi,\hat M_{\text{train}}|s,a}\big[\sum_{t=0}^\infty \gamma^t r_t\big]$ is the unregularized value function under the train MDP. The auxiliary reward $\chi_\xi(s)$ is now a negative covariance between the unregularized value functions of the train and valid MDPs. Consider a scenario where the learned policy only exploits reliable state-action pairs (i.e. state-action pairs that appear frequently in the dataset). In this case, the two unregularized value functions $Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a)$ and $Q^{\pi^*_\xi}_{\hat M_{\text{valid}}}(s,a)$ should be similar under the policy distribution, resulting in a positive covariance. The hypergradient, which is a discounted sum of the negative covariance, then becomes negative, $\alpha$ decreases, and the policy is regularized less so that it explores more state-action pairs. On the other hand, when the policy tries to explore uncertain state-action pairs where the two unregularized value functions differ, hypergradient ascent regularizes the policy more, finally converging to an appropriate level of regularization.

6. Actor-Critic BOPAH

In this section, we extend BOPAH to control tasks with continuous state and action spaces via practical approximations to KL-regularized policy iteration and the hypergradient computation. We achieve this by adopting the actor-critic framework, in which we train parametric models of the policy (actor), the KL-regularized value functions (critic), and the state-dependent hyperparameters. The resulting algorithm, Actor-Critic BOPAH (AC-BOPAH), thus performs alternating updates of the three parametric models.

6.1. Parametric Model of State-dependent Hyperparameters

Although in principle we could use any parametric model to represent the state-dependent hyperparameters, we focus on a linear model for $\alpha_\xi(s)$, which in particular allows for stable training of the KL-regularized value functions. To see this, suppose
$$\alpha_\xi(s) = \sum_{i=1}^{d_\xi} \xi_i\,\phi_i(s) = \xi^\top\phi(s) \quad (13)$$
where $\xi = [\xi_1, \ldots, \xi_{d_\xi}]^\top$ and $\phi: S \to \mathbb{R}^{d_\xi}$ is a feature function of states such that $\xi^\top\phi(s) \ge 0$ for all $s$ (e.g. $\xi \ge 0$ and the $\phi_i(s)$ are RBFs). The feature function $\phi(s)$ is predefined and fixed, and only $\xi$ is optimized during training.
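A minimal sketch of such a feature function and the resulting $\alpha_\xi(s)$ is given below; the Gaussian RBF form, the bandwidth, and the extra constant feature are assumptions for illustration (the paper only specifies RBFs centered at representative states, e.g. chosen by a cover tree).

```python
import numpy as np

def rbf_features(s, centers, bandwidth=1.0):
    """Gaussian RBF features phi(s) centered at representative states.

    s: (d,) state vector; centers: (k, d) RBF centers (e.g. picked by a cover tree).
    Returns phi(s) in R^(k+1), with an extra constant feature (an assumption here).
    """
    sq_dists = np.sum((centers - s) ** 2, axis=1)
    phi = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return np.concatenate([phi, [1.0]])               # bias feature

def alpha_xi(s, xi, centers, bandwidth=1.0):
    """Linear state-dependent coefficient alpha_xi(s) = xi^T phi(s), as in Eq. (13).

    Keeping xi >= 0 (e.g. by projecting after each update) ensures alpha_xi(s) >= 0,
    since all features are non-negative.
    """
    return float(np.dot(xi, rbf_features(s, centers, bandwidth)))
```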
This linear parameterization leads to the following decomposition of the KL-regularized value function:
$$\tilde Q^\pi_{\hat M_{\text{train}}}(s,a) = \underbrace{\mathbb{E}_{\pi,\hat M_{\text{train}}|s,a}\bigg[r_0 + \sum_{t=1}^\infty \gamma^t r_t\bigg]}_{Q^\pi_{\theta_{\text{train}}}(s,a)} \;-\; \xi^\top \underbrace{\mathbb{E}_{\pi,\hat M_{\text{train}}|s,a}\bigg[\sum_{t=1}^\infty \gamma^t\,\phi(s_t)\,\mathrm{KL}^{\pi,\mu}_{s_t}\bigg]}_{Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)} \quad (14)$$
Note in Eq. (14) that both $Q^\pi_{\theta_{\text{train}}}(s,a)$ and $Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)$ depend only on $\pi$ and not on $\xi$. Thus, we can train separate models for $Q^\pi_{\theta_{\text{train}}}(s,a)$ and $Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)$ instead of a single model for $\tilde Q^\pi_{\hat M_{\text{train}}}(s,a)$, making training more stable since these models are not affected by changes in the hyperparameters $\xi$. Furthermore, the auxiliary value function $\nabla_\xi\tilde Q^{\pi^*_\xi}_{\hat M_{\text{train}}}(s,a)$ in Theorem 5.1 is obtained directly from $Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)$ without introducing any further parameterized functions, since $\nabla_\xi\alpha_\xi(s) = \phi(s)$:
$$\nabla_\xi\tilde Q^\pi_{\hat M_{\text{train}}}(s,a) = -Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a) \quad (15)$$
We also train an action-value critic for the validation-set MDP, defined in the standard way:
$$Q^\pi_{\theta_{\text{valid}}}(s,a) \approx \mathbb{E}_{\pi,\hat M_{\text{valid}}|s,a}\bigg[\sum_{t=0}^\infty \gamma^t r_t\bigg] \quad (16)$$
Finally, we define state-value critics $V^\pi_{\psi_{\text{train}}}(s)$, $V^\pi_{\psi^{\mathrm{KL}}_{\text{train}}}(s)$, and $V^\pi_{\psi_{\text{valid}}}(s)$ analogously to Eq. (14) and Eq. (16), which completes our parameterization of the value functions and hyperparameters for AC-BOPAH. As for the actor $\pi_\omega(a|s)$, we use a Gaussian policy whose mean and covariance are parameterized by neural networks.

6.2. Objective Functions

We now present the objective functions used to train each parametric model of the actor and the critic with fixed hyperparameters. First, the parameters of the reward value functions, $\{\psi_{\text{train}}, \theta_{\text{train}}, \psi_{\text{valid}}, \theta_{\text{valid}}\}$, are trained by minimizing squared residual errors: for each data split $\mathrm{data} \in \{\text{train}, \text{valid}\}$,
$$J(\psi_{\mathrm{data}}) = \mathbb{E}_{s\sim D_{\mathrm{data}},\,a\sim\pi_\omega}\Big[\big(V_{\psi_{\mathrm{data}}}(s) - Q_{\theta_{\mathrm{data}}}(s,a)\big)^2\Big], \qquad J(\theta_{\mathrm{data}}) = \mathbb{E}_{(s,a,r,s')\sim D_{\mathrm{data}}}\Big[\big(Q_{\theta_{\mathrm{data}}}(s,a) - r - \gamma V_{\bar\psi_{\mathrm{data}}}(s')\big)^2\Big]$$
where $\bar\psi_{\mathrm{data}}$ is an exponential moving average of $\psi_{\mathrm{data}}$ (i.e. a soft target update). Similarly, the parameters of the KL value functions, $\{\psi^{\mathrm{KL}}_{\text{train}}, \theta^{\mathrm{KL}}_{\text{train}}\}$, are trained by minimizing:
$$J(\psi^{\mathrm{KL}}_{\text{train}}) = \mathbb{E}_{s\sim D_{\text{train}},\,a\sim\pi_\omega}\Big[\big\|V_{\psi^{\mathrm{KL}}_{\text{train}}}(s) - \big(Q_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a) + \phi(s)\,\mathrm{KL}^{\pi,\mu}_s\big)\big\|_2^2\Big], \qquad J(\theta^{\mathrm{KL}}_{\text{train}}) = \mathbb{E}_{(s,a,s')\sim D_{\text{train}}}\Big[\big\|Q_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a) - \gamma V_{\bar\psi^{\mathrm{KL}}_{\text{train}}}(s')\big\|_2^2\Big]$$
where $\bar\psi^{\mathrm{KL}}_{\text{train}}$ is an exponential moving average of $\psi^{\mathrm{KL}}_{\text{train}}$. Then, the policy parameters are optimized by minimizing the expected KL-divergence of Eq. (5), which yields:
$$J(\omega) = \mathbb{E}_{s\sim D_{\text{train}},\,a\sim\pi_\omega}\Big[\alpha_\xi(s)\,\mathrm{KL}^{\pi_\omega,\mu}_s - \big(Q_{\theta_{\text{train}}}(s,a) - \xi^\top Q_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)\big)\Big]$$
Here, we use the analytic formula for the KL-divergence $\mathrm{KL}^{\pi,\mu}_s$ and adopt the reparameterization trick for the Gaussian policy when computing the gradient, in order to reduce the variance of the stochastic gradients. We also perform conservative training with bootstrapped Q-functions and apply a gradient penalty to the critic networks; without these, the algorithm is prone to divergence during training due to overestimation for uncertain state-action pairs. Note that other batch RL algorithms (Kumar et al., 2019) use similar conservative estimation techniques to stabilize training. The gradient penalty constrains the Lipschitz constant of the critic, which prevents it from outputting extremely high values in the uncertain regions that can be reached under weak behavior-regularization during hyperparameter optimization. More technical details for the experiments can be found in Appendix E.

Iterative optimization of $\{J(\psi_{\mathrm{data}}), J(\theta_{\mathrm{data}}), J(\psi^{\mathrm{KL}}_{\text{train}}), J(\theta^{\mathrm{KL}}_{\text{train}})\}$ and $J(\omega)$ results in an actor-critic algorithm that performs approximate KL-regularized policy iteration. We denote this actor-critic algorithm with fixed hyperparameters as KLAC in the experiments.
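To make the shape of these objectives concrete, here is a minimal PyTorch-style sketch of the per-minibatch losses. All module names and signatures are assumptions for illustration, the actor and behavior policy are assumed to return torch Independent Normal distributions, and the bootstrapped-Q conservative training and gradient penalty described above are omitted.

```python
import torch
import torch.nn.functional as F

def klac_losses(batch, actor, mu, q_train, v_train, v_train_targ,
                q_kl, v_kl, v_kl_targ, phi, xi, gamma=0.99):
    """Single-minibatch losses for the KLAC critics and actor (sketch of Section 6.2).

    actor(s), mu(s): torch.distributions.Independent(Normal) over actions.
    q_*, v_*: assumed nn.Modules; phi(s): (B, d_xi) RBF features; xi: (d_xi,) tensor.
    """
    s, a, r, s_next = batch['s'], batch['a'], batch['r'], batch['s_next']
    pi_dist, mu_dist = actor(s), mu(s)
    a_pi = pi_dist.rsample()                                   # reparameterized action sample
    kl = torch.distributions.kl_divergence(pi_dist, mu_dist)   # (B,) analytic KL(pi || mu)

    # Reward critics: V regresses onto Q under the current actor; Q onto the TD target.
    j_v = F.mse_loss(v_train(s), q_train(s, a_pi).detach())
    j_q = F.mse_loss(q_train(s, a), (r + gamma * v_train_targ(s_next)).detach())

    # KL critics (vector-valued outputs of dimension d_xi).
    feat = phi(s)                                              # (B, d_xi)
    v_kl_target = q_kl(s, a_pi).detach() + feat * kl.detach().unsqueeze(-1)
    j_v_kl = F.mse_loss(v_kl(s), v_kl_target)
    j_q_kl = F.mse_loss(q_kl(s, a), (gamma * v_kl_targ(s_next)).detach())

    # Actor: alpha_xi(s) * KL - Q_tilde, with Q_tilde = Q_train - xi^T Q_KL.
    alpha_s = (feat * xi).sum(-1)                              # (B,)
    q_tilde = q_train(s, a_pi) - (q_kl(s, a_pi) * xi).sum(-1)
    j_actor = (alpha_s.detach() * kl - q_tilde).mean()

    return j_v, j_q, j_v_kl, j_q_kl, j_actor
```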
6.3. Clipped Importance Sampling for Hypergradients

Finally, we compute approximate hypergradients by exploiting the current actor and critics as approximate solutions of the KL-regularized MDP with respect to the current hyperparameter $\xi$. From Eq. (10) and Eqs. (14)-(15),
$$\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi) = \mathbb{E}_{\pi,\hat M_{\text{valid}}}\bigg[\sum_{t=0}^\infty \gamma^t\,\chi_\xi(s_t)\bigg]$$
with
$$\chi_\xi(s) = \frac{1}{(\xi^\top\phi(s))^2}\,\mathrm{cov}_{a\sim\pi}\Big[-(\xi^\top\phi(s))\,Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a) - \phi(s)\big(Q^\pi_{\theta_{\text{train}}}(s,a) - \xi^\top Q^\pi_{\theta^{\mathrm{KL}}_{\text{train}}}(s,a)\big),\; Q^\pi_{\theta_{\text{valid}}}(s,a)\Big].$$
Here, for immediate (off-policy) policy evaluation of $\pi$ with respect to the auxiliary reward $\chi_\xi(s_t)$ using the validation trajectories collected by $\mu$, we adopt clipped importance sampling. For the validation dataset $D_{\text{valid}} = \{\tau_1, \ldots, \tau_{N_\tau}\}$ with $\tau_n = \{(s^n_t, a^n_t, r^n_t, s'^n_t)\}_{t=0}^{T_n}$, we estimate the approximate hypergradient as
$$\nabla_\xi \rho_{\hat M_{\text{valid}}}(\pi) \approx \frac{1}{N_\tau}\sum_{n=1}^{N_\tau}\sum_{t=0}^{T_n} \gamma^t\, w^n_{0:t}\,\chi_\xi(s^n_t), \qquad w^n_{0:t} = \prod_{k=0}^{t}\mathrm{clip}\bigg(\frac{\pi(a^n_k|s^n_k)}{\mu(a^n_k|s^n_k)},\, w_{\min},\, w_{\max}\bigg), \quad (18)$$
and update the hyperparameter via hypergradient ascent. In practice, we alternate between $H_f$ steps of optimizing the actor-critic (the inner problem) and one step of optimizing the hyperparameter (the outer problem). We refer to this algorithm as AC-BOPAH.
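A minimal sketch of the clipped-importance-sampling estimator of Eq. (18) follows; the clipping bounds and function signatures are illustrative assumptions, not values or interfaces from the paper.

```python
import numpy as np

def clipped_is_hypergradient(valid_trajs, chi_fn, log_pi_fn, log_mu_fn,
                             gamma=0.99, w_min=0.5, w_max=2.0):
    """Approximate hypergradient of Eq. (18) from held-out trajectories (sketch).

    valid_trajs: list of trajectories, each a time-ordered list of (s, a) pairs.
    chi_fn(s): auxiliary reward chi_xi(s) as a vector of shape (d_xi,).
    log_pi_fn / log_mu_fn: log-densities of the current actor and behavior policy.
    """
    grad, n_traj = 0.0, len(valid_trajs)
    for traj in valid_trajs:
        w = 1.0                                        # running product of clipped ratios
        for t, (s, a) in enumerate(traj):
            ratio = np.exp(log_pi_fn(s, a) - log_mu_fn(s, a))
            w *= np.clip(ratio, w_min, w_max)
            grad += (gamma ** t) * w * chi_fn(s)
    return grad / n_traj

# Usage (hypergradient ascent on xi, with learning rate eta):
# xi = xi + eta * clipped_is_hypergradient(D_valid, chi_fn, log_pi, log_mu)
```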
Figure 2. Example of the learned $\alpha_\xi(s)$ in Pendulum-v0 (left panel: learned $\alpha_\xi(s)$; right panel: state visitation, both shown over the state space parameterized by the angle $\theta$). The markers denote the representative states used as RBF centers, obtained from the cover tree algorithm.

Illustrative Example. Figure 2 visualizes $\alpha_\xi(\cdot)$ optimized by AC-BOPAH in Pendulum-v0. We sampled $10^3$ episodes using a suboptimal behavior policy that was partially trained for only $10^4$ steps by soft actor-critic (SAC) (Haarnoja et al., 2018). We selected 10 representative states within the batch data via the cover tree algorithm (Beygelzimer et al., 2006) and used them as the centers of the RBFs $\phi_i(s)$ in $\alpha(s)$. Finally, we ran AC-BOPAH for $10^6$ steps. As the left heatmap in Figure 2 shows, the optimized $\alpha_\xi(\cdot)$ takes lower values in densely collected state regions and higher values in sparsely collected regions. This is a desirable result: we can safely deviate from the behavior policy in experience-rich areas of the state space, but should be conservative and fall back to the behavior policy in experience-sparse areas.

7. Experiments

7.1. Model-based BOPAH on Random MDPs

To probe how safely and efficiently BOPAH improves performance over the behavior policy with respect to the number of trajectories and the optimality of the behavior policy, we conducted repeated experiments on randomly generated MDPs. The experimental protocol follows that of (Laroche et al., 2019), and details can be found in Appendix F. In essence, we performed 10k runs, where each random MDP $M$ was created with $|S| = 50$, $|A| = 4$, $\gamma = 0.95$, and a maximum episode length of 50. The $\zeta$-optimal behavior policy $\mu$ satisfies $\rho_M(\mu) = \zeta\rho_M(\pi^*) + (1-\zeta)\rho_M(\pi_{\text{unif}})$. We compare model-based BOPAH with four algorithms: (1) Basic RL (the simple baseline introduced in Section 5.1), (2) Reward-adjusted MDP (RaMDP) (Petrik et al., 2016), (3) Robust MDP (Nilim & El Ghaoui, 2005; Iyengar, 2005), and (4) SPIBB (Laroche et al., 2019), using various numbers of trajectories and two levels of behavior-policy optimality (near-optimal and near-random). For each run, we measure the normalized performance of $\pi$,
$$\bar\rho_M(\pi) = \frac{\rho_M(\pi) - \rho_M(\mu)}{\rho_M(\pi^*) - \rho_M(\mu)} \in (-\infty, 1],$$
which measures how much the algorithm improves over the behavior policy. Finally, we report the mean (normalized) performance and the conditional value at risk (CVaR) performance; the x%-CVaR denotes the mean normalized performance of the worst x% of runs.

Figures 3a-3b present the results when the behavior policy is near-optimal ($\zeta = 0.9$): the behavior policy is nearly deterministic, so the collected trajectories cover only part of the state-action space. Figures 3c-3d present the results when the behavior policy is more randomized ($\zeta = 0.5$), so the collected trajectories cover the entire state-action space fairly well. In both settings, BOPAH with state-dependent hyperparameters consistently matched or exceeded the other algorithms, which highlights the effectiveness and robustness of our approach.

Figure 3. Results of the random MDP experiments: mean normalized performance and 5%-CVaR as functions of the number of trajectories in $D$, for behavior optimality 0.9 and 0.5, comparing the behavior policy, the optimal policy, BOPAH, BOPAH (single $\alpha$), SPIBB, Robust MDP, RaMDP, and Basic RL.

7.2. AC-BOPAH on Continuous Control Tasks

In this experiment, we evaluate the effectiveness of AC-BOPAH on continuous control tasks, using the MuJoCo environments in the OpenAI Gym (Todorov et al., 2012; Brockman et al., 2016). We first obtained the behavior policy by running SAC (Haarnoja et al., 2018) for half a million steps and then prepared a dataset of $10^3$ trajectories shared by all algorithms. For AC-BOPAH, we held out 20% of the trajectories as the validation set. We compare AC-BOPAH with two behavior cloning baselines, one with a Gaussian policy (BC) and one with a variational autoencoder (VAE-BC). We also compare with two state-of-the-art batch deep RL algorithms for continuous control, BCQ (Fujimoto et al., 2019) and BEAR-QL (Kumar et al., 2019), using their published code and hyperparameters ($\Phi = 0.05$ for BCQ and $\epsilon = 0.05$ for BEAR-QL) to obtain the experimental results.² We report two versions of our algorithm: AC-BOPAH (single $\alpha$), which uses a global hyperparameter $\alpha$, and AC-BOPAH, which uses the state-dependent hyperparameter $\alpha_\xi(s)$ with $|\xi| = 21$. Both versions are initialized with $\alpha_\xi(s) = 10^2$ for all $s$. KLAC is the KL-regularized actor-critic with $\alpha = 10^2$ held constant across all tasks.

Figure 4. Results of the continuous control experiments (including HalfCheetah-v2 and Walker2d-v2): average return versus training timesteps (in units of $10^6$), averaged over 5 trials and moving-averaged with a window size of 20; the shaded area represents the standard error. Compared methods: AC-BOPAH, AC-BOPAH (single $\alpha$), KLAC, BCQ, BEAR-QL, BC, VAE-BC, and the behavior policy.

As presented in Figure 4, AC-BOPAH consistently outperformed the state-of-the-art algorithms by large margins. While KLAC was on par with the other algorithms, AC-BOPAH made further improvements by optimizing the hyperparameters using the held-out validation set. AC-BOPAH (with state-dependent KL-regularization) shows a clear improvement over the constant-$\alpha$ version, except in the HalfCheetah-v2 domain, where the latter already achieves near-optimal performance.

² For a more thorough comparison, we also conducted additional experiments with a hyperparameter grid search for BCQ and BEAR-QL, as in (Wu et al., 2019), i.e. $\Phi \in \{0.005, 0.015, 0.05, 0.15, 0.5\}$ for BCQ and $\epsilon \in \{0.015, 0.05, 0.15, 0.5, 1.5\}$ for BEAR-QL. We observed an improvement only for BCQ on HalfCheetah (with $\Phi = 0.5$), which managed to perform better than AC-BOPAH. However, such tuning requires access to the true environment, which is not feasible in the batch RL setting.
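For completeness, here is a small sketch of the two evaluation metrics reported in Section 7.1 (normalized performance and x%-CVaR); the function names are illustrative only.

```python
import numpy as np

def normalized_performance(rho_pi, rho_mu, rho_opt):
    """Normalized performance (rho(pi) - rho(mu)) / (rho(pi*) - rho(mu)), in (-inf, 1]."""
    return (rho_pi - rho_mu) / (rho_opt - rho_mu)

def cvar(normalized_scores, x=0.05):
    """x%-CVaR: mean normalized performance of the worst x% of runs."""
    scores = np.sort(np.asarray(normalized_scores))
    k = max(1, int(np.ceil(x * len(scores))))
    return scores[:k].mean()

# Example: 5%-CVaR over the per-run normalized performances of 10,000 runs.
# cvar(scores, x=0.05)
```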
8. Conclusion

In this work, we presented the generalized KL-regularization and the BOPAH framework for batch RL, which optimizes state-dependent regularization via hypergradient ascent. We provided a formal analysis that motivates the objective used in BOPAH, and presented two concrete versions: (1) model-based BOPAH, which assumes a tabular environment and computes exact hypergradients, and (2) AC-BOPAH, which uses an actor-critic architecture to compute approximate hypergradients in more challenging continuous tasks. We empirically demonstrated that both model-based BOPAH and AC-BOPAH outperform state-of-the-art algorithms, supporting the hypothesis that batch RL can significantly benefit from hyperparameter optimization. While we introduced BOPAH with KL-regularization for batch RL in this work, we believe the BOPAH framework can be extended to other related problems in RL with limited experience data, which is an interesting direction for future work.

Acknowledgments

This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019R1A2C1087634 and NRF-2019M3F2A1072238), the Ministry of Science and ICT (MSIT) of Korea (IITP No. 2020-0-00940, IITP No. 2019-0-00075, and IITP No. 2017-0-01779 XAI), and POSCO.

References

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp. 22-31, 2017.

Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

Bengio, Y. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889-1900, August 2000.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281-305, February 2012.

Beygelzimer, A., Kakade, S., and Langford, J. Cover trees for nearest neighbor. In International Conference on Machine Learning, pp. 97-104, 2006.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Fedus, W., Rosca, M., Lakshminarayanan, B., Dai, A. M., Mohamed, S., and Goodfellow, I. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In International Conference on Learning Representations, 2018.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 202-211, 2016.

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062, 2019.

Galashov, A., Jayakumar, S., Hasenclever, L., Tirumala, D., Schwarz, J., Desjardins, G., Czarnecki, W. M., Teh, Y. W., Pascanu, R., and Heess, N. Information asymmetry in KL-regularized RL. In International Conference on Learning Representations, 2019.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769-5779, 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856-1865, 2018.

Hasselt, H. V. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613-2621, 2010.

Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, May 2005.

Janner, M., Fu, J., Zhang, M., and Levine, S. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12498-12509, 2019.

Jaques, N., Ghandeharioun, A., Shen, J. H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., and Picard, R. W. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019.

Kappen, H. J., Gómez, V., and Opper, M. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159-182, 2012.

Kodali, N., Abernethy, J., Hays, J., and Kira, Z. On convergence and stability of GANs, 2017.

Kumar, A., Fu, J., Tucker, G., and Levine, S. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pp. 11761-11771, 2019.

Laroche, R., Trichelair, P., and Des Combes, R. T. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pp. 3652-3661, 2019.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122, 2015.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054-1062, 2016.

Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780-798, September 2005.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026-4034, 2016.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737-746, 2016.

Petrik, M., Ghavamzadeh, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298-2306, 2016.

Pfau, D. and Vinyals, O. Connecting generative adversarial networks and actor-critic methods, 2016.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning, 2017.

Siegel, N., Springenberg, J. T., Berkenkamp, F., Abdolmaleki, A., Neunert, M., Lampe, T., Hafner, R., Heess, N., and Riedmiller, M. Keep doing what worked: Behavior modelling priors for offline reinforcement learning.
In International Conference on Learning Representations, 2020.

Sun, W., Gordon, G. J., Boots, B., and Bagnell, J. Dual policy iteration. In Advances in Neural Information Processing Systems, pp. 7059-7069, 2018.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369-1376, 2007.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pp. 5026-5033, 2012.

Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.