# BRPO: Batch Residual Policy Optimization

Sungryull Sohn¹,², Yinlam Chow¹, Jayden Ooi¹, Ofir Nachum¹, Honglak Lee¹,², Ed Chi¹ and Craig Boutilier¹
¹Google Research  ²University of Michigan
srsohn@umich.edu, {yinlamchow, jayden, ofirnachum, honglak, edchi, cboutilier}@google.com

In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, high-confidence states without risking poor performance at sparsely-visited states. To remedy this, we propose residual policies, where the allowable deviation of the learned policy is state-action-dependent. We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance. We show that BRPO achieves state-of-the-art performance in a number of tasks.

## 1 Introduction

Deep reinforcement learning (RL) methods are increasingly successful in domains such as games [Mnih et al., 2013] and robotic manipulation [Nachum et al., 2019]. Much of this success relies on the ability to collect new data through online interactions with the environment during training, often relying on simulation. Unfortunately, this approach is impractical in many real-world applications where faithful simulators are rare, and in which active data collection through interactions with the environment is costly, time consuming, and risky. Batch (or offline) RL [Lange et al., 2012] is an emerging research direction that aims to circumvent the need for online data collection, instead learning a new policy using only offline trajectories generated by some behavior policy (e.g., the currently deployed policy in some application domain).

In principle, any off-policy RL algorithm (e.g., DDPG [Lillicrap et al., 2015], DDQN [Hasselt et al., 2016]) may be used in this batch (or, more accurately, offline) fashion; but in practice, such methods have been shown to fail to learn when presented with arbitrary, static, off-policy data. This can arise for several reasons: lack of exploration [Lange et al., 2012], generalization error on out-of-distribution samples in value estimation [Kumar et al., 2019], or high-variance policy gradients induced by covariate shift [Mahmood et al., 2014]. Various techniques have been proposed to address these issues, many of which can be interpreted as constraining or regularizing the learned policy to be close to the behavior policy [Fujimoto et al., 2018; Kumar et al., 2019] (see further discussion below). While these batch RL methods show promise, none provide improvement guarantees relative to the behavior policy. In domains for which batch RL is well-suited (e.g., due to the risks of active data collection), such guarantees can be critical to deployment of the resulting RL policies. In this work, we use the well-established methodology of conservative policy improvement (CPI) [Kakade and Langford, 2002] to develop a theoretically principled use of behavior-regularized RL in the batch setting.
Specifically, we parameterize the learned policy as a residual policy, in which a base (behavior) policy is combined linearly with a learned candidate policy using a mixing factor called the confidence. Such residual policies are motivated by several practical considerations. First, one often has access to offline data or logs generated by a deployed base policy that is known to perform reasonably well. The offline data can be used by an RL method to learn a candidate policy with better predicted performance, but if confidence in parts of that prediction is weak, relying on the base policy may be desirable. The base policy may also incorporate soft business constraints or some form of interpretability. Our residual policies blend the two in a learned, non-uniform fashion. When deploying a new policy, we use the CPI framework to derive updates that learn both the candidate policy and the confidence that jointly maximize a lower bound on performance improvement relative to the behavior policy. Crucially, while traditional applications of CPI, such as TRPO [Schulman et al., 2015], use a constant or state-independent confidence, our performance bounds and learning rules are based on state-action-dependent confidences; this gives rise to bounds that are less conservative than their CPI counterparts.

In Sec. 2, we formalize residual policies and in Sec. 3 analyze a novel difference-value function. Sec. 4 holds our main result, a tighter lower bound on policy improvement for our residual approach (vs. CPI and TRPO). We derive the BRPO algorithm in Sec. 5 to jointly learn the candidate policy and confidence; experiments in Sec. 6 show its effectiveness.

## 2 Preliminaries

We consider a Markov decision process (MDP) $\mathcal{M} = \langle S, A, R, T, P_0\rangle$, with state space $S$, action space $A$, reward function $R$, transition kernel $T$, and initial state distribution $P_0$. A policy $\pi$ interacts with the environment, starting at $s_0 \sim P_0$. At step $t$, the policy samples an action $a_t$ from a distribution $\pi(\cdot|s_t)$ over $A$ and applies it. The environment emits a reward $r_t = R(s_t, a_t) \in [0, R_{\max}]$ and next state $s_{t+1} \sim T(\cdot|s_t, a_t)$. In this work, we consider discounted infinite-horizon problems with discount factor $\gamma \in [0, 1)$. Let $\Pi = \{\pi : S \times A \to [0,1],\ \sum_a \pi(a|s) = 1\}$ be the set of Markovian stationary policies. The expected (discounted) cumulative return of policy $\pi \in \Pi$ is $J_\pi := \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 \sim P_0\big]$. Our aim is to find an optimal policy $\pi^* \in \arg\max_{\pi\in\Pi} J_\pi$. In reinforcement learning (RL), we must do so without knowledge of $R$ and $T$, using only trajectory data generated from the environment (see below) or access to a simulator (see above).

We consider pure offline or batch RL, where the learner has access to a fixed data set (or batch) of state-action-reward-next-state samples $B = \{(s, a, r, s')\}$, generated by a (known) behavior policy $\beta(\cdot|s)$. No additional data collection is permitted. We denote by $d_\beta$ the $\gamma$-discounted occupation measure of the MDP w.r.t. $\beta$.

In this work, we study the problem of residual policy optimization (RPO) in the batch setting. Given the behavior policy $\beta(a|s)$, we would like to learn a candidate policy $\rho(a|s)$ and a state-action confidence $\lambda(s,a)$, such that the final residual policy
$$\pi(a|s) = \big(1 - \lambda(s,a)\big)\,\beta(a|s) + \lambda(s,a)\,\rho(a|s)$$
maximizes total return. As discussed above, this type of mixture allows one to exploit an existing, well-performing behavior policy.
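To make the mixture concrete, here is a minimal sketch (not from the paper) that forms the residual action distribution at a single state from tabular $\beta$, $\rho$, and $\lambda$ and checks that it remains a valid distribution; all numerical values are illustrative.

```python
import numpy as np

def residual_policy(beta_s, rho_s, lam_s):
    """Mix behavior and candidate action distributions at one state:
    pi(a|s) = (1 - lambda(s,a)) * beta(a|s) + lambda(s,a) * rho(a|s)."""
    return (1.0 - lam_s) * beta_s + lam_s * rho_s

# Toy distributions over 3 actions at a single state (illustrative values).
beta_s = np.array([0.5, 0.3, 0.2])   # behavior policy beta(.|s)
rho_s  = np.array([0.2, 0.2, 0.6])   # candidate policy rho(.|s)

# A confidence that is constant across actions at this state trivially keeps
# sum_a pi(a|s) = 1; non-uniform choices must satisfy the constraint set
# defined below.
lam_s = np.full(3, 0.4)

pi_s = residual_policy(beta_s, rho_s, lam_s)
assert np.isclose(pi_s.sum(), 1.0) and np.all(pi_s >= 0.0)
print(pi_s)  # [0.38 0.26 0.36]
```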
Intuitively, $\lambda(s,a)$ should capture how much we can trust $\rho$ at each $(s,a)$ pair, given the available data. To ensure that the residual policy is a probability distribution at every state $s \in S$, we constrain the confidence $\lambda$ to lie in the set $\Lambda(s) = \{\lambda : S \times A \to [0,1] : \sum_a \lambda(s,a)\,(\beta(a|s) - \rho(a|s)) = 0\}$.

**Related work.** Similar to the above policy formulation, CPI [Kakade and Langford, 2002] also develops a policy-mixing methodology that guarantees performance improvement when the confidence $\lambda$ is a constant. However, CPI is an online algorithm, and it learns the candidate policy independently of (not jointly with) the mixing factor; thus, the extension of CPI to the offline, batch setting is unclear. Other existing work also deals with online residual policy learning without jointly learning mixing factors [Johannink et al., 2019; Silver et al., 2018]. Common applications of CPI may treat $\lambda$ as a hyper-parameter, which specifies the maximum total-variation distance between the learned and behavior policy distributions (see standard proxies in [Schulman et al., 2015; Pirotta et al., 2013] for details). Batch-constrained Q-learning (BCQ) [Fujimoto et al., 2018; Fujimoto et al., 2019] incorporates the behavior policy when defining the admissible action set in Q-learning, selecting the highest-valued actions that are similar to data samples in the batch. BEAR [Kumar et al., 2019] is motivated as a means to control the accumulation of out-of-distribution value errors, but its main algorithmic contribution is realized by adding a regularizer to the loss that measures the kernel maximum mean discrepancy (MMD) [Gretton et al., 2007] between the learned and behavior policies, similar to KL-control [Jaques et al., 2019]. Algorithms such as SPI [Ghavamzadeh et al., 2016] and SPIBB [Laroche and Trichelair, 2017] bootstrap the learned policy with the behavior policy when the uncertainty in the update for the current state-action pair is high, where the uncertainty is measured by the visitation frequency of state-action pairs in the batch data. While these methods work well in some applications, it is unclear whether they have any performance guarantees.

## 3 The Difference-value Function

We begin by defining and characterizing the difference-value function, a concept we exploit in the derivation of our batch RPO method in Secs. 4 and 5. For any $s \in S$, let $V_\pi(s)$ and $V_\beta(s)$ be the value functions induced by policies $\pi$ and $\beta$, respectively. Using the structure of the residual policy, we establish two characterizations of the difference-value function $\bar V_{\pi,\beta}(s) := V_\pi(s) - V_\beta(s)$.

**Lemma 1.** Let $A_\pi(s,a) := Q_\pi(s,a) - V_\pi(s)$ be the advantage function w.r.t. the residual policy $\pi$, where $Q_\pi$ is the state-action value. The difference-value is $\bar V_{\pi,\beta}(s) = \mathbb{E}_{T,\beta}\big[\sum_{t=0}^{\infty}\gamma^t\,\hat A_{\beta,\rho,\lambda}(s_t) \mid s_0 = s\big]$, where
$$\hat A_{\beta,\rho,\lambda}(s) = \sum_{a\in A}\beta(a|s)\,\lambda(s,a)\,\frac{\rho(a|s)-\beta(a|s)}{\beta(a|s)}\,A_\pi(s,a)$$
is the residual reward that depends on $\lambda$ and the difference of the candidate policy $\rho$ and behavior policy $\beta$.

This result establishes that the difference value is essentially a value function w.r.t. the residual reward. Moreover, the residual reward is proportional to the advantage of the target policy, the confidence, and the difference of policies. While the difference value can be estimated from the behavior data batch $B$, this formulation requires knowledge of the advantage function $A_\pi$ w.r.t. the target policy, which must be re-learned at every $\pi$-update in an off-policy fashion. Fortunately, we can show that the difference value can also be expressed as a function of the advantage w.r.t. the behavior policy $\beta$:

**Theorem 2.** Let $A_\beta(s,a) := Q_\beta(s,a) - V_\beta(s)$ be the advantage function induced by $\beta$, in which $Q_\beta$ is the state-action value. The difference-value is given by $\bar V_{\pi,\beta}(s) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty}\gamma^t\,\bar A_{\beta,\rho,\lambda}(s_t) \mid s_0 = s\big]$, where
$$\bar A_{\beta,\rho,\lambda}(s) = \sum_{a\in A}\beta(a|s)\,\lambda(s,a)\,\frac{\rho(a|s)-\beta(a|s)}{\beta(a|s)}\,A_\beta(s,a)$$
is the residual reward that depends on $\lambda$ and the difference of the candidate policy $\rho$ and behavior policy $\beta$.
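As an illustrative sanity check of Theorem 2 (not part of the original paper), the sketch below evaluates the residual reward at one state from tabular quantities and confirms that it vanishes when $\lambda(s,\cdot) = 0$ or $\rho(\cdot|s) = \beta(\cdot|s)$, consistent with the difference value being zero in those cases; the numbers are made up.

```python
import numpy as np

def residual_reward(beta_s, rho_s, lam_s, adv_beta_s):
    """Residual reward of Theorem 2 at one state:
    sum_a beta(a|s) * lambda(s,a) * (rho(a|s) - beta(a|s)) / beta(a|s) * A_beta(s,a),
    which simplifies to sum_a lambda(s,a) * (rho(a|s) - beta(a|s)) * A_beta(s,a)."""
    return np.sum(lam_s * (rho_s - beta_s) * adv_beta_s)

beta_s     = np.array([0.5, 0.3, 0.2])    # behavior policy beta(.|s)
rho_s      = np.array([0.2, 0.2, 0.6])    # candidate policy rho(.|s)
lam_s      = np.array([0.4, 0.4, 0.4])    # confidence lambda(s,.)
adv_beta_s = np.array([-1.0, 0.5, 2.0])   # advantage A_beta(s,.) (illustrative)

print(residual_reward(beta_s, rho_s, lam_s, adv_beta_s))          # 0.42
print(residual_reward(beta_s, rho_s, np.zeros(3), adv_beta_s))    # 0.0 (lambda = 0)
print(residual_reward(beta_s, beta_s, lam_s, adv_beta_s))         # 0.0 (rho = beta)
```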
In our RPO approach, we exploit the nature of the difference-value function to solve the maximization w.r.t. the confidence and candidate policy: $(\lambda^*(s,\cdot), \rho^*(\cdot|s)) \in \arg\max_{\lambda\in\Lambda(s),\,\rho\in\Pi} \bar V_{\pi,\beta}(s)$ for all $s \in S$. Since $\lambda(s,\cdot) = 0$ implies $\bar V_{\pi,\beta}(s) = 0$, the optimal difference-value function $\bar V^*(s) := \max_{\lambda\in\Lambda(s),\,\rho\in\Pi} \bar V_{\pi,\beta}(s)$ is always lower-bounded by $0$. We motivate computing $(\lambda, \rho)$ with the above difference-value formulation, rather than as a standard RL problem, as follows. In the tabular case, optimizing $(\lambda, \rho)$ with either formulation gives an identical result. However, both the difference-value function in Theorem 2 and the standard RL objective require sampling data generated by the updated policy $\pi$. In the batch setting, when fresh samples are unavailable, learning $(\lambda, \rho)$ with off-policy data may incur instability due to high generalization error [Kumar et al., 2019]. While this can be alleviated by adopting the CPI methodology, applying CPI directly to RL can be overly conservative [Schulman et al., 2015]. By contrast, we leverage the special structure of the difference-value function (e.g., non-negativity) below, using this new formulation together with CPI to derive a less conservative RPO algorithm.

## 4 Batch Residual Policy Optimization

We now develop an RPO algorithm that has stable learning performance in the batch setting and performance improvement guarantees. For the sake of brevity, in the following we only present the main results on the performance guarantees of RPO. Proofs of these results can be found in the appendix of the extended paper. We begin with the following baseline result, directly applying Corollary 1 of the TRPO result to RPO to ensure the residual policy $\pi$ performs no worse than $\beta$.

**Lemma 3.** For any value function $U : S \to \mathbb{R}$, the difference return satisfies
$$J_\pi - J_\beta \;\ge\; \frac{1}{1-\gamma}\left[\tilde L_{U,\beta,\rho,\lambda} \;-\; \frac{2\gamma}{(1-\gamma)^2}\,\epsilon_{U,\beta,\rho,\lambda}\,\mathbb{E}_{s\sim d_\beta}\!\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}\big(\beta(\cdot|s)\,\big\|\,\rho(\cdot|s)\big)}\right]\right],$$
where the surrogate objective and the penalty weight are
$$\tilde L_{U,\beta,\rho,\lambda} := \mathbb{E}_{(s,a,s')\sim d_\beta}\!\left[\lambda(s,a)\,\frac{\rho(a|s)-\beta(a|s)}{\beta(a|s)}\,\Delta U(s,a,s')\right],\qquad \epsilon_{U,\beta,\rho,\lambda} := \max_s\,\big|\mathbb{E}_{\pi,T}[\Delta U(s,a,s')]\big|,$$
and $\Delta U(s,a,s') := R(s,a) + \gamma U(s') - U(s)$. When $U = V_\pi$, one has $\mathbb{E}_{a\sim\pi(\cdot|s)}[\Delta U(s,a,s')] = 0$ for all $s \in S$, which implies that the inequality is tight; this lemma then coincides with Lemma 1.

While this CPI result forms the basis of many RL algorithms (e.g., TRPO, PPO), in many cases it is very loose, since $\epsilon_{U,\beta,\rho,\lambda}$ is a maximum over all states. Thus, using this bound for policy optimization may be overly conservative, i.e., algorithms that rely on this bound must take very small policy improvement steps, especially when the penalty weight $\epsilon_{U,\beta,\rho,\lambda}$ is large, i.e., $|\epsilon_{U,\beta,\rho,\lambda}/(1-\gamma)| \gg |\tilde L_{U,\beta,\rho,\lambda}|$. While this approach may be reasonable in online settings, where collection of new data (with an updated behavior policy $\beta \leftarrow \pi$) is allowed, in the batch setting it is challenging to overcome such conservatism. To address this issue, we develop a CPI method that is specifically tied to the difference-value formulation and uses a state-action-dependent confidence $\lambda(s,a)$. We first derive the following theorem, which bounds the difference in the returns generated by $\beta$ and $\pi$.
**Theorem 4.** The difference return of $(\pi,\beta)$ satisfies
$$J_\pi - J_\beta \;\ge\; \frac{1}{1-\gamma}\left[L^{\mathrm{sur}}_{\beta,\rho,\lambda} \;-\; \frac{\gamma}{1-\gamma}\,L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\max_{s_0\in S} L^{\mathrm{pen}}_{\beta,\rho,\lambda}(s_0)\right],$$
where the surrogate objective function, regularization, and penalty weight are given by
$$L^{\mathrm{sur}}_{\beta,\rho,\lambda} := \mathbb{E}_{(s,a)\sim d_\beta}\!\left[\lambda(s,a)\,\frac{\rho(a|s)-\beta(a|s)}{\beta(a|s)}\,A_\beta(s,a)\right],$$
$$L^{\mathrm{reg}}_{\beta,\rho,\lambda} := \mathbb{E}_{(s,a)\sim d_\beta}\!\left[\lambda(s,a)\,\frac{|\rho(a|s)-\beta(a|s)|}{\beta(a|s)}\right],$$
$$L^{\mathrm{pen}}_{\beta,\rho,\lambda}(s_0) := \mathbb{E}_{(s,a)\sim d_\beta(s_0)}\!\left[\lambda(s,a)\,\frac{|\rho(a|s)-\beta(a|s)|}{\beta(a|s)}\,|A_\beta(s,a)|\right],$$
respectively, in which $d_\beta(s_0)$ is the discounted occupancy measure w.r.t. $\beta$ given initial state $s_0$.

Unlike the difference-value formulations in Lemma 1 and Theorem 2, which require knowledge of the advantage function $A_\pi$ or trajectory samples generated by $\pi$, the lower bound in Theorem 4 is comprised only of terms that can be estimated directly using the data batch $B$ (i.e., data generated by $\beta$). This makes it a natural objective function for batch RL. Notice also that the surrogate objective, the regularization, and the penalty weight in the lower bound are each proportional to the confidence and to the relative difference of the candidate and behavior policies.

However, the max operator requires state enumeration to compute this lower bound, which is intractable when $S$ is large or uncountable. We address this by introducing a slack variable $\kappa \ge 0$ to replace the max operator with suitable constraints. This allows the bound on the difference return to be rewritten as
$$J_\pi - J_\beta \;\ge\; \frac{1}{1-\gamma}\,L^{\mathrm{sur}}_{\beta,\rho,\lambda} \;-\; \frac{\gamma}{(1-\gamma)^2}\,L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\min_{\kappa\,\ge\, L^{\mathrm{pen}}_{\beta,\rho,\lambda}(s_0),\ \forall s_0}\ \kappa .$$
Consider the Lagrangian of this lower bound:
$$\frac{L^{\mathrm{sur}}_{\beta,\rho,\lambda}}{1-\gamma} \;-\; \min_{\kappa\ge0}\ \max_{\eta(s)\ge0,\ \forall s}\ \left[\frac{\gamma\,L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\kappa}{(1-\gamma)^2} \;-\; \sum_{s}\eta(s)\big(\kappa - L^{\mathrm{pen}}_{\beta,\rho,\lambda}(s)\big)\right].$$
To simplify this saddle-point problem, we restrict the Lagrange multiplier to be $\eta(s) = \eta\,P_0(s) \ge 0$, where $\eta \ge 0$ is a scalar multiplier. Using this approximation and the strong duality of linear programming [Boyd and Vandenberghe, 2004] over the primal-dual variables $(\kappa, \eta)$, the saddle-point problem on $(\lambda, \rho, \eta, \kappa)$ can be rewritten as
$$L_{\beta,\rho,\lambda} := \frac{L^{\mathrm{sur}}_{\beta,\rho,\lambda}}{1-\gamma} - \max_{\eta\ge0}\ \min_{\kappa\ge0}\ \left[\frac{\gamma\,L^{\mathrm{reg}}_{\beta,\rho,\lambda}}{(1-\gamma)^2}\,\kappa - \eta\,\kappa + \eta\,\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda}\right] \;=\; \frac{1}{1-\gamma}\left[L^{\mathrm{sur}}_{\beta,\rho,\lambda} - \frac{\gamma}{1-\gamma}\,L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda}\right], \tag{1}$$
where $\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda} = \mathbb{E}_{s\sim P_0}\big[L^{\mathrm{pen}}_{\beta,\rho,\lambda}(s)\big]$. The equality is based on the KKT conditions on $(\kappa, \eta)$. Notice that the only difference between the CPI lower bound in Theorem 4 and the objective function $L_{\beta,\rho,\lambda}$ is that the max operator is replaced by an expectation w.r.t. the initial distribution. Under certain assumptions on the approximation error of the Lagrange-multiplier parametrization $\eta(s) \propto P_0(s)$, we can characterize the gap between the original CPI objective function in Theorem 4 and $L_{\beta,\rho,\lambda}$. One approach is to examine the KKT conditions of the original saddle-point problem and bound the sub-optimality gap introduced by this Lagrange parameterization. Similar derivations can be found in the analysis of approximate linear programming (ALP) algorithms [Abbasi-Yadkori et al., 2019; Farias and Roy, 2003].

Compared with the vanilla CPI result from Lemma 3, there are two characteristics of problem (1) that make the optimization w.r.t. $L_{\beta,\rho,\lambda}$ less conservative. First, the penalty weight $\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda}$ here is smaller than $\epsilon_{U,\beta,\rho,\lambda}$ in Lemma 3, which means that the corresponding objective has less incentive to force $\rho$ to be close to $\beta$. Second, compared with the entropy regularization in vanilla CPI, here the regularization and penalty weight are both linear in $\lambda \in \Lambda \subseteq [0,1]^{|A|}$; thus, unlike vanilla CPI, whose objective is linear in $\lambda$, our objective is quadratic in $\lambda$. This modification ensures that the optimal value is not a degenerate extreme point of $\Lambda$.¹

¹ For example, when $\lambda$ is state-dependent (which automatically satisfies the equality constraints in $\Lambda$), the linear objective in vanilla CPI makes the optimal value $\lambda^*(\cdot|s)$ a 0-1 vector. In particular, when $\frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_\beta}\big[\sqrt{\tfrac12 D_{\mathrm{KL}}(\beta(\cdot|s)\,\|\,\rho(\cdot|s))}\big]$ is large, most entries of $\lambda^*$ become zero, i.e., $\pi$ will be very close to $\beta$.
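For concreteness, the following sketch (an illustration under our own assumptions, not the authors' code) estimates the three terms of the objective in (1) from a batch of states, replacing the expectations under $d_\beta$, and the initial-state expectation in the penalty term, with simple averages over batch states, mirroring the sample-average treatment used in Sec. 5; the policies and advantage estimates are random placeholders.

```python
import numpy as np

def brpo_objective_estimate(beta, rho, lam, adv_beta, gamma=0.99):
    """Batch estimate of L_{beta,rho,lambda} in (1):
    (1/(1-gamma)) * [L_sur - (gamma/(1-gamma)) * L_reg * L_pen_bar].
    All arrays have shape (num_batch_states, num_actions); expectations over
    d_beta (and P_0 for the penalty) are approximated by batch averages."""
    diff = rho - beta
    l_sur = np.mean(np.sum(lam * diff * adv_beta, axis=1))                   # surrogate
    l_reg = np.mean(np.sum(lam * np.abs(diff), axis=1))                      # regularization
    l_pen = np.mean(np.sum(lam * np.abs(diff) * np.abs(adv_beta), axis=1))   # penalty (batch approx.)
    return (l_sur - gamma / (1.0 - gamma) * l_reg * l_pen) / (1.0 - gamma)

rng = np.random.default_rng(0)
n_states, n_actions = 32, 4
beta = rng.dirichlet(np.ones(n_actions), size=n_states)       # behavior policy at batch states
rho = rng.dirichlet(np.ones(n_actions), size=n_states)        # candidate policy
lam = np.full((n_states, n_actions), 0.2)                      # confidence
adv_beta = rng.normal(0.0, 1.0, size=(n_states, n_actions))    # advantage estimates

print(brpo_objective_estimate(beta, rho, lam, adv_beta))
```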
## 5 The BRPO Algorithm

We now develop the BRPO algorithm, for which general pseudo-code is given in Algorithm 1. Recall that if the candidate policy $\rho$ and confidence $\lambda$ are jointly optimized,
$$(\rho^*, \lambda^*) \in \arg\max_{\lambda\in\Lambda,\ \rho\in\Pi}\ L_{\beta,\rho,\lambda}, \tag{2}$$
then the residual policy $\pi^*(a|s) = (1-\lambda^*(s,a))\,\beta(a|s) + \lambda^*(s,a)\,\rho^*(a|s)$ performs no worse than the behavior policy $\beta$. Generally, solutions for problem (2) use a form of minorization-maximization (MM) [Hunter and Lange, 2004], a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $L_{\beta,\rho,\lambda}$ is a surrogate function satisfying the following MM properties:
$$J_\pi - J_\beta \ge L_{\beta,\rho,\lambda},\qquad J_\beta - J_\beta = L_{\beta,\beta,\lambda} = L_{\beta,\rho,0} = 0, \tag{3}$$
which guarantees that it minorizes the difference return $J_\pi - J_\beta$, with equality at $\lambda = 0$ (with arbitrary $\rho$) or at $\rho = \beta$ (with arbitrary $\lambda$). This algorithm is also reminiscent of proximal gradient methods.

We optimize $\lambda$ and $\rho$ in RPO with a simple two-step coordinate ascent. Specifically, at iteration $k \in \{0, 1, \ldots, K\}$, given confidence $\lambda_{k-1}$, we first compute an updated candidate policy $\rho_k$, and with $\rho_k$ fixed, we update $\lambda_k$, i.e., $L_{\beta,\rho_k,\lambda_k} \ge L_{\beta,\rho_k,\lambda_{k-1}} \ge L_{\beta,\rho_{k-1},\lambda_{k-1}}$. When $\lambda$ and $\rho$ are represented tabularly or with linear function approximators, under certain regularity assumptions (the Kurdyka-Łojasiewicz property [Xu and Yin, 2013]) coordinate ascent guarantees global convergence (to the limit point) for BRPO. However, when more complex representations (e.g., neural networks) are used to parameterize these decision variables, this property no longer holds. While one may still compute $(\lambda^*, \rho^*)$ with first-order methods (e.g., SGD), convergence to local optima is not guaranteed. To address this, we next further restrict the MM procedure to develop closed-form solutions for both the candidate policy and the confidence.

**The Closed-form Candidate Policy ρ.** To effectively update the candidate policy when given the confidence $\lambda \in \Lambda$, we develop a closed-form solution for $\rho$. Our approach is based on maximizing the following objective, itself a more conservative version of the CPI lower bound $L_{\beta,\rho,\lambda}$:
$$\max_\rho\ L^{\mathrm{KL}}_{\beta,\rho,\lambda} := \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_\beta}\!\left[\sum_{a\in A}\lambda(s,a)\big(\rho(a|s)-\beta(a|s)\big)A_\beta(s,a) \;-\; \frac{\gamma\,\max\{\kappa_\lambda(s),\,\kappa_{|A_\beta|\lambda}(s)\}}{2(1-\gamma)}\,D_{\mathrm{KL}}\big(\rho(\cdot|s)\,\big\|\,\beta(\cdot|s)\big)\right], \tag{4}$$
where $\kappa_g(s) = 1 + \log\mathbb{E}_{a\sim\beta}\big[\exp\big(g(s,a)^2\big)\big] > 0$ for any arbitrary non-negative function $g$. To show that $L^{\mathrm{KL}}_{\beta,\rho,\lambda}$ in (4) is an eligible lower bound (so that the corresponding $\rho$-solution is an MM step), we need to show that it satisfies the properties in (3). When $\rho = \beta$, the second property holds by the definition of $L^{\mathrm{KL}}_{\beta,\rho,\lambda}$. To show the first property, we first consider the following problem:
$$\max_\rho\ \frac{1}{1-\gamma}\left[L^{\mathrm{sur}}_{\beta,\rho,\lambda} - \frac{\gamma}{1-\gamma}\,\tilde L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\tilde L^{\mathrm{pen}}_{\beta,\rho,\lambda}\right], \tag{5}$$
where $L^{\mathrm{sur}}_{\beta,\rho,\lambda}$ is given in Theorem 4, and
$$\tilde L^{\mathrm{reg}}_{\beta,\rho,\lambda} = \mathbb{E}_{s\sim d_\beta}\!\left[\sqrt{\kappa_\lambda(s)\,D_{\mathrm{KL}}\big(\rho(\cdot|s)\,\|\,\beta(\cdot|s)\big)/2}\right],\qquad \tilde L^{\mathrm{pen}}_{\beta,\rho,\lambda} = \mathbb{E}_{s\sim d_\beta}\!\left[\sqrt{\kappa_{|A_\beta|\lambda}(s)\,D_{\mathrm{KL}}\big(\rho(\cdot|s)\,\|\,\beta(\cdot|s)\big)/2}\right]. \tag{6}$$
The concavity of $\sqrt{\cdot}$ (i.e., $\mathbb{E}_{s\sim d_\beta}[\sqrt{(\cdot)}] \le \sqrt{\mathbb{E}_{s\sim d_\beta}[(\cdot)]}$) and the monotonicity of expectation imply that the objective in (4) is a lower bound of that in (5).
Furthermore, by the weighted Pinsker inequality [Bolley and Villani, 2005], $\sum_a \big|g(s,a)\,\big(\rho(a|s) - \beta(a|s)\big)\big| \le \sqrt{\kappa_g(s)\,D_{\mathrm{KL}}(\rho(\cdot|s)\,\|\,\beta(\cdot|s))/2}$, we have
$$0 \le L^{\mathrm{reg}}_{\beta,\rho,\lambda} \le \tilde L^{\mathrm{reg}}_{\beta,\rho,\lambda},\qquad 0 \le \bar L^{\mathrm{pen}}_{\beta,\rho,\lambda} \le \tilde L^{\mathrm{pen}}_{\beta,\rho,\lambda},$$
which implies that the objective in (5) is a lower bound of that in (2) and validates the first MM property.

Now recall the optimization problem $\max_\rho L^{\mathrm{KL}}_{\beta,\rho,\lambda}$. Since this optimization is over the state-action mapping $\rho$, the Interchangeability Lemma [Shapiro et al., 2009] allows swapping the order of $\mathbb{E}_{s\sim d_\beta}$ and $\max_\rho$. This implies that at each $s \in S$ the candidate policy can be solved for using
$$\rho^*_\lambda(\cdot|s) \in \arg\max_{\rho(\cdot|s)}\ \mathbb{E}_{a\sim\rho(\cdot|s)}\!\left[\lambda(s,a)\,A_\beta(s,a) - \tau_\lambda(s)\,\log\frac{\rho(a|s)}{\beta(a|s)}\right], \tag{7}$$
where $\tau_\lambda(s) = \gamma\,\max\{\kappa_\lambda(s),\,\kappa_{|A_\beta|\lambda}(s)\}/(2(1-\gamma))$ is the state-dependent penalty weight of the relative-entropy regularization. By the KKT conditions of (7), the optimal candidate policy $\rho^*_\lambda$ has the form
$$\rho^*_\lambda(a|s) = \frac{\beta(a|s)\,\exp\!\big(\lambda(s,a)\,A_\beta(s,a)/\tau_\lambda(s)\big)}{\mathbb{E}_{a'\sim\beta}\!\big[\exp\!\big(\lambda(s,a')\,A_\beta(s,a')/\tau_\lambda(s)\big)\big]}. \tag{8}$$
Notice that the optimal candidate policy is a relative softmax policy, which is a common solution policy for many entropy-regularized RL algorithms [Haarnoja et al., 2018]. Intuitively, when the mixing factor vanishes (i.e., $\lambda(s,a) = 0$), the candidate policy equals the behavior policy; with nonzero confidence, we obtain the candidate policy by modifying the behavior policy $\beta$ via exponential twisting.
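A minimal sketch of the relative-softmax update in (8) at a single state is given below (illustrative only); it computes the temperature $\tau_\lambda(s)$ from the stated definition of $\kappa_g(s)$ and then exponentially twists $\beta$. With $\gamma$ close to 1 the temperature is large and the twist is mild, which is exactly the conservatism that the temperature-capping heuristic discussed below is meant to relieve.

```python
import numpy as np

def kappa(beta_s, g_s):
    """kappa_g(s) = 1 + log E_{a~beta}[exp(g(s,a)^2)] for a non-negative function g."""
    return 1.0 + np.log(np.sum(beta_s * np.exp(g_s ** 2)))

def candidate_policy(beta_s, lam_s, adv_beta_s, gamma=0.99):
    """Closed-form candidate policy of Eq. (8): an exponential twist of beta."""
    tau = gamma * max(kappa(beta_s, lam_s),
                      kappa(beta_s, np.abs(adv_beta_s) * lam_s)) / (2.0 * (1.0 - gamma))
    logits = lam_s * adv_beta_s / tau
    w = np.exp(logits - logits.max())            # stabilized exponentials (max cancels out)
    rho_s = beta_s * w / np.sum(beta_s * w)      # normalize by E_{a~beta}[exp(.)]
    return rho_s, tau

beta_s     = np.array([0.5, 0.3, 0.2])
lam_s      = np.array([0.4, 0.4, 0.4])
adv_beta_s = np.array([-1.0, 0.5, 2.0])

rho_s, tau = candidate_policy(beta_s, lam_s, adv_beta_s)
print(rho_s, rho_s.sum(), tau)   # a slightly tilted beta that up-weights high-advantage actions
```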
**The Closed-form Confidence λ.** Given the candidate policy $\rho$, we derive an efficient scheme for computing the confidence that solves the MM problem $\max_{\lambda\in\Lambda} L_{\beta,\rho,\lambda}$. Recall that this optimization can be reformulated as a concave quadratic program (QP) with linear equality constraints, which has a unique optimal solution [Faybusovich and Moore, 1997]. However, since the decision variable (i.e., the confidence mapping) is infinite-dimensional, solving this QP is intractable without some assumptions about this mapping. To resolve this issue, instead of using the surrogate objective $L_{\beta,\rho,\lambda}$ in MM, we turn to its sample-based estimate. Specifically, given a batch of data $B = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{|B|}$ generated by the behavior policy $\beta$, denote by
$$\hat L^{\mathrm{sur}}_{\beta,\rho,\lambda} := \frac{1}{(1-\gamma)|B|}\,\big\langle \lambda,\ (\rho-\beta)\cdot A_\beta\big\rangle,\qquad \hat L^{\mathrm{reg}}_{\beta,\rho,\lambda} := \frac{1}{(1-\gamma)|B|}\,\big\langle \lambda,\ |\rho-\beta|\big\rangle,\qquad \hat L^{\mathrm{pen}}_{\beta,\rho,\lambda} := \frac{\gamma}{(1-\gamma)|B|}\,\big\langle \lambda,\ |\rho-\beta|\cdot|A_\beta|\big\rangle$$
the sample-average approximations (SAA) of the functions $L^{\mathrm{sur}}_{\beta,\rho,\lambda}$, $L^{\mathrm{reg}}_{\beta,\rho,\lambda}$, and $\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda}$, respectively, where
$$(\rho-\beta)\cdot A_\beta = \big\{\big(\rho(\cdot|s_i)-\beta(\cdot|s_i)\big)\,A_\beta(s_i,\cdot)\big\}_{s_i\in B}, \tag{9}$$
$$|\rho-\beta|\cdot|A_\beta| = \big\{\big|\rho(\cdot|s_i)-\beta(\cdot|s_i)\big|\,\big|A_\beta(s_i,\cdot)\big|\big\}_{s_i\in B}, \tag{10}$$
and $|\rho-\beta| = \{|\rho(\cdot|s_i)-\beta(\cdot|s_i)|\}_{s_i\in B}$ are $|A|\cdot|B|$-dimensional vectors, where each element is generated by a state sample from $B$, and $\lambda = \{\lambda(\cdot|s_i)\}_{s_i\in B}$ is an $|A|\cdot|B|$-dimensional decision vector, where each $|A|$-dimensional element corresponds to the confidence w.r.t. a state sample in $B$. Since the expectations in $L^{\mathrm{sur}}_{\beta,\rho,\lambda}$, $L^{\mathrm{reg}}_{\beta,\rho,\lambda}$, and $\bar L^{\mathrm{pen}}_{\beta,\rho,\lambda}$ are over the stationary distribution induced by the behavior policy, all the SAA functions are unbiased Monte-Carlo estimates of their population-based counterparts.

We now define the SAA-MM objective $\hat L_{\beta,\rho,\lambda} := \hat L^{\mathrm{sur}}_{\beta,\rho,\lambda} - \hat L^{\mathrm{reg}}_{\beta,\rho,\lambda}\,\hat L^{\mathrm{pen}}_{\beta,\rho,\lambda}$ and use it to solve for the confidence vector $\lambda$ over the batch samples. Consider the following maximization problem:
$$\max_{\lambda\in\hat\Lambda}\ \big\langle(\rho-\beta)\cdot A_\beta,\ \lambda\big\rangle \;-\; \frac{\gamma}{|B|(1-\gamma)}\,\big\langle|\rho-\beta|,\ \lambda\big\rangle\,\big\langle|\rho-\beta|\cdot|A_\beta|,\ \lambda\big\rangle, \tag{11}$$
where the feasible set $\hat\Lambda = \big\{\lambda \in [0,1]^{|A|\cdot|B|} : \sum_{a\in A}\lambda(s_i,a)\,\big(\rho(a|s_i)-\beta(a|s_i)\big) = 0,\ i\in\{1,\ldots,|B|\}\big\}$ only imposes constraints on the states that appear in the batch $B$.

This finite-dimensional QP can be expressed in the following quadratic form:
$$\max_{\lambda\in\hat\Lambda}\ \lambda^\top\big((\rho-\beta)\cdot A_\beta\big) \;-\; \tfrac{1}{2}\,\lambda^\top\Theta_{\beta,\rho}\,\lambda,$$
where the symmetric matrix is given by
$$\Theta_{\beta,\rho} := \frac{\gamma}{|B|(1-\gamma)}\big(D_{|A_\beta|}\,\Delta_{\beta,\rho} + \Delta_{\beta,\rho}\,D_{|A_\beta|}\big),\qquad \Delta_{\beta,\rho} = |\rho-\beta|\,|\rho-\beta|^\top,$$
and $D_{|A_\beta|} = \mathrm{diag}\big(\{|A_\beta(s,a)|\}_{a\in A,\,s\in B}\big)$ is a $|B||A| \times |B||A|$ diagonal matrix whose elements are the absolute advantages. By definition, $\Theta_{\beta,\rho}$ is positive semi-definite; hence the QP above is concave. Using its KKT conditions, the unique optimal confidence vector over the batch $B$ is given by
$$\lambda^* = \min\Big\{1,\ \max\Big\{0,\ \Theta_{\beta,\rho}^{-1}\big((\rho-\beta)\cdot A_\beta + M_{\beta,\rho}\,\nu_{\beta,\rho}\big)\Big\}\Big\}, \tag{12}$$
where $M_{\beta,\rho} = \mathrm{blkdiag}\big(\{[\rho(a|x)-\beta(a|x)]_{a\in A}\}_{x\in B}\big)$ is a $|B||A| \times |B|$ matrix, and the Lagrange multiplier $\nu_{\beta,\rho} \in \mathbb{R}^{|B|}$ w.r.t. the constraint $M_{\beta,\rho}^\top\lambda = 0$ is given by
$$\nu_{\beta,\rho} = -\big(M_{\beta,\rho}^\top\Theta_{\beta,\rho}^{-1}M_{\beta,\rho}\big)^{-1}\big(M_{\beta,\rho}^\top\Theta_{\beta,\rho}^{-1}\,(\rho-\beta)\cdot A_\beta\big). \tag{13}$$

We first construct the confidence function $\lambda(s,a)$ from the confidence vector $\lambda^*$ over $B$ in the following tabular fashion:
$$\lambda(s,a) = \begin{cases} \lambda^*_{s,a} & \text{if } (s,a) \in B,\\ 0 & \text{otherwise.}\end{cases}$$
While this construction preserves optimality w.r.t. the CPI objective (2), it may be overly conservative, because the policy equates to the behavior policy by setting $\lambda = 0$ at state-action pairs that are not in $B$ (i.e., no policy improvement). To alleviate this conservatism, we propose to learn a confidence function that generalizes to out-of-distribution samples.

**Learning the Confidence.** Given a confidence vector $\lambda^*$ corresponding to samples in batch $B$, we learn the confidence function $\lambda_\phi(s,a)$ in a supervised fashion. To ensure that the confidence function satisfies the constraints $\lambda_\phi \in \Lambda$, i.e., $\sum_a \lambda_\phi(s,a)\big(\rho(a|s)-\beta(a|s)\big) = 0$ and $\lambda_\phi(s,a) \in [0,1]$ for all $s,a$,² we parameterize it as
$$\lambda_\phi(s,a) := \frac{\pi_\phi(a|s) - \beta(a|s)}{\rho(a|s) - \beta(a|s)},\qquad \forall (s,a)\in S\times A, \tag{14}$$
where $\pi_\phi$ is a learnable policy mapping such that $\min\{\beta(a|s),\rho(a|s)\} \le \pi_\phi(a|s) \le \max\{\beta(a|s),\rho(a|s)\}$ for all $s,a$. We then learn $\phi$ via the following KL distribution-fitting objective [Rusu et al., 2015]:
$$\min_\phi\ \sum_{(s,a)\in B}\pi_\phi(a|s)\,\log\frac{\pi_\phi(a|s)}{(1-\lambda^*_{s,a})\,\beta(a|s) + \lambda^*_{s,a}\,\rho(a|s)}.$$

² If one restricts $\lambda_\phi$ to be only state-dependent, this constraint immediately holds.

While this approach learns $\lambda_\phi$ by generalizing the confidence vector to out-of-distribution samples, when $\pi_\phi$ is a neural network, one challenge is to enforce the constraint $\min\{\beta(a|s),\rho(a|s)\} \le \pi_\phi(a|s) \le \max\{\beta(a|s),\rho(a|s)\}$ for all $s,a$. Instead, using an in-graph convex-optimization network [Amos and Kolter, 2017], we parameterize $\lambda_\phi$ with a neural network that has the following constraint-projection layer $\Phi : S \to [0,1]^{|A|}$ before the output:
$$\Phi(s) \in \arg\min_{\lambda\in\mathbb{R}^{|A|}}\ \frac{1}{2}\sum_{a\in A}\big(\lambda_a - \tilde\lambda^*_{s,a}\big)^2\quad \text{s.t.}\quad \sum_a \lambda_a\big(\rho(a|s)-\beta(a|s)\big) = 0,\quad 0 \le \lambda \le 1, \tag{15}$$
where, at any $s \in S$, the $|A|$-dimensional confidence label $\{\tilde\lambda^*_{s,a}\}_{a\in A}$ is equal to $\{\lambda^*_{s',a}\}_{a\in A}$, chosen from the batch confidence vector $\lambda^*$ such that $s' \in B$ is closest to $s$. Indeed, analogous to the closed-form solution in (12), this projection layer has a closed-form QP solution with linear constraints:
$$\Phi(s) = \min\big\{1,\ \max\big\{0,\ \tilde\lambda^*_{s,\cdot} + \big(\rho(\cdot|s)-\beta(\cdot|s)\big)\,\mu_{\beta,\rho}\big\}\big\},$$
where the Lagrange multiplier $\mu_{\beta,\rho}$ is given by $\mu_{\beta,\rho} = -\big(\rho(\cdot|s)-\beta(\cdot|s)\big)^\top \tilde\lambda^*_{s,\cdot}\,\big/\,\big\|\rho(\cdot|s)-\beta(\cdot|s)\big\|^2$.
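As a small illustration of the projection layer, the sketch below implements the closed form stated for (15): move the nearest-neighbor confidence label onto the hyperplane $\sum_a \lambda_a\big(\rho(a|s)-\beta(a|s)\big) = 0$ and then clip to $[0,1]$; the inputs are illustrative and the function name is our own.

```python
import numpy as np

def project_confidence(lam_label, beta_s, rho_s):
    """Closed form of the projection layer in Eq. (15): project the |A|-dim label
    onto the hyperplane sum_a lam_a (rho(a|s) - beta(a|s)) = 0, then clip to [0, 1]."""
    d = rho_s - beta_s
    mu = -np.dot(d, lam_label) / np.dot(d, d)      # Lagrange multiplier of the equality constraint
    return np.clip(lam_label + d * mu, 0.0, 1.0)

beta_s = np.array([0.5, 0.3, 0.2])
rho_s  = np.array([0.2, 0.2, 0.6])
lam_label = np.array([0.7, 0.1, 0.5])      # nearest-state confidence label (illustrative)

lam_proj = project_confidence(lam_label, beta_s, rho_s)
print(lam_proj, np.dot(lam_proj, rho_s - beta_s))   # second value ~0 when no clipping is active
```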
Although the $\rho$-update is theoretically justified, in practice, when the magnitude of $\kappa_\lambda(s)$ becomes large (due to the conservatism of the weighted Pinsker inequality), the relative-softmax candidate policy (8) may be too close to the behavior policy $\beta$, impeding learning of the residual policy (i.e., $\pi \approx \beta$). To avoid this in practice, we can upper-bound the temperature, i.e., $\kappa_\lambda(s) \leftarrow \min\{\kappa_{\max}, \kappa_\lambda(s)\}$, or introduce a weak temperature-decay schedule, i.e., $\kappa_\lambda(s) \leftarrow \kappa_\lambda(s)\,\epsilon^k$, with a tunable $\epsilon \in [0,1)$.

**Algorithm 1: BRPO**

Require: batch data $B$; tunable parameter $\mu \in [0,1]$.

1. For $t = 1, \ldots, N$:
2. Sample a mini-batch of transitions $(s, a, r, d, s')$ from $B$.
3. Compute $\lambda^*$ from Eq. (12).
4. Update the confidence parameters $\phi$ by Eq. (15).
5. Update the candidate policy $\rho^*_{\lambda_\phi}$ by Eq. (8).
6. Construct the target critic $V_{\theta'}(s') := (1-\mu)\,\mathbb{E}_{a'\sim\beta}[Q_{\theta'}(s',a')] + \mu\,\max_{a'} Q_{\theta'}(s',a')$.
7. Update $\theta \leftarrow \arg\min_\theta \tfrac{1}{2}\big(Q_\theta(s,a) - r - \gamma V_{\theta'}(s')\big)^2$.
8. Update the target network: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$.

## 6 Experimental Results

To illustrate the effectiveness of BRPO, we compare against six baselines: DQN [Mnih et al., 2013], discrete BCQ [Fujimoto et al., 2019], KL-regularized Q-learning (KL-Q) [Jaques et al., 2019], SPIBB [Laroche and Trichelair, 2017], Behavior Cloning (BC) [Kober and Peters, 2010], and BRPO-C, a simplified version of BRPO that uses a constant (tunable) parameter as the confidence weight.³ We do not consider ensemble models, and thus do not include methods like BEAR [Kumar et al., 2019] among our baselines. CPI is also excluded, since it is subsumed by BRPO-C with a grid search on the confidence. It is also generally inferior to BRPO-C, because its candidate-policy learning does not optimize the performance of the final mixture policy.

³ For algorithms designed for online settings, we modify data collection to sample only from offline/batch data.

We evaluated on three discrete-action OpenAI Gym tasks [Brockman et al., 2016]: CartPole-v1, LunarLander-v2, and Acrobot-v1. The behavior policy in each environment is trained using standard DQN until it reaches 75% of optimal performance, similar to the process adopted in related work (e.g., [Fujimoto et al., 2018]).
To assess how exploration and the quality of the behavior policy affect learning, we generate five sets of data for each task by injecting different amounts of random exploration into the same behavior policy. Specifically, we add ε-greedy exploration with ε = 1 (fully random), 0.5, 0.25, 0.15, and 0.05, generating 100K transitions each for batch RL training. All models use the same architecture for a given environment; details (architectures, hyper-parameters, etc.) are described in the appendix of the extended paper. While training is entirely offline, policy performance is evaluated online using the simulator every 1000 training iterations. Each measurement is the average return over 40 evaluation episodes and 5 random seeds, and results are averaged over a sliding window of size 10.

| Environment-ε | DQN | BRPO-C | BRPO (ours) | BCQ | KL-Q | SPIBB | BC | Behavior Policy |
|---|---|---|---|---|---|---|---|---|
| Acrobot-0.05 | **-91.2 ± 9.1** | -94.6 ± 3.8 | **-91.9 ± 9.0** | -96.9 ± 3.7 | -93.0 ± 2.6 | -103.5 ± 24.1 | -102.3 ± 5.0 | -103.9 |
| Acrobot-0.15 | **-83.1 ± 5.2** | -91.7 ± 4.0 | **-86.1 ± 10.1** | -97.1 ± 3.3 | -92.1 ± 3.2 | -91.1 ± 44.8 | -113.1 ± 5.6 | -114.3 |
| Acrobot-0.25 | **-83.4 ± 3.9** | -91.2 ± 4.1 | **-85.3 ± 4.8** | -96.7 ± 3.1 | -90.0 ± 2.9 | -86.0 ± 5.8 | -124.1 ± 7.0 | -127.2 |
| Acrobot-0.50 | -84.3 ± 22.6 | -90.9 ± 3.4 | **-83.7 ± 16.6** | **-77.8 ± 13.5** | -84.5 ± 3.8 | -106.8 ± 102.7 | -173.7 ± 8.1 | -172.4 |
| Acrobot-1.00 | -208.9 ± 174.8 | **-156.8 ± 22.0** | **-121.7 ± 10.2** | -236.0 ± 85.6 | -227.5 ± 148.1 | -184.8 ± 150.2 | -498.3 ± 1.7 | -497.3 |
| CartPole-0.05 | 82.7 ± 0.5 | 220.8 ± 117.0 | **336.3 ± 122.6** | 255.4 ± 11.1 | **323.0 ± 13.5** | 28.8 ± 1.2 | 205.6 ± 19.6 | 219.1 |
| CartPole-0.15 | 299.3 ± 133.5 | 305.6 ± 95.2 | **409.9 ± 64.4** | 255.3 ± 11.4 | **357.7 ± 84.1** | 137.7 ± 11.7 | 151.6 ± 27.5 | 149.5 |
| CartPole-0.25 | 368.5 ± 129.3 | **405.1 ± 74.4** | 316.8 ± 64.1 | 247.4 ± 128.7 | **441.4 ± 79.8** | 305.2 ± 119.7 | 103.0 ± 20.4 | 101.9 |
| CartPole-0.50 | 271.5 ± 52.0 | **358.3 ± 114.1** | **433.8 ± 93.5** | 282.5 ± 111.8 | 314.1 ± 107.0 | 310.4 ± 128.0 | 39.7 ± 5.1 | 37.9 |
| CartPole-1.00 | 118.3 ± 0.3 | **458.6 ± 51.5** | **369.0 ± 42.3** | 194.0 ± 25.1 | 209.7 ± 48.4 | 147.1 ± 0.1 | 22.6 ± 1.5 | 21.9 |
| LunarLander-0.05 | -236.4 ± 177.6 | 35.6 ± 61.7 | **88.2 ± 32.0** | 81.5 ± 14.9 | **84.4 ± 26.3** | -200.4 ± 81.7 | 75.8 ± 17.7 | 73.7 |
| LunarLander-0.15 | -215.6 ± 140.4 | 79.6 ± 29.7 | **103.9 ± 49.8** | 80.3 ± 16.8 | 61.4 ± 39.0 | **86.1 ± 73.3** | 76.4 ± 16.6 | 84.9 |
| LunarLander-0.25 | 2.5 ± 101.3 | 109.5 ± 40.7 | **141.6 ± 11.0** | 83.5 ± 14.6 | 78.7 ± 48.8 | **166.0 ± 90.6** | 57.9 ± 13.1 | 57.3 |
| LunarLander-0.50 | -104.6 ± 68.3 | 42.5 ± 71.4 | **101.0 ± 39.6** | -13.2 ± 44.9 | **66.2 ± 78.0** | -134.6 ± 17.1 | -32.6 ± 6.5 | -36.0 |
| LunarLander-1.00 | -65.6 ± 45.9 | **53.5 ± 44.1** | **81.8 ± 42.1** | -69.1 ± 44.0 | -139.2 ± 29.1 | -107.1 ± 94.4 | -177.4 ± 13.1 | -182.6 |

Table 1: Mean and standard deviation of average return with the best hyper-parameter configuration (top-2 results boldfaced). Full training curves are given in the appendix. For BRPO-C, the optimal confidence parameter is found by grid search.

Table 1 shows the average return of BRPO and the other baselines under the best hyper-parameter configurations in each task setting. Behavior-policy performance decreases as ε increases, as expected, and BC matches it very closely. DQN performs poorly in the batch setting. Its performance improves as ε increases from 0.05 to 0.25, due to increased state-action coverage, but as ε goes higher (0.5, 1.0), state-space coverage decreases again, since the (near-)random policy is less likely to reach a state far from the initial state. BCQ, KL-Q, and SPIBB each follow the behavior policy in some way and show different performance characteristics across the data sets. Their underperformance relative to BRPO is more prominent for very low or very high ε, suggesting deficiencies due to overly conservative updates or to following the behavior policy too closely, in regimes where BRPO is still able to learn. Since BRPO exploits the statistics of each (s, a) pair in the batch data, it achieves good performance in almost all scenarios, outperforming the baselines. This stable performance and robustness across various scenarios make BRPO an appealing algorithm for batch/offline RL in real-world settings, where it is usually difficult to estimate the amount of exploration required prior to training, given access only to batch data.

## 7 Concluding Remarks

We have presented Batch Residual Policy Optimization (BRPO) for learning residual policies in batch RL settings. Inspired by CPI, we derived learning rules for jointly optimizing both the candidate policy and the state-action-dependent confidence mixture of a residual policy to maximize a conservative lower bound on policy performance. BRPO is thus more exploitative in areas of the state space that are well-covered by the batch data, and more conservative in others. While we have shown successful application of BRPO to various benchmarks, future work includes deriving a finite-sample analysis of BRPO and applying BRPO to more practical batch domains (e.g., robotic manipulation, recommendation systems).

## References

[Abbasi-Yadkori et al., 2019] Y. Abbasi-Yadkori, P. Bartlett, X. Chen, and A. Malek. Large-scale Markov decision problems via the linear programming dual. arXiv:1901.01992, 2019.

[Amos and Kolter, 2017] B. Amos and Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, pages 136-145, 2017.

[Bolley and Villani, 2005] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities.
In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pages 331-352, 2005.

[Boyd and Vandenberghe, 2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[Brockman et al., 2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

[Farias and Roy, 2003] P. De Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850-865, 2003.

[Faybusovich and Moore, 1997] L. Faybusovich and J. Moore. Infinite-dimensional quadratic optimization: interior-point methods and control applications. Applied Mathematics and Optimization, 36(1):43-66, 1997.

[Fujimoto et al., 2018] S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. arXiv:1812.02900, 2018.

[Fujimoto et al., 2019] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv:1910.01708, 2019.

[Ghavamzadeh et al., 2016] M. Ghavamzadeh, M. Petrik, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. In NIPS, 2016.

[Gretton et al., 2007] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel approach to comparing distributions. In AAAI, 2007.

[Haarnoja et al., 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.

[Hasselt et al., 2016] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.

[Hunter and Lange, 2004] D. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30-37, 2004.

[Jaques et al., 2019] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv:1907.00456, 2019.

[Johannink et al., 2019] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. In ICRA, pages 6023-6029. IEEE, 2019.

[Kakade and Langford, 2002] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, pages 267-274, 2002.

[Kober and Peters, 2010] J. Kober and J. Peters. Imitation and reinforcement learning. IEEE Robotics & Automation Magazine, 17(2):55-62, 2010.

[Kumar et al., 2019] A. Kumar, J. Fu, G. Tucker, and S. Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv:1906.00949, 2019.

[Lange et al., 2012] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45-73. Springer, 2012.

[Laroche and Trichelair, 2017] R. Laroche and P. Trichelair. Safe policy improvement with baseline bootstrapping. arXiv:1712.06924, 2017.

[Lillicrap et al., 2015] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.

[Mahmood et al., 2014] A. Mahmood, H. van Hasselt, and R. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In NIPS, pages 3014-3022, 2014.

[Mnih et al., 2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
[Nachum et al., 2019] O. Nachum, M. Ahn, H. Ponte, S. Gu, and V. Kumar. Multi-agent manipulation via locomotion using hierarchical sim2real. In CoRL, 2019.

[Pirotta et al., 2013] M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. In ICML, pages 307-315, 2013.

[Rusu et al., 2015] A. Rusu, S. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv:1511.06295, 2015.

[Schulman et al., 2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889-1897, 2015.

[Shapiro et al., 2009] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.

[Silver et al., 2018] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning. arXiv:1812.06298, 2018.

[Xu and Yin, 2013] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758-1789, 2013.