# CAQL: Continuous Action Q-Learning

Moonkyung Ryu\*, Yinlam Chow\*, Ross Anderson, Christian Tjandraatmadja, Craig Boutilier
Google Research
{mkryu,yinlamchow,rander,ctjandra,cboutilier}@google.com

## ABSTRACT

Value-based reinforcement learning (RL) methods like Q-learning have shown success in a variety of domains. One challenge in applying Q-learning to continuous-action RL problems, however, is the continuous action maximization (max-Q) required for optimal Bellman backup. In this work, we develop CAQL, a (class of) algorithm(s) for continuous-action Q-learning that can use several plug-and-play optimizers for the max-Q problem. Leveraging recent optimization results for deep neural networks, we show that max-Q can be solved optimally using mixed-integer programming (MIP). When the Q-function representation has sufficient power, MIP-based optimization gives rise to better policies and is more robust than approximate methods (e.g., gradient ascent, cross-entropy search). We further develop several techniques to accelerate inference in CAQL, which, despite their approximate nature, perform well. We compare CAQL with state-of-the-art RL algorithms on benchmark continuous-control problems that have different degrees of action constraints and show that CAQL outperforms policy-based methods in heavily constrained environments, often dramatically.

## 1 INTRODUCTION

Reinforcement learning (RL) has shown success in a variety of domains such as games (Mnih et al., 2013) and recommender systems (RSs) (Gauci et al., 2018). When the action space is finite, value-based algorithms such as Q-learning (Watkins & Dayan, 1992), which implicitly finds a policy by learning the optimal value function, are often very efficient because action optimization can be done by exhaustive enumeration. By contrast, in problems with a continuous action space (e.g., robotics (Peters & Schaal, 2006)), policy-based algorithms, such as policy gradient (PG) (Sutton et al., 2000; Silver et al., 2014) or cross-entropy policy search (CEPS) (Mannor et al., 2003; Kalashnikov et al., 2018), which directly learn a return-maximizing policy, have proven more practical. Recently, methods such as ensemble critic (Fujimoto et al., 2018) and entropy regularization (Haarnoja et al., 2018) have been developed to improve the performance of policy-based RL algorithms.

Policy-based approaches require a reasonable choice of policy parameterization. In some continuous control problems, a Gaussian distribution over actions, conditioned on some state representation, is used. However, in applications such as RSs, where actions often take the form of high-dimensional item-feature vectors, policies cannot typically be modeled by common action distributions. Furthermore, the admissible action set in RL is often constrained in practice, for example, when actions must lie within a specific range for safety (Chow et al., 2018). In RSs, the admissible actions are often random functions of the state (Boutilier et al., 2018). In such cases, it is non-trivial to define policy parameterizations that handle such factors. On the other hand, value-based algorithms are well-suited to these settings, providing a potential advantage over policy methods. Moreover, at least with linear function approximation (Melo & Ribeiro, 2007), under reasonable assumptions, Q-learning converges to optimality, while such optimality guarantees for non-convex policy-based methods are generally limited (Fazel et al., 2018).
Empirical results also suggest that value-based methods are more data-efficient and less sensitive to hyperparameters (Quillen et al., 2018). Of course, with large action spaces, exhaustive action enumeration in value-based algorithms can be expensive; one solution is to represent actions with continuous features (Dulac-Arnold et al., 2015).

The main challenge in applying value-based algorithms to continuous-action domains is selecting optimal actions (both at training and inference time). Previous work in this direction falls into three broad categories. The first solves the inner maximization of the (optimal) Bellman residual loss using global nonlinear optimizers, such as the cross-entropy method (CEM) for QT-Opt (Kalashnikov et al., 2018), gradient ascent (GA) for actor-expert (Lim et al., 2018), and action discretization (Uther & Veloso, 1998; Smart & Kaelbling, 2000; Lazaric et al., 2008). However, these approaches do not guarantee optimality. The second approach restricts the Q-function parameterization so that the optimization problem is tractable. For instance, one can discretize the state and action spaces and use a tabular Q-function representation. However, due to the curse of dimensionality, discretizations must generally be coarse, often resulting in unstable control. Millán et al. (2002) circumvent this issue by averaging discrete actions weighted by their Q-values. Wire-fitting (Gaskett et al., 1999; Baird III & Klopf, 1993) approximates Q-values piecewise-linearly over a discrete set of points, chosen to ensure the maximum action is one of the extreme points. The normalized advantage function (NAF) (Gu et al., 2016) constructs the state-action advantage function to be quadratic, hence analytically solvable. Parameterizing the Q-function with an input-convex neural network (Amos et al., 2017) ensures it is concave. These restricted functional forms, however, may degrade performance if the domain does not conform to the imposed structure. The third category replaces optimal Q-values with a soft counterpart (Haarnoja et al., 2018): an entropy regularizer ensures that both the optimal Q-function and policy have closed-form solutions. However, the sub-optimality gap of this soft policy scales with the interval and dimensionality of the action space (Neu et al., 2017).

Motivated by the shortcomings of prior approaches, we propose Continuous Action Q-learning (CAQL), a Q-learning framework for continuous actions in which the Q-function is modeled by a generic feed-forward neural network.¹ Our contribution is three-fold. First, we develop the CAQL framework, which minimizes the Bellman residual in Q-learning using one of several plug-and-play action optimizers. We show that max-Q optimization, when the Q-function is approximated by a deep ReLU network, can be formulated as a mixed-integer program (MIP) that solves max-Q optimally. When the Q-function has sufficient representation power, MIP-based optimization induces better policies and is more robust than methods (e.g., CEM, GA) that approximate the max-Q solution. Second, to improve CAQL's practicality for larger-scale applications, we develop three speed-up techniques for computing max-Q values: (i) dynamic tolerance; (ii) dual filtering; and (iii) clustering. Third, we compare CAQL with several state-of-the-art RL algorithms on several benchmark problems with varying degrees of action constraints.

¹ Results can be extended to handle convolutional NNs, but are omitted for brevity.
Value-based CAQL is generally competitive, and outperforms policy-based methods in heavily constrained environments, sometimes significantly. We also study the effects of our speed-ups through ablation analysis.

## 2 PRELIMINARIES

We consider an infinite-horizon, discounted Markov decision process (Puterman, 2014) with states $X$, (continuous) action space $A$, reward function $R$, transition kernel $P$, initial state distribution $\beta$ and discount factor $\gamma \in [0, 1)$, all having the usual meaning. A (stationary, Markovian) policy $\pi$ specifies a distribution $\pi(\cdot|x)$ over actions to be taken at state $x$. Let $\Pi$ be the set of such policies. The expected cumulative return of $\pi \in \Pi$ is $J(\pi) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid P, R, x_0 \sim \beta, \pi\right]$. An optimal policy $\pi^*$ satisfies $\pi^* \in \arg\max_{\pi \in \Pi} J(\pi)$. The Bellman operator $F[Q](x, a) = R(x, a) + \gamma \sum_{x' \in X} P(x'|x, a) \max_{a' \in A} Q(x', a')$ over the state-action value function $Q$ has unique fixed point $Q^*(x, a)$ (Puterman, 2014), which is the optimal Q-function $Q^*(x, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, a_t) \mid x_0 = x, a_0 = a, \pi^*\right]$. An optimal (deterministic) policy $\pi^*$ can be extracted from $Q^*$: $\pi^*(a|x) = \mathbf{1}\{a = a^*(x)\}$, where $a^*(x) \in \arg\max_a Q^*(x, a)$.

For large or continuous state/action spaces, the optimal Q-function can be approximated, e.g., using a deep neural network (DNN) as in DQN (Mnih et al., 2013). In DQN, the value function $Q_\theta$ is updated using the value label $r + \gamma \max_{a'} Q_{\theta^{\text{target}}}(x', a')$, where $Q_{\theta^{\text{target}}}$ is a target Q-function. Instead of training these weights jointly, $\theta^{\text{target}}$ is updated in a separate iterative fashion using the previous $\theta$ for a fixed number of training steps, or by averaging $\theta^{\text{target}} \leftarrow \tau\theta + (1-\tau)\theta^{\text{target}}$ for some small momentum weight $\tau \in [0, 1]$ (Mnih et al., 2016). DQN is off-policy: the target is valid no matter how the experience was generated (as long as it is sufficiently exploratory). Typically, the loss is minimized over mini-batches $B$ of past data $(x, a, r, x')$ sampled from a large experience replay buffer $R$ (Lin & Mitchell, 1992). One common loss function for training $Q_\theta$ is mean squared Bellman error: $\min_\theta \sum_{i=1}^{|B|} \big(Q_\theta(x_i, a_i) - r_i - \gamma \max_{a'} Q_{\theta^{\text{target}}}(x'_i, a')\big)^2$. Under this loss, RL can be viewed as $\ell_2$-regression of $Q_\theta(\cdot, \cdot)$ w.r.t. target labels $r + \gamma \max_{a'} Q_{\theta^{\text{target}}}(x', a')$. We augment DQN, using double Q-learning for more stable training (Hasselt et al., 2016), whose loss is:

$$\min_\theta \sum_{i=1}^{|B|} \Big( r_i + \gamma Q_{\theta^{\text{target}}}\big(x'_i, \arg\max_{a'} Q_\theta(x'_i, a')\big) - Q_\theta(x_i, a_i) \Big)^2. \tag{1}$$

A hinge loss can also be used in Q-learning, and has connections to the linear programming (LP) formulation of the MDP (Puterman, 2014). The optimal Q-network weights can be specified as:

$$\min_\theta \frac{1}{|B|} \sum_{i=1}^{|B|} Q_\theta(x_i, a_i) + \lambda \Big( r_i + \gamma \max_{a' \in A} Q_\theta(x'_i, a') - Q_\theta(x_i, a_i) \Big)_+,$$

where $\lambda > 0$ is a tunable penalty w.r.t. the constraint $r + \gamma \max_{a' \in A} Q_\theta(x', a') \le Q_\theta(x, a)$, $\forall (x, a, r, x') \in B$. To stabilize training, we replace the Q-network of the inner maximization with the target Q-network and the optimal Q-value with the double-Q label, giving (see Appendix A for details):

$$\min_\theta \frac{1}{|B|} \sum_{i=1}^{|B|} Q_\theta(x_i, a_i) + \lambda \Big( r_i + \gamma Q_{\theta^{\text{target}}}\big(x'_i, \arg\max_{a'} Q_\theta(x'_i, a')\big) - Q_\theta(x_i, a_i) \Big)_+. \tag{2}$$

In this work, we assume the Q-function approximation $Q_\theta$ to be a feed-forward network. Specifically, let $Q_\theta$ be a $K$-layer feed-forward NN with state-action input $(x, a)$ (where $a$ lies in a $d$-dimensional real vector space) and hidden layers arranged according to the equations:

$$z_1 = (x, a), \qquad \hat{z}_j = W_{j-1} z_{j-1} + b_{j-1}, \qquad z_j = h(\hat{z}_j), \quad j = 2, \ldots, K, \qquad Q_\theta(x, a) := c^\top \hat{z}_K, \tag{3}$$
where $(W_j, b_j)$ are the multiplicative and bias weights, $c$ is the output weight of the Q-network, $\theta = \big(c, \{(W_j, b_j)\}_{j=1}^{K-1}\big)$ are the weights of the Q-network, $\hat{z}_j$ denotes pre-activation values at layer $j$, and $h(\cdot)$ is the (component-wise) activation function.² For simplicity, in the following analysis, we restrict our attention to the case where the activation functions are ReLUs. We also assume that the action space $A$ is a $d$-dimensional $\ell_\infty$-ball $B_\infty(\bar{a}, \Delta)$ with some radius $\Delta > 0$ and center $\bar{a}$. Therefore, at any arbitrary state $x \in X$, the max-Q problem can be re-written as

$$q^*_x := \max_{a \in A} Q_\theta(x, a) = \max_{\{\hat{z}_j\}_{j=2}^{K},\, \{z_j\}_{j=2}^{K-1}} \Big\{ c^\top \hat{z}_K : z_1 = (x, a),\ a \in B_\infty(\bar{a}, \Delta),\ \text{eqs. (3)} \Big\}.$$

While the above formulation is intuitive, the nonlinear equality constraints in the neural network formulation (3) make this problem non-convex and NP-hard (Katz et al., 2017).

² Without loss of generality, we simplify the NN by omitting the output bias and output activation function.

## 3 CONTINUOUS ACTION Q-LEARNING ALGORITHM

Policy-based methods (Silver et al., 2014; Fujimoto et al., 2018; Haarnoja et al., 2018) have been widely used to handle continuous actions in RL. However, they suffer from several well-known difficulties, e.g., (i) modeling high-dimensional action distributions, (ii) handling action constraints, and (iii) data-inefficiency. Motivated by earlier work on value-based RL methods, such as QT-Opt (Kalashnikov et al., 2018) and actor-expert (Lim et al., 2018), we propose Continuous Action Q-learning (CAQL), a general framework for continuous-action value-based RL, in which the Q-function is parameterized by a NN (Eq. 3). One novelty of CAQL is the formulation of the max-Q problem, i.e., the inner maximization in (1) and (2), as a mixed-integer program (MIP). The benefit of the MIP formulation is that it guarantees that we find the optimal action (and its true bootstrapped Q-value) when computing target labels (and at inference time). We show empirically that this can induce better performance, especially when the Q-network has sufficient representation power. Moreover, since MIP can readily model linear and combinatorial constraints, it offers considerable flexibility when incorporating complex action constraints in RL.

That said, finding the optimal Q-label (e.g., with MIP) is computationally intensive. To alleviate this, we develop several approximation methods to systematically reduce the computational demands of the inner maximization. In Sec. 3.2, we introduce the action function to approximate the arg-max policy at inference time, and in Sec. 4 we propose three techniques, dynamic tolerance, dual filtering, and clustering, to speed up max-Q computation during training.

### 3.1 PLUG-N-PLAY MAX-Q OPTIMIZERS

In this section, we illustrate how the max-Q problem, with the Q-function represented by a ReLU network, can be formulated as a MIP, which can be solved using off-the-shelf optimization packages (e.g., SCIP (Gleixner et al., 2018), CPLEX (CPLEX, 2019), Gurobi (Gurobi, 2019)). In addition, we detail how approximate optimizers, specifically gradient ascent (GA) and the cross-entropy method (CEM), can trade optimality for speed in max-Q computation within CAQL. For ease of exposition, we focus on Q-functions parameterized by a feedforward ReLU network. Extending our methodology (including the MIP formulation) to convolutional networks (with ReLU activation and max pooling) is straightforward (see Anderson et al. (2019)). While GA and CEM can handle generic activation functions beyond ReLU, our MIP requires additional approximations for those that are not piecewise linear.
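For concreteness, the layered form in Eq. (3) is compact enough to state directly in code. Below is a minimal NumPy sketch of the forward pass of a $K$-layer ReLU Q-network; the weight lists `Ws`, `bs` and output vector `c` are assumed inputs (e.g., extracted from a trained network) and are illustrative, not part of the authors' implementation.

```python
import numpy as np

def q_forward(x, a, Ws, bs, c):
    """Evaluate the K-layer ReLU Q-network of Eq. (3).

    Ws, bs: lists of weight matrices W_1..W_{K-1} and biases b_1..b_{K-1}.
    c: output weight vector; returns Q_theta(x, a) = c^T z_hat_K.
    """
    z = np.concatenate([x, a])          # z_1 = (x, a)
    for W, b in zip(Ws[:-1], bs[:-1]):
        z = np.maximum(0.0, W @ z + b)  # z_j = ReLU(z_hat_j), j = 2..K-1
    z_hat_K = Ws[-1] @ z + bs[-1]       # final pre-activation (output is linear)
    return float(c @ z_hat_K)
```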
**Mixed-Integer Programming (MIP).** A trained feed-forward ReLU network can be modeled as a MIP by formulating the nonlinear activation function at each neuron with binary constraints. Specifically, for a ReLU with pre-activation function of the form $z = \max\{0, w^\top x + b\}$, where $x \in [\ell, u]$ is a $d$-dimensional bounded input, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $\ell, u \in \mathbb{R}^d$ are the weights, bias and lower-upper bounds respectively, consider the following set with a binary variable $\zeta$ indicating whether the ReLU is active or not:

$$R(w, b, \ell, u) = \Big\{ (x, z, \zeta) : z \ge w^\top x + b,\ z \ge 0,\ z \le w^\top x + b - M^-(1 - \zeta),\ z \le M^+\zeta,\ (x, z, \zeta) \in [\ell, u] \times \mathbb{R} \times \{0, 1\} \Big\}.$$

In this formulation, both $M^+ = \max_{x \in [\ell,u]} w^\top x + b$ and $M^- = \min_{x \in [\ell,u]} w^\top x + b$ can be computed in time linear in $d$. We assume $M^+ > 0$ and $M^- < 0$; otherwise the function can be replaced by $z = 0$ or $z = w^\top x + b$. These constraints ensure that $z$ is the output of the ReLU: if $\zeta = 0$, then they reduce to $z = 0 \ge w^\top x + b$, and if $\zeta = 1$, then they become $z = w^\top x + b \ge 0$. This can be extended to the ReLU network in (3) by chaining copies of intermediate ReLU formulations. More precisely, if the ReLU Q-network has $m_j$ neurons in layer $j \in \{2, \ldots, K\}$, then for any given state $x \in X$, the max-Q problem can be reformulated as the following MIP:

$$q^*_x = \max\ c^\top \hat{z}_K \tag{4}$$
$$\text{s.t.} \quad z_1 := a \in B_\infty(\bar{a}, \Delta), \quad (z_{j-1}, z_{j,i}, \zeta_{j,i}) \in R(W_{j,i}, b_{j,i}, \ell_{j-1}, u_{j-1}), \quad j \in \{2, \ldots, K\},\ i \in \{1, \ldots, m_j\},$$

where $\ell_1 = \bar{a} - \Delta$, $u_1 = \bar{a} + \Delta$ are the (action) input-bound vectors. Since the output layer of the ReLU NN is linear, the MIP objective is linear as well. Here, $W_{j,i} \in \mathbb{R}^{m_j}$ and $b_{j,i} \in \mathbb{R}$ are the weights and bias of neuron $i$ in layer $j$. Furthermore, $\ell_j, u_j$ are interval bounds for the outputs of the neurons in layer $j$ for $j \ge 2$, and computing them can be done via interval arithmetic or other propagation methods (Weng et al., 2018) from the initial action space bounds (see Appendix C for details). As detailed by Anderson et al. (2019), this formulation can be further tightened with additional constraints, and its implementation can be found in the tf.opt package described therein. As long as these bounds are redundant, having these additional box constraints will not affect optimality.

We emphasize that the MIP returns provably global optima, unlike GA and CEM. Even when interrupted with stopping conditions such as a time limit, MIP often produces high-quality solutions in practice. In theory, this MIP formulation can be solved in time exponential in the number of ReLUs and polynomial in the input size (e.g., by naively solving an LP for each binary variable assignment). In practice, however, a modern MIP solver combines many different techniques to significantly speed up this process, such as branch-and-bound, cutting planes, preprocessing techniques, and primal heuristics (Linderoth & Savelsbergh, 1999). Versions of this MIP model have been used in neural network verification (Cheng et al., 2017; Lomuscio & Maganti, 2017; Bunel et al., 2018; Dutta et al., 2018; Fischetti & Jo, 2018; Anderson et al., 2019; Tjeng et al., 2019) and analysis (Serra et al., 2018; Kumar et al., 2019), but its application to RL is novel. While Say et al. (2017) also proposed a MIP formulation to solve the planning problem with a non-linear state transition dynamics model learned with a NN, it differs from ours, which solves the max-Q problem.
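To make the big-M encoding concrete, here is a minimal sketch of the max-Q MIP (4) using the Gurobi Python API; any MIP solver with linear and binary constraints would do equally well. The weights, precomputed bounds (`lbs`, `ubs`), and helper names are assumptions for illustration; this is not the authors' tf.opt implementation.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def max_q_mip(x, Ws, bs, c, a_lo, a_hi, lbs, ubs, time_limit=60.0):
    """Solve max_a Q_theta(x, a) via the big-M MIP of Eq. (4).

    lbs[j], ubs[j]: pre-activation bounds of hidden layer j (e.g., from
    interval arithmetic). The state x is fixed; only the action is a variable.
    """
    m = gp.Model("max_q")
    m.Params.OutputFlag = 0
    m.Params.TimeLimit = time_limit

    a = m.addVars(len(a_lo), lb=list(a_lo), ub=list(a_hi), name="a")
    z_prev = list(x) + [a[i] for i in range(len(a_lo))]    # z_1 = (x, a)

    for j, (W, b) in enumerate(zip(Ws[:-1], bs[:-1])):     # hidden ReLU layers
        z = []
        for i in range(W.shape[0]):
            pre = gp.quicksum(W[i, k] * z_prev[k] for k in range(W.shape[1])) + b[i]
            lo, hi = lbs[j][i], ubs[j][i]
            if hi <= 0:                      # ReLU provably inactive: z = 0
                z.append(0.0)
                continue
            if lo >= 0:                      # ReLU provably active: z = pre
                zi = m.addVar(lb=lo, ub=hi)
                m.addConstr(zi == pre)
                z.append(zi)
                continue
            zi = m.addVar(lb=0.0, ub=hi)     # unstable neuron: big-M encoding
            zeta = m.addVar(vtype=GRB.BINARY)
            m.addConstr(zi >= pre)                       # z >= w^T x + b
            m.addConstr(zi <= pre - lo * (1 - zeta))     # z <= pre - M^-(1 - zeta)
            m.addConstr(zi <= hi * zeta)                 # z <= M^+ zeta
            z.append(zi)
        z_prev = z

    W, b = Ws[-1], bs[-1]                                # output pre-activation
    out = [gp.quicksum(W[i, k] * z_prev[k] for k in range(W.shape[1])) + b[i]
           for i in range(W.shape[0])]
    m.setObjective(gp.quicksum(c[i] * out[i] for i in range(len(out))), GRB.MAXIMIZE)
    m.optimize()
    return np.array([a[i].X for i in range(len(a_lo))]), m.objVal
```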
**Gradient Ascent.** GA (Nocedal & Wright, 2006) is a simple first-order optimization method for finding a (local) optimum of a differentiable objective function, such as a neural network Q-function. At any state $x \in X$, given a seed action $a_0$, the optimal action $\arg\max_a Q_\theta(x, a)$ is computed iteratively by $a_{t+1} \leftarrow a_t + \eta \nabla_a Q_\theta(x, a_t)$, where $\eta > 0$ is a step size (either a tunable parameter or computed using back-tracking line search (Nocedal & Yuan, 1998)). This process repeats until convergence, $|Q_\theta(x, a_{t+1}) - Q_\theta(x, a_t)| < \epsilon$, or a maximum iteration count is reached.

**Cross-Entropy Method.** CEM (Rubinstein, 1999) is a derivative-free optimization algorithm. At any given state $x \in X$, it samples a batch of $N$ actions $\{a_i\}_{i=1}^N$ from $A$ using a fixed distribution (e.g., a Gaussian) and ranks the corresponding Q-values $\{Q_\theta(x, a_i)\}_{i=1}^N$. Using the top $K < N$ actions, it then updates the sampling distribution, e.g., using the sample mean and covariance to update the Gaussian. This is repeated until convergence or a maximum iteration count is reached.
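As a concrete reference point, here is a minimal NumPy sketch of CEM applied to the max-Q problem. The diagonal-Gaussian parameterization, elite count, and clipping to the action box are illustrative choices, not details from the paper.

```python
import numpy as np

def cem_max_q(q_fn, x, a_lo, a_hi, n_samples=64, n_elite=8, n_iters=10):
    """Approximate max_a Q(x, a) over the box [a_lo, a_hi] with CEM."""
    mu = (a_lo + a_hi) / 2.0
    sigma = (a_hi - a_lo) / 2.0
    for _ in range(n_iters):
        actions = np.clip(np.random.normal(mu, sigma, size=(n_samples, len(mu))),
                          a_lo, a_hi)                        # sample candidates
        values = np.array([q_fn(x, a) for a in actions])     # rank by Q-value
        elite = actions[np.argsort(values)[-n_elite:]]       # keep top-K actions
        mu = elite.mean(axis=0)                              # refit the Gaussian
        sigma = elite.std(axis=0) + 1e-6
    return mu, q_fn(x, mu)
```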
### 3.2 ACTION FUNCTION

In traditional Q-learning, the policy $\pi^*$ is implemented by acting greedily w.r.t. the learned Q-function: $\pi^*(x) = \arg\max_a Q_\theta(x, a)$.³ However, computing the optimal action can be expensive in the continuous case, which may be especially problematic at inference time (e.g., when computational power is limited in, say, embedded systems, or real-time response is critical). To mitigate the problem, we can use an action function $\pi_w : X \to A$, effectively a trainable actor network, to approximate the greedy-action mapping $\pi^*$. We train $\pi_w$ using training data $B = \{(x_i, q^*_i)\}_{i=1}^{|B|}$, where $q^*_i$ is the max-Q label at state $x_i$. Action function learning is then simply a supervised regression problem: $w^* \in \arg\min_w \sum_{i=1}^{|B|} (q^*_i - Q_\theta(x_i, \pi_w(x_i)))^2$. This is similar to the notion of distilling an optimal policy from max-Q labels, as in actor-expert (Lim et al., 2018). Unlike actor-expert, where a separate stochastic policy network is jointly learned with the Q-function to maximize the likelihood with the underlying optimal policy, our method learns a state-action mapping to approximate $\arg\max_a Q_\theta(x, a)$; this does not require distribution matching and is generally more stable. The use of the action function in CAQL is optional, serving to accelerate data collection and inference.

³ Some exploration strategy may be incorporated as well.

## 4 ACCELERATING MAX-Q COMPUTATION

In this section, we propose three methods to speed up the computationally-expensive max-Q solution during training: (i) dynamic tolerance, (ii) dual filtering, and (iii) clustering.

**Dynamic Tolerance.** Tolerance plays a critical role in the stopping condition of nonlinear optimizers. Intuitively, in the early phase of CAQL, when the Q-function estimate has high Bellman error, it may be wasteful to compute a highly accurate max-Q label when a crude estimate can already guide the gradient of CAQL to minimize the Bellman residual. We can speed up the max-Q solver by dynamically adjusting its tolerance $\tau > 0$ based on (a) the TD-error, which measures the estimation error of the optimal Q-function, and (b) the training step $t > 0$, which ensures the bias of the gradient (induced by the sub-optimality of the max-Q solver) vanishes asymptotically so that CAQL converges to a stationary point. While relating the tolerance to the Bellman residual is intuitive, the residual cannot be computed without knowing the max-Q label. To resolve this circular dependency, notice that the action function $\pi_w$ approximates the optimal policy, i.e., $\pi_w(x) \approx \arg\max_a Q_\theta(x, a)$. We therefore replace the optimal policy with the action function in the Bellman residual and propose the dynamic tolerance:

$$\tau_t := \frac{1}{|B|} \sum_{i=1}^{|B|} \Big| r_i + \gamma Q_{\theta^{\text{target}}_t}\big(x'_i, \pi_{w_t}(x'_i)\big) - Q_{\theta_t}(x_i, a_i) \Big| \cdot k_1 k_2^t,$$

where $k_1 > 0$ and $k_2 \in [0, 1)$ are tunable parameters. Under standard assumptions, CAQL with dynamic tolerance $\{\tau_t\}$ converges a.s. to a stationary point (Thm. 1, (Carden, 2014)).

**Dual Filtering.** The main motivation of dual filtering is to reduce the number of max-Q problems solved at each CAQL training step. For illustration, consider the formulation of hinge Q-learning in (2). Denote by $q^*_{x',\theta^{\text{target}}}$ the max-Q label w.r.t. the target Q-network and next state $x'$. The structure of the hinge penalty means the TD-error corresponding to sample $(x, a, r, x')$ is inactive whenever $q^*_{x',\theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$; this data can be discarded. In dual filtering, we efficiently estimate an upper bound on $q^*_{x',\theta^{\text{target}}}$ using a convex relaxation to determine which data can be discarded before max-Q optimization. Specifically, recall that the main source of non-convexity in (3) comes from the equality constraint of the ReLU activation function at each NN layer. As in the MIP formulation, assume we have component-wise bounds $(l_j, u_j)$, $j = 2, \ldots, K-1$, on the neurons, such that $l_j \le \hat{z}_j \le u_j$. The ReLU equality constraint $H(l, u) := \{(h, k) \in \mathbb{R}^2 : h \in [l, u], k = [h]_+\}$ can be relaxed using a convex outer-approximation (Wong & Kolter, 2017): $\tilde{H}(l, u) := \{(h, k) \in \mathbb{R}^2 : k \ge h,\ k \ge 0,\ -uh + (u - l)k \le -ul\}$. We use this approximation to define the relaxed NN equations, which replace the nonlinear equality constraints in (3) with the convex set $\tilde{H}(l, u)$. We denote the optimal Q-value w.r.t. the relaxed NN as $\tilde{q}_{x'}$, which is by definition an upper bound on $q^*_{x'}$. Hence, the condition $\tilde{q}_{x',\theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$ is a conservative certificate for checking whether the data $(x, a, r, x')$ is inactive. For further speed-up, we estimate $\tilde{q}_{x'}$ with its dual upper bound $\check{q}_{x'}$ (see Appendix C for derivations):

$$\check{q}_{x'} := (\hat{\nu}^a_1)^\top \bar{a} + \Delta\,\|\hat{\nu}^a_1\|_q + (\hat{\nu}^x_1)^\top x' + \sum_{j=2}^{K-1} \sum_{s \in I_j} l_j(s)\,[\nu_j(s)]_+ + \sum_{j=1}^{K-1} \nu_{j+1}^\top b_j,$$

where $\nu$ is defined by the following recursion ("dual network"): $\nu_K := c$, $\hat{\nu}_j := W_j^\top \nu_{j+1}$ for $j = 1, \ldots, K-1$, and $\nu_j := D_j \hat{\nu}_j$ for $j = 2, \ldots, K-1$, where $D_j$ is a diagonal matrix with $[D_j](s, s) = \mathbf{1}\{s \in I^+_j\} + \frac{u_j(s)}{u_j(s) - l_j(s)} \mathbf{1}\{s \in I_j\}$. (Here $I^-_j$, $I^+_j$, and $I_j$ denote the sets of neurons in layer $j$ whose bounds satisfy $u_j(s) \le 0$, $l_j(s) \ge 0$, and $l_j(s) < 0 < u_j(s)$, respectively; see the case analysis in Appendix C.1.) We then replace the above certificate with an even more conservative one: $\check{q}_{x',\theta^{\text{target}}} \le (Q_\theta(x, a) - r)/\gamma$. Although dual filtering is derived for hinge Q-learning, it also applies to the $\ell_2$-loss counterpart by replacing the optimal value $q^*_{x',\theta^{\text{target}}}$ with its dual upper-bound estimate $\check{q}_{x',\theta^{\text{target}}}$ whenever the verification condition holds (i.e., the TD error is negative). Since the dual estimate is greater than the primal, the modified loss function will be a lower bound of the original in (1), i.e., $(r + \gamma \check{q}_{x',\theta^{\text{target}}} - Q_\theta(x, a))^2 \le (r + \gamma q^*_{x',\theta^{\text{target}}} - Q_\theta(x, a))^2$ whenever $r + \gamma \check{q}_{x',\theta^{\text{target}}} - Q_\theta(x, a) \le 0$, which can stabilize training by reducing over-estimation error. One can also utilize the inactive samples in the action function ($\pi_w$) learning problem by replacing the max-Q label $q^*_{x',\theta}$ with its dual approximation $\check{q}_{x',\theta}$. Since $\check{q}_{x',\theta} \ge q^*_{x',\theta}$, this replacement will not affect optimality.⁴

⁴ One can also use the upper bound $\check{q}_{x',\theta}$ as the label for training the action function in Section 3.2. Empirically, this approach often leads to better policy performance.
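A short sketch of how these two speed-ups might sit inside a training loop follows. The helper `dual_upper_bound` stands in for the dual estimate $\check{q}_{x'}$ derived in Appendix C, and the batch layout (`r`, `xp`, `q_sa` arrays) is an assumption for illustration.

```python
import numpy as np

def dynamic_tolerance(batch, q_target, pi_w, gamma, k1, k2, t):
    """tau_t = mean |r + gamma * Q_target(x', pi_w(x')) - Q(x, a)| * k1 * k2^t."""
    boot = np.array([q_target(xp, pi_w(xp)) for xp in batch["xp"]])
    resid = np.abs(batch["r"] + gamma * boot - batch["q_sa"])
    return resid.mean() * k1 * (k2 ** t)

def dual_filter(batch, dual_upper_bound, gamma):
    """Indices of samples whose hinge TD-term may still be active, i.e.,
    those with check_q(x') > (Q(x, a) - r) / gamma; the rest can be skipped."""
    return [i for i in range(len(batch["r"]))
            if dual_upper_bound(batch["xp"][i])
               > (batch["q_sa"][i] - batch["r"][i]) / gamma]
```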
**Clustering.** To reduce the number of max-Q solves further still, we apply online state aggregation (Meyerson, 2001), which picks a number of centroids from the batch of next states $B'$ as the centers of $p$-metric balls with radius $b > 0$, such that the union of these balls forms a minimum covering of $B'$. Specifically, at training step $t \in \{0, 1, \ldots\}$, denote by $C_t(b) \subseteq B'$ the set of next-state centroids. For each centroid $c' \in C_t(b)$, we compute the max-Q value $q^*_{c',\theta^{\text{target}}} = \max_{a'} Q_{\theta^{\text{target}}}(c', a')$, where $a^*_{c'}$ is the corresponding optimal action. For all remaining next states $x' \in B' \setminus C_t(b)$, we approximate their max-Q values via the first-order Taylor series expansion

$$\hat{q}_{x',\theta^{\text{target}}} := q^*_{c',\theta^{\text{target}}} + \Big\langle \nabla_{x'} Q_{\theta^{\text{target}}}(x', a')\big|_{x'=c',\, a'=a^*_{c'}},\ (x' - c') \Big\rangle,$$

in which $c'$ is the closest centroid to $x'$, i.e., $c' \in \arg\min_{c \in C_t(b)} \|x' - c\|_p$. By the envelope theorem for arbitrary choice sets (Milgrom & Segal, 2002), the gradient $\nabla_{x'} \max_{a'} Q_{\theta^{\text{target}}}(x', a')$ is equal to $\nabla_{x'} Q_{\theta^{\text{target}}}(x', a')|_{a'=a^*_{x'}}$. In this approach, the cluster radius $b > 0$ controls the number of max-Q computations, which trades complexity for accuracy in Bellman residual estimation. This parameter can either be tuned or adjusted dynamically (similar to dynamic tolerance), e.g., $b_t = k_3 k_4^t$ with hyperparameters $k_3 > 0$ and $k_4 \in [0, 1)$. Analogously, with this exponentially-decaying cluster-radius schedule we can argue that the bias of the CAQL gradient (induced by max-Q estimation error due to clustering) vanishes asymptotically, and the corresponding Q-function converges to a stationary point. A sketch of this approximation appears at the end of this section.

To combine clustering with dual filtering, we define $B'_{\text{df}}$ as the batch of next states that are inconclusive after dual filtering, i.e., $B'_{\text{df}} = \{x' \in B' : \check{q}_{x',\theta^{\text{target}}} > (Q_\theta(x, a) - r)/\gamma\}$. Then, instead of applying clustering to $B'$, we apply it to the refined batch $B'_{\text{df}}$.

Dynamic tolerance not only speeds up training, but also improves CAQL's performance (see Tables 4 and 5); thus, we recommend using it by default. Dual filtering and clustering both trade off training speed with performance. These are practical options with tunable parameters that allow practitioners to explore their utility in specific domains.
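Below is a minimal sketch of the clustering approximation described above, assuming helpers for an exact max-Q solve (`max_q`) and for the state-gradient of the target network (`grad_x_q`); both names are hypothetical stand-ins, not the paper's API.

```python
import numpy as np

def clustered_max_q(next_states, centroids, max_q, grad_x_q):
    """Approximate max-Q for every next state from a few exact solves.

    max_q(c) -> (q_star, a_star): exact max-Q value and optimal action at c.
    grad_x_q(c, a) -> gradient of Q_target w.r.t. the state at (c, a).
    """
    exact = [max_q(c) for c in centroids]        # solve max-Q at centroids only
    q_hat = []
    for xp in next_states:
        i = int(np.argmin([np.linalg.norm(xp - c) for c in centroids]))
        q_star, a_star = exact[i]
        # First-order Taylor expansion around the nearest centroid; the
        # envelope theorem lets us hold a_star fixed when differentiating.
        q_hat.append(q_star + grad_x_q(centroids[i], a_star) @ (xp - centroids[i]))
    return np.array(q_hat)
```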
## 5 EXPERIMENTS ON MUJOCO BENCHMARKS

To illustrate the effectiveness of CAQL, we (i) compare several CAQL variants with several state-of-the-art RL methods on multiple domains, and (ii) assess the trade-off between max-Q computation speed and policy quality via ablation analysis.

**Comparison with Baseline RL Algorithms.** We compare CAQL with four baseline methods: DDPG (Silver et al., 2014), TD3 (Fujimoto et al., 2018), and SAC (Haarnoja et al., 2018), three popular policy-based deep RL algorithms, and NAF (Gu et al., 2016), a value-based method using an action-quadratic Q-function. We train CAQL using three different max-Q optimizers: MIP, GA, and CEM. Note that CAQL-CEM is similar to QT-Opt (Kalashnikov et al., 2018) and CAQL-GA reflects some aspects of actor-expert (Lim et al., 2018). These CAQL variants allow assessment of the degree to which policy quality is impacted by Q-learning with the optimal Bellman residual (using MIP) rather than an approximation (using GA or CEM), at the cost of steeper computation. To match the implementations of the baselines, we use the ℓ2 loss when training CAQL. Further ablation analysis on CAQL with ℓ2 loss vs. hinge loss is provided in Appendix E.

We evaluate CAQL on one classical control benchmark (Pendulum) and five MuJoCo benchmarks (Hopper, Walker2D, HalfCheetah, Ant, Humanoid).⁵ Unlike most previous work, we evaluate the RL algorithms on domains not just with default action ranges, but also using smaller, constrained action ranges (see Table 6 in Appendix D for the action ranges used in our experiments).⁶ The motivation for this is two-fold: (i) to simulate real-world problems (Dulac-Arnold et al., 2019), where the restricted ranges represent safe/constrained action sets; (ii) to validate the hypothesis that action-distribution learning in policy-based methods cannot easily handle such constraints, while CAQL does so, illustrating its flexibility. For problems with longer horizons, a larger neural network is often required to learn a good policy, which in turn significantly increases the complexity of the MIP.

⁵ Since the objective of these experiments is largely to evaluate different RL algorithms, we do not exploit problem structures, e.g., symmetry, when training policies.

⁶ Smaller action ranges often induce easier MIP problems in max-Q computation. However, given the complexity of the MIP in more complex environments such as Walker2D, HalfCheetah, Ant, and Humanoid, we run experiments only with action ranges smaller than the defaults.

| Env. [Action range] | CAQL-MIP | CAQL-GA | CAQL-CEM | NAF | DDPG | TD3 | SAC |
|---|---|---|---|---|---|---|---|
| Pendulum [-0.66, 0.66] | -339.5 ± 158.3 | -342.4 ± 151.6 | -394.6 ± 246.5 | -449.4 ± 280.5 | -407.3 ± 180.3 | -488.8 ± 232.3 | -789.6 ± 299.5 |
| Pendulum [-1, 1] | -235.9 ± 122.7 | -237.0 ± 135.5 | -236.1 ± 116.3 | -312.7 ± 242.5 | -252.8 ± 163.0 | -279.5 ± 186.8 | -356.6 ± 288.7 |
| Pendulum [-2, 2] | -143.2 ± 161.0 | -145.5 ± 136.1 | -144.5 ± 208.8 | -145.2 ± 168.9 | -146.2 ± 257.6 | -142.3 ± 195.9 | -163.3 ± 190.1 |
| Hopper [-0.25, 0.25] | 343.2 ± 62.6 | 329.7 ± 59.4 | 276.9 ± 97.4 | 237.8 ± 100.0 | 252.2 ± 98.1 | 217.1 ± 73.7 | 309.3 ± 73.0 |
| Hopper [-0.5, 0.5] | 411.7 ± 115.2 | 341.7 ± 139.9 | 342.9 ± 142.1 | 248.2 ± 113.2 | 294.5 ± 108.7 | 280.1 ± 80.0 | 309.1 ± 95.8 |
| Hopper [-1, 1] | 459.8 ± 144.9 | 427.5 ± 151.2 | 417.2 ± 145.4 | 245.9 ± 140.7 | 368.2 ± 139.3 | 396.3 ± 132.8 | 372.3 ± 138.5 |
| Walker2D [-0.25, 0.25] | 276.3 ± 118.5 | 285.6 ± 97.6 | 283.7 ± 104.6 | 219.9 ± 120.8 | 270.4 ± 104.2 | 250.0 ± 78.3 | 284.0 ± 114.5 |
| Walker2D [-0.5, 0.5] | 288.9 ± 118.1 | 295.6 ± 113.9 | 304.7 ± 116.1 | 233.7 ± 99.4 | 259.0 ± 110.7 | 243.8 ± 116.4 | 287.0 ± 128.3 |
| HalfCheetah [-0.25, 0.25] | 394.8 ± 43.8 | 337.4 ± 60.0 | 339.1 ± 137.9 | 247.3 ± 96.0 | 330.7 ± 98.9 | 264.3 ± 142.2 | 325.9 ± 38.6 |
| HalfCheetah [-0.5, 0.5] | 718.6 ± 199.9 | 736.4 ± 122.8 | 686.7 ± 224.1 | 405.1 ± 243.2 | 456.3 ± 238.5 | 213.8 ± 214.6 | 614.8 ± 69.4 |
| Ant [-0.1, 0.1] | 402.3 ± 27.4 | 406.2 ± 32.6 | 378.2 ± 39.7 | 295.0 ± 44.2 | 374.0 ± 35.9 | 268.9 ± 73.2 | 281.4 ± 65.3 |
| Ant [-0.25, 0.25] | 413.1 ± 60.0 | 443.1 ± 65.6 | 451.4 ± 54.8 | 323.0 ± 60.8 | 444.2 ± 63.3 | 472.3 ± 61.9 | 399.3 ± 59.2 |
| Humanoid [-0.1, 0.1] | 405.7 ± 112.5 | 431.9 ± 244.8 | 397.0 ± 145.7 | 392.7 ± 169.9 | 494.4 ± 182.0 | 352.8 ± 150.3 | 456.3 ± 112.4 |
| Humanoid [-0.25, 0.25] | 460.2 ± 143.2 | 622.8 ± 158.1 | 529.8 ± 179.9 | 374.6 ± 126.5 | 582.1 ± 176.7 | 348.1 ± 106.3 | 446.9 ± 103.9 |

Table 1: The mean ± standard deviation of (95-percentile) final returns with the best hyperparameter configuration. CAQL significantly outperforms NAF on most benchmarks, as well as DDPG, TD3, and SAC on 11/14 benchmarks.
[Figure 1 appears here: per-environment training-curve panels for Pendulum [-0.66, 0.66] / [-1.0, 1.0] / [-2.0, 2.0], Hopper [-0.25, 0.25] / [-0.5, 0.5] / [-1.0, 1.0], Walker2D [-0.25, 0.25] / [-0.5, 0.5], HalfCheetah [-0.25, 0.25] / [-0.5, 0.5], Ant [-0.1, 0.1] / [-0.25, 0.25], and Humanoid [-0.1, 0.1] / [-0.25, 0.25]; x-axis: training steps (×1e3); methods: MIP, GA, CEM, NAF, DDPG, TD3, SAC.]

Figure 1: Mean cumulative reward of the best hyperparameter configuration over 10 random seeds. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps. The training curves with standard deviation are given in Figure 4 in Appendix E.

To reduce this computational cost, we reduce episode length in each experiment from 1000 to 200 steps, and parameterize the Q-function with a relatively simple 32×16 feedforward ReLU network. With shorter episodes and smaller networks, the returns in our experiments are lower than those reported in state-of-the-art RL benchmarks (Duan et al., 2016). Details on network architectures and hyperparameters are described in Appendix D. For the more difficult MuJoCo environments (i.e., Ant, HalfCheetah, Humanoid), the number of training steps is set to 500,000, while for simpler ones (i.e., Pendulum, Hopper, Walker2D), it is set to 200,000. Policy performance is evaluated every 1000 training iterations, using a policy with no exploration. Each measurement is an average return over 10 episodes, each generated using a separate random seed. To smooth learning curves, data points are averaged over a sliding window of size 6. Similar to the setting of Lim et al. (2018), CAQL measurements are based on trajectories generated by the learned action function rather than the optimal action w.r.t. the Q-function.

Table 1 and Figure 1 show the average return of CAQL and the baselines under the best hyperparameter configurations. CAQL significantly outperforms NAF on most benchmarks, as well as DDPG, TD3, and SAC on 11 of 14 benchmarks. Of all the CAQL policies, those trained using MIP are among the best performers in low-dimensional benchmarks (e.g., Pendulum and Hopper). This verifies our conjecture about CAQL: Q-learning with the optimal Bellman residual (using MIP) performs better than with an approximation (using GA, CEM) when the Q-function has sufficient representation power (which is more likely in low-dimensional tasks). Moreover, CAQL-MIP policies have slightly lower variance than those trained with GA and CEM on most benchmarks.

Table 2 and Figure 2 show summary statistics of the returns of CAQL and the baselines on all 320 configurations (32 hyperparameter combinations × 10 random seeds) and illustrate the sensitivity of each method to hyperparameters.
| Env. [Action range] | CAQL-MIP | CAQL-GA | CAQL-CEM | NAF | DDPG | TD3 | SAC |
|---|---|---|---|---|---|---|---|
| Pendulum [-0.66, 0.66] | -780.5 ± 345.0 | -766.6 ± 344.2 | -784.7 ± 349.3 | -775.3 ± 353.4 | -855.2 ± 331.2 | -942.1 ± 308.3 | -1144.8 ± 195.3 |
| Pendulum [-1, 1] | -508.1 ± 383.2 | -509.7 ± 383.5 | -500.7 ± 382.5 | -529.5 ± 377.4 | -623.3 ± 395.2 | -730.3 ± 389.4 | -972.0 ± 345.5 |
| Pendulum [-2, 2] | -237.3 ± 487.2 | -250.6 ± 508.1 | -249.7 ± 488.5 | -257.4 ± 370.3 | -262.0 ± 452.6 | -361.3 ± 473.2 | -639.5 ± 472.7 |
| Hopper [-0.25, 0.25] | 292.7 ± 93.3 | 210.8 ± 125.3 | 196.9 ± 130.1 | 176.6 ± 109.1 | 178.8 ± 126.6 | 140.5 ± 106.5 | 225.0 ± 84.9 |
| Hopper [-0.5, 0.5] | 332.2 ± 119.7 | 222.2 ± 138.5 | 228.1 ± 135.7 | 192.8 ± 101.6 | 218.3 ± 129.6 | 200.2 ± 100.7 | 243.6 ± 81.6 |
| Hopper [-1, 1] | 352.2 ± 141.3 | 251.5 ± 153.6 | 242.3 ± 153.8 | 201.9 ± 126.2 | 248.0 ± 148.3 | 248.2 ± 124.4 | 263.6 ± 118.9 |
| Walker2D [-0.25, 0.25] | 247.6 ± 109.0 | 213.5 ± 111.3 | 206.7 ± 112.9 | 190.5 ± 117.5 | 209.9 ± 103.6 | 204.8 ± 113.3 | 224.5 ± 105.1 |
| Walker2D [-0.5, 0.5] | 213.1 ± 120.0 | 209.5 ± 112.5 | 209.3 ± 112.5 | 179.7 ± 100.9 | 210.8 ± 108.3 | 173.1 ± 101.1 | 220.9 ± 114.8 |
| HalfCheetah [-0.25, 0.25] | 340.9 ± 110.2 | 234.3 ± 136.5 | 240.4 ± 143.1 | 169.7 ± 123.7 | 228.9 ± 118.1 | 192.9 ± 136.8 | 260.7 ± 108.5 |
| HalfCheetah [-0.5, 0.5] | 395.1 ± 275.2 | 435.5 ± 273.7 | 377.5 ± 280.5 | 271.8 ± 226.9 | 273.8 ± 199.5 | 119.8 ± 139.6 | 378.3 ± 219.6 |
| Ant [-0.1, 0.1] | 319.4 ± 69.3 | 327.5 ± 67.5 | 295.5 ± 71.9 | 260.2 ± 53.1 | 298.4 ± 67.6 | 213.7 ± 40.8 | 205.9 ± 34.9 |
| Ant [-0.25, 0.25] | 362.3 ± 60.3 | 388.9 ± 63.9 | 392.9 ± 67.1 | 270.4 ± 72.5 | 381.9 ± 63.3 | 377.1 ± 93.5 | 314.3 ± 88.2 |
| Humanoid [-0.1, 0.1] | 326.6 ± 93.5 | 235.3 ± 165.4 | 227.7 ± 143.1 | 261.6 ± 154.1 | 259.0 ± 188.1 | 251.7 ± 127.5 | 377.5 ± 90.3 |
| Humanoid [-0.25, 0.25] | 267.0 ± 163.8 | 364.3 ± 215.9 | 309.4 ± 186.3 | 270.2 ± 124.6 | 347.3 ± 220.8 | 262.8 ± 109.2 | 381.0 ± 84.4 |

Table 2: The mean ± standard deviation of (95-percentile) final returns over all 320 configurations (32 hyperparameter combinations × 10 random seeds). CAQL policies are less sensitive to hyperparameters on 11/14 benchmarks.

[Figure 2 appears here: training-curve panels over all 320 configurations, covering the same environment/action-range pairs as Figure 1; x-axis: training steps (×1e3); methods: MIP, GA, CEM, NAF, DDPG, TD3, SAC.]

Figure 2: Mean cumulative reward over all 320 configurations (32 hyperparameter combinations × 10 random seeds). Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps. The training curves with standard deviation are in Figure 5 in Appendix E.

CAQL is least sensitive to hyperparameters in 11 of 14 tasks, and policies trained using MIP optimization, specifically, are best in 6 of 14 tasks. This corroborates the hypothesis that value-based methods are generally more robust to hyperparameters than their policy-based counterparts. Table 9 in Appendix E.1 compares the speed (in terms of average elapsed time) of the various max-Q solvers (MIP, GA, and CEM), with MIP clearly the most computationally intensive. We note that CAQL-MIP suffers from performance degradation in several high-dimensional environments with large action ranges (e.g., Ant [-0.25, 0.25] and Humanoid [-0.25, 0.25]).
In these experiments, its performance is even worse than that of CAQL-GA or CAQL-CEM. We speculate that this is because the small ReLU NN (32×16) does not have enough representation power to accurately model the Q-functions in these more complex tasks, so optimizing for the true max-Q value under an inaccurate function approximation impedes learning.

We also test CAQL using the standard MuJoCo 1000-step episode length, using gradient ascent as the optimizer, with the Q-function parameterized by a 200×100 feedforward ReLU network for Hopper and a 400×300 network for the remaining benchmarks. CAQL-GA is trained using dynamic tolerance and an action function, but without dual filtering or clustering. Figure 6 in Appendix E shows that CAQL-GA performs better than, or similarly to, the best of the baseline methods, except on Hopper [-0.25, 0.25]; SAC performed best in that setting, but suffers from very high performance variance.

**Ablation Analysis.** We now study the effects of using dynamic tolerance, dual filtering, and clustering on CAQL via two ablation analyses. For simplicity, we experiment on standard benchmarks (with full action ranges), and primarily test CAQL-GA using an ℓ2 loss. Default values for the tolerance and maximum iterations are 1e-6 and 200, respectively.

| Env. [Action range] | GA | GA + DF | GA + DF + C(0.25) | GA + DF + C(0.5) | Dual |
|---|---|---|---|---|---|
| Pendulum [-2, 2] | -144.6 ± 154.2 (R: 0.00%) | -146.1 ± 229.8 (R: 26.5%) | -146.7 ± 216.8 (R: 79.7%) | -149.9 ± 215.0 (R: 81.2%) | -175.1 ± 246.8 (R: 100%) |
| Hopper [-1, 1] | 414.9 ± 181.9 (R: 0.00%) | 424.8 ± 176.7 (R: 9.11%) | 396.3 ± 138.8 (R: 38.2%) | 371.2 ± 171.7 (R: 61.5%) | 270.2 ± 147.6 (R: 100%) |
| Walker2D [-1, 1] | 267.6 ± 107.3 (R: 0.00%) | 236.1 ± 99.6 (R: 12.4%) | 249.2 ± 125.1 (R: 33.6%) | 235.6 ± 108.1 (R: 50.8%) | 201.5 ± 136.0 (R: 100%) |
| HalfCheetah [-1, 1] | 849.0 ± 108.9 (R: 0.00%) | 737.3 ± 170.2 (R: 5.51%) | 649.5 ± 146.7 (R: 15.4%) | 445.4 ± 207.8 (R: 49.1%) | 406.6 ± 206.4 (R: 100%) |
| Ant [-0.5, 0.5] | 370.0 ± 106.5 (R: 0.00%) | 275.1 ± 92.1 (R: 3.19%) | 271.5 ± 94.3 (R: 14.3%) | 214.0 ± 75.4 (R: 31.1%) | 161.1 ± 54.7 (R: 100%) |
| Humanoid [-0.4, 0.4] | 702.7 ± 162.9 (R: 0.00%) | 513.9 ± 146.0 (R: 13.8%) | 458.2 ± 120.4 (R: 52.1%) | 387.7 ± 117.4 (R: 80.8%) | 333.3 ± 117.9 (R: 100%) |

Table 3: Ablation analysis on CAQL-GA with dual filtering and clustering, where both the mean ± standard deviation of (95-percentile) final returns and the average %-max-Q-reduction (in parentheses) are based on the best configuration. See Figure 7 in Appendix E for training curves.
Table 3 shows how reducing the number of max-Q problems using dual filtering and clustering affects the performance of CAQL. Dual filtering (DF) manages to reduce the number of max-Q problems (from 3.2% to 26.5% across different benchmarks) while maintaining performance similar to the unfiltered CAQL-GA. On top of dual filtering, we apply clustering (C) to the set of inconclusive next states $B'_{\text{df}}$, in which the degree of approximation is controlled by the cluster radius. With a small cluster radius (e.g., $b = 0.1$), clustering further reduces max-Q solves without significantly impacting training performance (and in some cases it actually improves performance), though further increasing the radius significantly degrades performance. To illustrate the full trade-off of max-Q reduction versus policy quality, we also include the Dual method, which eliminates all max-Q computation using the dual approximation.

| Env. [Action range] | GA + Tol(1e-6) | GA + Tol(100) | GA + DTol(100,1e-6) | GA + DTol(1,1e-6) | GA + DTol(0.1,1e-6) |
|---|---|---|---|---|---|
| Pendulum [-2, 2] | -144.5 ± 195.6 (# GA Itr: 200) | -158.1 ± 165.0 (# GA Itr: 1) | -144.1 ± 159.2 (# GA Itr: 45.4) | -143.7 ± 229.5 (# GA Itr: 58.1) | -144.2 ± 157.6 (# GA Itr: 69.5) |
| Hopper [-1, 1] | 371.4 ± 199.9 (# GA Itr: 200) | 360.4 ± 158.9 (# GA Itr: 1) | 441.4 ± 141.3 (# GA Itr: 46.0) | 460.0 ± 127.8 (# GA Itr: 59.5) | 452.5 ± 137.0 (# GA Itr: 78.4) |
| Walker2D [-1, 1] | 273.6 ± 112.4 (# GA Itr: 200) | 281.6 ± 121.2 (# GA Itr: 1) | 282.0 ± 104.5 (# GA Itr: 47.4) | 309.0 ± 118.8 (# GA Itr: 59.8) | 292.7 ± 113.8 (# GA Itr: 71.2) |
| HalfCheetah [-1, 1] | 837.5 ± 130.6 (# GA Itr: 200) | 729.3 ± 313.8 (# GA Itr: 1) | 896.6 ± 145.7 (# GA Itr: 91.2) | 864.4 ± 123.1 (# GA Itr: 113.5) | 894.0 ± 159.9 (# GA Itr: 140.7) |
| Ant [-0.5, 0.5] | 373.1 ± 118.5 (# GA Itr: 200) | 420.7 ± 148.8 (\*) (# GA Itr: 1) | 364.4 ± 111.3 (# GA Itr: 86.3) | 388.2 ± 110.7 (# GA Itr: 109.5) | 429.3 ± 139.3 (# GA Itr: 139.7) |
| Humanoid [-0.4, 0.4] | 689.8 ± 193.9 (# GA Itr: 200) | 500.2 ± 214.9 (# GA Itr: 1) | 716.2 ± 191.4 (# GA Itr: 88.6) | 689.5 ± 191.6 (# GA Itr: 115.9) | 710.9 ± 188.4 (# GA Itr: 133.4) |

Table 4: Ablation analysis on CAQL-GA with dynamic tolerance, where both the mean ± standard deviation of (95-percentile) final returns and the average number of GA iterations (in parentheses) are based on the best configuration. See Figure 9 in Appendix E for training curves. NOTE: In (\*) the performance drops significantly after hitting the peak, and the learning curve does not converge.

Table 4 shows how dynamic tolerance influences the quality of CAQL policies. Compared with the standard algorithm, GA with a large tolerance (τ = 100) achieves a notable speed-up in training (requiring only 1 step per max-Q optimization) but incurs a loss in performance. GA with dynamic tolerance attains the best of both worlds: it significantly reduces inner-maximization steps (from 29.5% to 77.3% across different problems and initial τ settings) while achieving good performance. Additionally, Table 5 shows the results of CAQL-MIP with dynamic tolerance (i.e., on the optimality gap). This method significantly reduces both the median and the variance of the MIP elapsed time, while achieving better performance. Dynamic tolerance eliminates the high latency in MIP observed in the early phase of training (see Figure 3).
| Env. [Action range] | MIP + Tol(1e-4) | MIP + DTol(1,1e-4) |
|---|---|---|
| HalfCheetah [-0.5, 0.5] | 718.6 ± 199.9 (Med(κ): 263.5, SD(κ): 88.269) | 764.5 ± 132.9 (Med(κ): 118.5, SD(κ): 75.616) |
| Ant [-0.1, 0.1] | 402.3 ± 27.4 (Med(κ): 80.7, SD(κ): 100.945) | 404.9 ± 27.7 (Med(κ): 40.3, SD(κ): 24.090) |
| Ant [-0.25, 0.25] | 413.1 ± 60.0 (Med(κ): 87.6, SD(κ): 160.921) | 424.9 ± 60.9 (Med(κ): 62.0, SD(κ): 27.646) |
| Humanoid [-0.1, 0.1] | 405.7 ± 112.5 (Med(κ): 145.7, SD(κ): 27.381) | 475.0 ± 173.4 (Med(κ): 29.1, SD(κ): 10.508) |
| Humanoid [-0.25, 0.25] | 460.2 ± 143.2 (Med(κ): 71.2, SD(κ): 45.763) | 410.1 ± 174.4 (Med(κ): 39.7, SD(κ): 11.088) |

Table 5: Ablation analysis on CAQL-MIP with dynamic tolerance, where both the mean ± standard deviation of (95-percentile) final returns and the (median, standard deviation) of the elapsed time κ (in msec) are based on the best configuration. See Figure 11 in Appendix E for training curves.

## 6 CONCLUSIONS AND FUTURE WORK

We proposed Continuous Action Q-learning (CAQL), a general framework for handling continuous actions in value-based RL, in which the Q-function is parameterized by a neural network. While generic nonlinear optimizers can be naturally integrated with CAQL, we illustrated how the inner maximization of Q-learning can be formulated as a mixed-integer program when the Q-function is parameterized with a ReLU network. CAQL (with action function learning) is a general Q-learning framework that includes many existing value-based methods such as QT-Opt and actor-expert. Using several benchmarks with varying degrees of action constraint, we showed that the policy learned by CAQL-MIP generally outperforms those learned by CAQL-GA and CAQL-CEM, and that CAQL is competitive with several state-of-the-art policy-based RL algorithms, often outperforming them (and being more robust) in heavily-constrained environments. Future work includes: extending CAQL to the full batch learning setting, in which the optimal Q-function is trained using only offline data; speeding up the MIP computation of the max-Q problem to make CAQL more scalable; and applying CAQL to real-world RL problems.

## REFERENCES

B. Amos, L. Xu, and Z. Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 146-155. PMLR, 2017.

R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma. Strong mixed-integer programming formulations for trained neural networks. arXiv preprint arXiv:1811.01988, 2019.

C. Boutilier, A. Cohen, A. Hassidim, Y. Mansour, O. Meshi, M. Mladenov, and D. Schuurmans. Planning and learning with stochastic action sets. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, pp. 4674-4682. AAAI Press, 2018.

R. Bunel, I. Turkaslan, P. Torr, P. Kohli, and M. P. Kumar. A unified view of piecewise linear neural network verification. In Advances in Neural Information Processing Systems 31, pp. 4790-4799, 2018.

S. Carden. Convergence of a Q-learning variant for continuous states and actions. Journal of Artificial Intelligence Research, 49:705-731, 2014.

C.-H. Cheng, G. Nührenberg, and H. Ruess. Maximum resilience of artificial neural networks. In International Symposium on Automated Technology for Verification and Analysis, pp. 251-268. Springer, 2017.

Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8092-8101. Curran Associates, Inc., 2018.

IBM ILOG CPLEX. V12.1: User's manual for CPLEX, 2019. URL https://www.ibm.com/products/ilog-cplex-optimization-studio.
Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329-1338, 2016.

G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

G. Dulac-Arnold, D. Mankowitz, and T. Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

S. Dutta, S. Jha, S. Sankaranarayanan, and A. Tiwari. Output range analysis for deep feedforward neural networks. In NASA Formal Methods Symposium, pp. 121-138. Springer, 2018.

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1467-1476, Stockholm, Sweden, 2018. PMLR.

M. Fischetti and J. Jo. Deep neural networks and mixed integer linear optimization. Constraints, 2018.

S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1587-1596, Stockholm, Sweden, 2018. PMLR.

C. Gaskett, D. Wettergreen, and A. Zelinsky. Q-learning in continuous state and action spaces. In Proceedings of the 12th Australian Joint Conference on Artificial Intelligence, pp. 417-428. Springer, 1999.

J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye. Horizon: Facebook's open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.

A. Gleixner, M. Bastubbe, L. Eifler, T. Gally, G. Gamrath, R. L. Gottwald, G. Hendel, C. Hojny, T. Koch, M. E. Lübbecke, S. J. Maher, M. Miltenberger, B. Müller, M. E. Pfetsch, C. Puchert, D. Rehfeldt, F. Schlösser, C. Schubert, F. Serrano, Y. Shinano, J. M. Viernickel, M. Walter, F. Wegscheider, J. T. Witt, and J. Witzig. The SCIP Optimization Suite 6.0. Technical report, Optimization Online, July 2018. URL http://www.optimization-online.org/DB_HTML/2018/07/6692.html.

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 2829-2838, New York, NY, USA, 2016. PMLR.

Gurobi. Gurobi optimizer reference manual, 2019. URL http://www.gurobi.com.

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1861-1870, Stockholm, Sweden, 2018. PMLR.

H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094-2100. AAAI Press, 2016.

L. Baird III and A. Klopf. Reinforcement learning with high-dimensional, continuous actions. Technical report, Wright Lab, Wright-Patterson AFB, OH, 1993.

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, volume 87, pp. 651-673. PMLR, 2018.
G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97-117. Springer, 2017.

A. Kumar, T. Serra, and S. Ramalingam. Equivalent and approximate transformations of deep neural networks. arXiv preprint arXiv:1905.11428, 2019.

A. Lazaric, M. Restelli, and A. Bonarini. Reinforcement learning in continuous action spaces through sequential Monte Carlo methods. In Advances in Neural Information Processing Systems, pp. 833-840, 2008.

S. Lim, A. Joseph, L. Le, Y. Pan, and M. White. Actor-Expert: A framework for using action-value methods in continuous action spaces. arXiv preprint arXiv:1810.09103, 2018.

L. Lin and T. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical report, Pittsburgh, PA, USA, 1992.

J. Linderoth and M. Savelsbergh. A computational study of search strategies for mixed integer programming. INFORMS Journal on Computing, 11(2):173-187, 1999.

A. Lomuscio and L. Maganti. An approach to reachability analysis for feed-forward ReLU neural networks. arXiv preprint arXiv:1706.07351, 2017.

S. Mannor, R. Rubinstein, and Y. Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning, ICML'03, pp. 512-519. AAAI Press, 2003.

F. Melo and M. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pp. 308-322. Springer, 2007.

A. Meyerson. Online facility location. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 426-431. IEEE, 2001.

P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583-601, 2002.

J. Millán, D. Posenato, and E. Dedieu. Continuous-action Q-learning. Machine Learning, 49(2-3):247-265, 2002.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

V. Mnih, A. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

G. Neu, A. Jonsson, and V. Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006.

J. Nocedal and Y. Yuan. Combining trust region and line search techniques. In Advances in Nonlinear Programming, pp. 153-175. Springer, 1998.

J. Peters and S. Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2219-2225. IEEE, 2006.

M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. CoRR, abs/1802.10264, 2018. URL http://arxiv.org/abs/1802.10264.

R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology And Computing In Applied Probability, 1(2):127-190, 1999.

B. Say, G. Wu, Y. Q. Zhou, and S. Sanner. Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI'17, pp. 750-756, 2017.
T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pp. 4565-4573, 2018.

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pp. I-387-I-395. JMLR.org, 2014.

W. Smart and L. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML'00, pp. 903-910, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

V. Tjeng, K. Xiao, and R. Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019.

W. Uther and M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of AAAI'98, pp. 769-774, 1998.

C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, D. Boning, I. Dhillon, and L. Daniel. Towards fast computation of certified robustness for ReLU networks. arXiv preprint arXiv:1804.09699, 2018.

E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

## A HINGE Q-LEARNING

Consider an MDP with states $X$, actions $A$, transition probability function $P$, discount factor $\gamma \in [0, 1)$, reward function $R$, and initial state distribution $\beta$. We want to find an optimal Q-function by solving the following optimization problem:

$$\min_Q \sum_{x \in X, a \in A} p(x, a) Q(x, a) \quad \text{s.t.} \quad Q(x, a) \ge R(x, a) + \gamma \sum_{x' \in X} P(x'|x, a) \max_{a' \in A} Q(x', a'), \quad \forall x \in X,\ a \in A. \tag{5}$$

The formulation is based on the LP formulation of the MDP (see Puterman (2014) for more details). Here the distribution $p(x, a)$ is given by the data-generating distribution of the replay buffer $B$. (We assume that the replay buffer is large enough that it contains experience from almost all state-action pairs.) It is well known that one can transform the above constrained optimization problem into an unconstrained one by applying a penalty-based approach to the constraints. For simplicity, we stick with a single constant penalty parameter $\lambda \ge 0$ (instead of a state-action Lagrange multiplier over which one would maximize), and a hinge penalty function $(\cdot)_+$. With a given penalty hyper-parameter $\lambda \ge 0$ (that can be separately optimized), we propose finding the optimal Q-function by solving the following optimization problem:

$$\min_Q \sum_{x \in X, a \in A} p(x, a) \bigg[ Q(x, a) + \lambda \Big( R(x, a) + \gamma \sum_{x' \in X} P(x'|x, a) \max_{a' \in A} Q(x', a') - Q(x, a) \Big)_+ \bigg]. \tag{6}$$

Furthermore, recall that in many off-policy and offline RL algorithms (such as DQN), samples of the form $\{(x_i, a_i, r_i, x'_i)\}_{i=1}^{|B|}$ are drawn independently from the replay buffer, and instead of optimizing the original objective function, one optimizes its unbiased sample average approximation (SAA). However, judging from the objective function of problem (6), finding an unbiased SAA for this problem might be challenging, due to the non-linearity of the hinge penalty function $(\cdot)_+$.
Therefore, we alternatively study the following unconstrained optimization problem:

$$\min_Q \sum_{x \in X, a \in A} p(x, a) \bigg[ Q(x, a) + \lambda \sum_{x' \in X} P(x'|x, a) \Big( R(x, a) + \gamma \max_{a' \in A} Q(x', a') - Q(x, a) \Big)_+ \bigg]. \tag{7}$$

Using Jensen's inequality for convex functions, one can see that the objective function in (7) is an upper bound on that in (6). Equality in Jensen's inequality holds when the transition function is deterministic. (This is similar to the argument for the PCL algorithm.) Jensen's inequality thus justifies optimization problem (7) as a legitimate upper-bound surrogate for problem (6). Recall that $p(x, a)$ is the data-generating distribution of the replay buffer $B$. The unbiased SAA of problem (7) is therefore given by

$$\min_Q \frac{1}{N} \sum_{i=1}^{N} \bigg[ Q(x_i, a_i) + \lambda \Big( r_i + \gamma \max_{a' \in A} Q(x'_i, a') - Q(x_i, a_i) \Big)_+ \bigg], \tag{8}$$

where $\{(x_i, a_i, r_i, x'_i)\}_{i=1}^{N}$ are the $N$ samples drawn independently from the replay buffer. In the following, we find the optimal Q-function by solving this SAA problem. In general, when the state and action spaces are large/uncountable, instead of solving for the Q-function exactly (as in the tabular case), we approximate the Q-function with a parameterized form $Q_\theta$ and optimize the weights $\theta$ (instead of $Q$) in problem (8).
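The SAA objective (8) is simple to state in code. Below is a minimal sketch, assuming a generic `q_fn(x, a, theta)` and a `max_q` oracle (e.g., one of the Section 3.1 optimizers); these names are illustrative.

```python
def hinge_q_loss(batch, q_fn, max_q, gamma, lam, theta):
    """Sample-average hinge Q-learning objective of Eq. (8)."""
    loss = 0.0
    for x, a, r, xp in batch:
        q_sa = q_fn(x, a, theta)
        td = r + gamma * max_q(xp, theta) - q_sa   # constraint violation
        loss += q_sa + lam * max(td, 0.0)          # hinge penalty (.)_+
    return loss / len(batch)
```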
B CONTINUOUS ACTION Q-LEARNING ALGORITHM

Algorithm 1 Continuous Action Q-learning (CAQL)
1: Define the maximum number of training epochs $T$, episode length $L$, and training steps per epoch $S$
2: Initialize the Q-function parameters $\theta$, target Q-function parameters $\theta_{\text{target}}$, and action function parameters $w$
3: Choose a max-Q solver OPT $\in$ {MIP, CEM, GA} (Section 3)
4: Select options for max-Q speed-up (Section 4) via the Boolean flags (i) DTol: dynamic tolerance, (ii) DF: dual filtering, (iii) C: clustering; denote by OPT(DTol) the final max-Q solver, i.e., the base solver combined with the dynamic-tolerance update rule (if used)
5: Initialize the replay buffer $R$ of states, actions, next states, and rewards
6: for $t = 1, \ldots, T$ do
7:   Sample an initial state $x_0$ from the initial distribution
8:   for $\ell = 0, \ldots, L-1$ do  (online data collection)
9:     Select action $a_\ell = \mathrm{clip}(\pi_w(x_\ell) + N(0,\sigma),\, l,\, u)$
10:    Execute $a_\ell$ and observe reward $r_\ell$ and next state $x_{\ell+1}$
11:    Store the transition $(x_\ell, a_\ell, r_\ell, x_{\ell+1})$ in $R$
12:  for $s = 1, \ldots, S$ do  (CAQL training; $S = 20$ by default)
13:    Sample a random minibatch $B$ of $|B|$ transitions $\{(x_i,a_i,r_i,x'_i)\}_{i=1}^{|B|}$ from $R$
14:    Initialize the refined batches $B_{\text{df}} \leftarrow B$ and $B_{\text{c}} \leftarrow B$;
       (i) if DF is True: $B_{\text{df}} \leftarrow \{(x,a,r,x') \in B : \tilde q_{x',\theta_{\text{target}}} > (Q_\theta(x,a) - r)/\gamma\}$
       (ii) if C is True: $B_{\text{c}} \leftarrow \{(x,a,r,x') \in B : x' \in C_t(b)\}$, where $C_t(b)$ is the set of cluster centroids of next states in the minibatch (Section 4)
15:    For each $(x_i,a_i,r_i,x'_i) \in B_{\text{df}} \cap B_{\text{c}}$, compute the optimal next action using OPT(DTol): $a'_i \leftarrow \arg\max_{a'} Q_\theta(x'_i, a')$
16:    and the corresponding TD targets: $q_i = r_i + \gamma Q_{\theta_{\text{target}}}(x'_i, a'_i)$
17:    For each $(x_i,a_i,r_i,x'_i) \in B \setminus (B_{\text{c}} \cap B_{\text{df}})$, compute the approximate TD target:
       (i) if C is True: for $(x_i,a_i,r_i,x'_i) \in B_{\text{df}} \setminus B_{\text{c}}$, set $q_i \leftarrow r_i + \gamma \hat q_{x'_i,\theta_{\text{target}}}$, where $\hat q_{x'_i,\theta_{\text{target}}} = q^c_{i,\theta_{\text{target}}} + \langle \nabla_{x'} Q_{\theta_{\text{target}}}(x',a')\,|_{x'=c'_i,\,a'=a^c_i},\; x'_i - c'_i \rangle$ is a first-order Taylor expansion around the centroid $c'_i \in C_t(b)$ closest to $x'_i$, with $q^c_{i,\theta_{\text{target}}}$ the max-Q value at the centroid and $a^c_i$ its maximizing action
       (ii) if DF is True: for $(x_i,a_i,r_i,x'_i) \in B \setminus B_{\text{df}}$, set $q_i \leftarrow r_i + \gamma \tilde q_{x'_i,\theta_{\text{target}}}$
18:    Update the Q-function parameters: $\theta \leftarrow \arg\min_\theta \frac{1}{|B|}\sum_{i=1}^{|B|} (Q_\theta(x_i,a_i) - q_i)^2$
19:    Update the action function parameters: $w \leftarrow \arg\min_w \frac{1}{|B|}\sum_{i=1}^{|B|} (Q_\theta(x'_i,a'_i) - Q_\theta(x'_i,\pi_w(x'_i)))^2$
20:    Update the target Q-function parameters: $\theta_{\text{target}} \leftarrow \tau\theta + (1-\tau)\theta_{\text{target}}$
21:    Decay the Gaussian noise: $\sigma \leftarrow \lambda\sigma$, $\lambda \in [0,1]$

C DETAILS OF DUAL FILTERING

Recall that the Q-function network has nonlinear activation functions, which can be viewed as nonlinear equality constraints in formulation (3). To tackle these constraints, Wong & Kolter (2017) proposed a convex relaxation of the ReLU non-linearity. First, assume that for a given $x' \in X$ and $a' \in B_p(\bar a, \Delta)$ with $z_1 = (x', a')$, there is a collection of component-wise pre-activation bounds $(l_j, u_j)$, $j = 2,\ldots,K-1$, such that $l_j \le \hat z_j \le u_j$. (Here $B_p(\bar a, \Delta)$ denotes the $\ell_p$-ball of radius $\Delta$ centered at $\bar a$ that contains the admissible actions.) As long as these bounds are valid, and hence redundant, adding them to the primal problem $q_{x'}$ does not affect its optimal value. Second, the ReLU equality constraint is relaxed using a convex outer-approximation. In particular, for a scalar input $h$ within the real interval $[l,u]$, the exact ReLU non-linearity acting on $h$ is captured by the set $H(l,u) := \{(h,k) \in \mathbb{R}^2 : h \in [l,u],\, k = [h]_+\}$. Its convex outer-approximation is given by:

$$\widetilde H(l,u) := \{(h,k) \in \mathbb{R}^2 : k \ge h,\; k \ge 0,\; -u\,h + (u-l)\,k \le -u\,l\}. \tag{9}$$

Analogously to (3), define the relaxed NN equations as:

$$z_1 = (x', a'), \quad a' \in B_p(\bar a, \Delta) \tag{10a}$$
$$\hat z_j = W_{j-1} z_{j-1} + b_{j-1}, \quad j = 2,\ldots,K \tag{10b}$$
$$(\hat z_j, z_j) \in \widetilde H(l_j, u_j), \quad j = 2,\ldots,K-1, \tag{10c}$$

where (10c) is understood component-wise across layer $j$ for each $j \in \{2,\ldots,K-1\}$, i.e., $(\hat z_j(s), z_j(s)) \in \widetilde H(l_j(s), u_j(s))$, $s = 1,\ldots,n_j$, with $n_j$ the dimension of hidden layer $j$. Using the relaxed NN equations, we now propose the following relaxed (convex) verification problem:

$$\bar q_{x'} := \max_{\hat z,\, z} \;\; c^\top \hat z_K + \delta_{B_p(\bar a,\Delta)}(a') + \sum_{j=2}^{K-1} \delta_{\widetilde H(l_j,u_j)}(\hat z_j, z_j) \tag{11a}$$
$$\text{s.t.} \quad \hat z_j = W_{j-1} z_{j-1} + b_{j-1}, \quad j = 2,\ldots,K, \tag{11b}$$

where $\delta_\Lambda(\cdot)$ is the (concave) indicator function of the set $\Lambda$, i.e., $\delta_\Lambda(x) = 0$ if $x \in \Lambda$ and $-\infty$ otherwise. Note that the indicator of the vector-ReLU in the objective is again understood component-wise, i.e., $\delta_{\widetilde H(l_j,u_j)}(\hat z_j, z_j) = \sum_{s=1}^{n_j} \delta_{\widetilde H(l_j(s),u_j(s))}(\hat z_j(s), z_j(s))$, $j = 2,\ldots,K-1$. The optimal value $\bar q_{x'}$ of the relaxed problem is an upper bound on the optimal value $q_{x'}$ of the original problem. The resulting certificate is the following: if $\bar q_{x'} \le (Q_\theta(x,a) - r)/\gamma$, then the sample $(x,a,x')$ is discarded from the inner maximization. However, if $\bar q_{x'} > (Q_\theta(x,a) - r)/\gamma$, the sample may or may not contribute to the TD error in the hinge loss function.
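As a concrete illustration of (9), the sketch below builds the three halfspaces of the outer-approximation $\widetilde H(l,u)$ for an unstable unit ($l < 0 < u$) and checks that exact ReLU input-output pairs satisfy them. The function names are ours, chosen for illustration.

```python
import numpy as np

def relu_envelope(l, u):
    """Halfspace description of the convex outer-approximation H~(l, u) of
    the ReLU graph {(h, [h]_+) : h in [l, u]}, for the case l < 0 < u.
    Returns (A, b) such that H~ = {(h, k) : A @ [h, k] <= b}."""
    assert l < 0 < u
    A = np.array([
        [1.0, -1.0],   #  h - k <= 0                (k >= h)
        [0.0, -1.0],   #     -k <= 0                (k >= 0)
        [-u, u - l],   # -u h + (u - l) k <= -u l   (below the chord)
    ])
    b = np.array([0.0, 0.0, -u * l])
    return A, b

def in_envelope(h, k, l, u, tol=1e-9):
    A, b = relu_envelope(l, u)
    return bool(np.all(A @ np.array([h, k]) <= b + tol))

# Exact ReLU points (h, max(h, 0)) always lie inside the envelope:
assert in_envelope(-0.3, 0.0, l=-1.0, u=2.0)
assert in_envelope(0.5, 0.5, l=-1.0, u=2.0)
```

The third inequality is the chord through $(l,0)$ and $(u,u)$; the exact graph of the ReLU over $[l,u]$ lies inside this triangle.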
To further speed up the computation of the verification problem, we examine the dual variables of problem (11) and, in the next subsection, derive a numerically efficient but sub-optimal upper-bound estimate $\tilde q_{x'} \ge \bar q_{x'}$. The verification criterion for discarding a sample $(x,a,x')$ from the inner maximization is then to check whether the following inequality holds: $\tilde q_{x'} \le (Q_\theta(x,a) - r)/\gamma$.

C.1 SUB-OPTIMAL SOLUTION TO THE RELAXED PROBLEM

In this section we detail the sub-optimal dual solution to the relaxed problem (11), following Wong & Kolter (2017); since any dual-feasible point of a maximization problem yields an upper bound, sub-optimality costs tightness but not validity. Let $\nu_j$, $j = 2,\ldots,K$, denote the dual variables of the linear equality constraints in problem (11). The Lagrangian of the relaxed problem is

$$L(\hat z, z, \nu) = c^\top \hat z_K - \nu_K^\top \hat z_K + \delta_{B_p(\bar a,\Delta)}(a') + \nu_2^\top W_1 z_1 + \sum_{j=2}^{K-1}\Big[ \delta_{\widetilde H(l_j,u_j)}(\hat z_j, z_j) - \nu_j^\top \hat z_j + \nu_{j+1}^\top W_j z_j \Big] + \sum_{j=1}^{K-1} \nu_{j+1}^\top b_j. \tag{12}$$

Define $\hat\nu_j := W_j^\top \nu_{j+1}$ for $j = 1,\ldots,K-1$, and split the first layer as $\hat\nu_1^x = (W_1^x)^\top \nu_2$ and $\hat\nu_1^a = (W_1^a)^\top \nu_2$, where $W_1^x$ and $W_1^a$ are the columns of $W_1$ acting on the state and action inputs, respectively. Given the decoupled structure of $L$ in $(\hat z_j, z_j)$, maximizing $L$ w.r.t. $(\hat z, z)$ yields the following dual function:

$$g(\nu) = \begin{cases} \delta^*_{B_p(\bar a,\Delta)}(\hat\nu_1^a) + (\hat\nu_1^x)^\top x' + \sum_{j=2}^{K-1} \delta^*_{\widetilde H(l_j,u_j)}(-\nu_j, \hat\nu_j) + \sum_{j=1}^{K-1} \nu_{j+1}^\top b_j & \text{if } \nu_K = c, \\ +\infty & \text{otherwise.} \end{cases} \tag{13}$$

Recall that for a real vector space $\mathcal X \subseteq \mathbb{R}^n$ with dual space $\mathcal X^*$ and standard pairing $\langle\cdot,\cdot\rangle : \mathcal X \times \mathcal X^* \to \mathbb{R}$, the convex conjugate of a function $f : \mathcal X \to \mathbb{R}\cup\{-\infty,\infty\}$ is $f^*(y) = -\inf_{x\in\mathcal X}(f(x) - \langle y, x\rangle) = \sup_{x\in\mathcal X}(\langle y, x\rangle - f(x))$, $y \in \mathcal X^*$; for a set indicator, $\delta^*_\Lambda$ is the support function of $\Lambda$. The conjugate of the vector-ReLU indicator above therefore takes the following component-wise structure:

$$\delta^*_{\widetilde H(l_j,u_j)}(-\nu_j, \hat\nu_j) = \sum_{s=1}^{n_j} \delta^*_{\widetilde H(l_j(s),u_j(s))}(-\nu_j(s), \hat\nu_j(s)), \quad j = 2,\ldots,K-1. \tag{14}$$

For the action ball, $\delta^*_{B_p(\bar a,\Delta)}(\hat\nu_1^a) = \bar a^\top \hat\nu_1^a + \Delta\,\|\hat\nu_1^a\|_q$, where $\|\cdot\|_q$ is the $\ell_p$-dual norm defined by the identity $1/p + 1/q = 1$. To compute the convex conjugate of the ReLU relaxation, we analyze the scalar definition in (9). Specifically, we characterize $\delta^*_{\widetilde H(l,u)}(p,q)$, defined by the scalar bounds $(l,u)$, for a dual vector $(p,q) \in \mathbb{R}^2$. There are three possible cases.

Case I, $l < u \le 0$:
$$\delta^*_{\widetilde H(l,u)}(p,q) = \begin{cases} p\,u & \text{if } p > 0 \\ p\,l & \text{if } p < 0 \\ 0 & \text{if } p = 0. \end{cases} \tag{15}$$

Case II, $0 \le l < u$:
$$\delta^*_{\widetilde H(l,u)}(p,q) = \begin{cases} (p+q)\,u & \text{if } p + q > 0 \\ (p+q)\,l & \text{if } p + q < 0 \\ 0 & \text{if } p + q = 0. \end{cases} \tag{16}$$

Case III, $l < 0 < u$: here the supremum is attained either on the chord $-u h + (u-l)k = -u l$ or at the origin. Thus

$$\delta^*_{\widetilde H(l,u)}(p,q) = \left[\,\sup_{h\in[l,u]} \Big( p\,h + \frac{u}{u-l}(h-l)\,q \Big)\right]_+ = \begin{cases} \big[(p+q)\,u\big]_+ & \text{if } p + q\,\frac{u}{u-l} \ge 0 \\ \big[p\,l\big]_+ & \text{if } p + q\,\frac{u}{u-l} < 0. \end{cases} \tag{17}$$
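The case analysis in (15)-(17) reduces to evaluating a support function over a segment or a triangle. The scalar sketch below (our own illustrative code, not taken from any reference implementation) makes this explicit; note that the Case III condition $p + qu/(u-l) \ge 0$ is equivalent to $(p+q)u \ge pl$, so the function always returns the maximum of the linear objective over the vertices of $\widetilde H(l,u)$.

```python
def relu_relax_support(p, q, l, u):
    """Support function of the ReLU outer-approximation H~(l, u) at (p, q),
    i.e. sup of p*h + q*k over (h, k) in H~(l, u), per cases (15)-(17).
    The scalars l < u are the pre-activation bounds of a single unit."""
    if u <= 0:
        # Case I: the unit is always inactive, so k = 0 and h in [l, u].
        return max(p * u, p * l)
    if l >= 0:
        # Case II: the unit is always active, so k = h with h in [l, u].
        return max((p + q) * u, (p + q) * l)
    # Case III: l < 0 < u. H~ is the triangle with vertices (l, 0),
    # (u, u), (0, 0); a linear function attains its sup at a vertex.
    if p + q * u / (u - l) >= 0:   # equivalent to (p + q) * u >= p * l
        return max((p + q) * u, 0.0)
    return max(p * l, 0.0)
```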
Applying these results in the context of equation (14), we choose the Lagrange multipliers case by case.

Case I, $l_j(s) < u_j(s) \le 0$: since $z_j(s) = 0$ regardless of the value of $\hat z_j(s)$, one can simply remove these variables from problems (10) and (11) by eliminating the $s$-th row of $W_{j-1}$ and $b_{j-1}$ and the $s$-th column of $W_j$. Equivalently, from (15), one can remove their contribution to (13) by setting $\nu_j(s) = 0$.

Case II, $0 \le l_j(s) < u_j(s)$: in this case, the ReLU non-linearity on $(\hat z_j(s), z_j(s))$ in problems (10) and (11) may be replaced with the linear equality constraint $z_j(s) = \hat z_j(s)$, with associated dual variable $\mu$. Within the Lagrangian, this replaces the term $\delta_{\widetilde H(l_j(s),u_j(s))}(\hat z_j(s), z_j(s)) - \nu_j(s)\hat z_j(s) + \hat\nu_j(s) z_j(s)$ with $\mu(z_j(s) - \hat z_j(s)) - \nu_j(s)\hat z_j(s) + \hat\nu_j(s) z_j(s)$. Maximizing this over $(\hat z_j(s), z_j(s))$ yields a finite value (namely 0) only if $\nu_j(s) = \hat\nu_j(s) = -\mu$. Equivalently, from (16), we set $\nu_j(s) = \hat\nu_j(s)$.

Case III, $l_j(s) < 0 < u_j(s)$: in this non-trivial case, notice that through $\hat\nu$ the dual function $g(\nu)$ is not decoupled across layers. To obtain a sub-optimal but analytical solution to the dual optimization, we optimize each term of the layer-wise sum in (13) independently. Evaluating (17) at $(p,q) = (-\nu_j(s), \hat\nu_j(s))$, the term is minimized by choosing $\nu_j(s)$ on the boundary between the two sub-cases of (17), i.e., by setting

$$\nu_j(s) = \frac{u_j(s)}{u_j(s) - l_j(s)}\,\hat\nu_j(s),$$

which yields the contribution $-l_j(s)\,[\nu_j(s)]_+ \ge 0$ to the bound.

Combining the preceding analysis, we can now compute the dual solution to problem (11). Let $I^-_j := \{s \in \mathbb{N}_{n_j} : u_j(s) \le 0\}$, $I^+_j := \{s \in \mathbb{N}_{n_j} : l_j(s) \ge 0\}$, and $I_j := \mathbb{N}_{n_j} \setminus (I^-_j \cup I^+_j)$. Using the above case studies, a sub-optimal (upper-bound) dual value for the primal problem (11) is given by

$$\tilde q_{x'} := \bar a^\top \hat\nu_1^a + \Delta\,\|\hat\nu_1^a\|_q + (\hat\nu_1^x)^\top x' - \sum_{j=2}^{K-1}\sum_{s\in I_j} l_j(s)\,[\nu_j(s)]_+ + \sum_{j=1}^{K-1} \nu_{j+1}^\top b_j, \tag{18}$$

where $\nu$ is defined by the following recursion, termed the "dual network":

$$\nu_K := c, \qquad \hat\nu_j := W_j^\top \nu_{j+1}, \;\; j = 1,\ldots,K-1, \qquad \nu_j := D_j \hat\nu_j, \;\; j = 2,\ldots,K-1, \tag{19}$$

and $D_j$ is a diagonal matrix with

$$[D_j](s,s) = \begin{cases} 0 & \text{if } s \in I^-_j \\ 1 & \text{if } s \in I^+_j \\ \frac{u_j(s)}{u_j(s)-l_j(s)} & \text{if } s \in I_j. \end{cases} \tag{20}$$

C.2 COMPUTING PRE-ACTIVATION BOUNDS

For $k \in \{3,\ldots,K-1\}$, define the $k$-partial NN as the set of equations:

$$z_1 = (x', a'), \qquad \hat z_j = W_{j-1} z_{j-1} + b_{j-1}, \;\; j = 2,\ldots,k, \qquad z_j = h(\hat z_j), \;\; j = 2,\ldots,k-1. \tag{21}$$

Finding the lower bound $l_k(s)$ for $\hat z_k(s)$ involves solving the following problem:

$$\min_{\hat z,\, z} \;\; e_s^\top \hat z_k \tag{22}$$
$$\text{s.t.} \quad z_1 = (x', a'), \;\; a' \in B_p(\bar a, \Delta), \;\; \text{eq. (21)}, \tag{23}$$

where $e_s$ is a one-hot vector with its non-zero element in the $s$-th entry, for $s \in \{1,\ldots,n_k\}$. Similarly, we obtain $u_k(s)$ by maximizing the same objective. Assuming we are given bounds $\{l_j, u_j\}_{j=2}^{k-1}$, we can employ the same convex relaxation technique and approximate dual solution as for the verification problem (since we are simply optimizing a linear function of the output of the first $k$ layers of the NN). Doing this recursively allows us to compute the bounds $\{l_j, u_j\}$ for $j = 3,\ldots,K-1$. The recursion is given in Algorithm 1 of Wong & Kolter (2017) and is based on the matrix form of the recursion in (19), with $c$ replaced by $I$ and $-I$, so that the quantity in (18) becomes vector-valued.
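To make the dual network concrete, here is a small NumPy sketch of the bound (18) computed via the recursion (19)-(20). It is an illustration under our own conventions, not the paper's code: `Ws[j-1]` and `bs[j-1]` define layer $j{+}1$ (so $\hat z_{j+1} = W_j z_j + b_j$), `bounds[j]` holds the pre-activation bounds $(l_j, u_j)$ of hidden layer $j$ (computed, e.g., by the recursion of Appendix C.2), and the action set is the $\ell_\infty$-ball $B_\infty(\bar a, \Delta)$, whose dual norm is $\ell_1$.

```python
import numpy as np

def dual_bound(Ws, bs, c, x, a_bar, delta, bounds):
    """Sub-optimal dual upper bound (18) via the dual network (19)-(20).

    Ws, bs:  lists of weight matrices / bias vectors, layers 1..K-1
    c:       output cost vector (for a scalar Q-network, np.ones(1))
    x:       fixed next state x'; a_bar, delta: action-ball center/radius
    bounds:  dict {j: (l_j, u_j)} of pre-activation bounds, j = 2..K-1
    """
    K = len(Ws) + 1                      # layer count; the input is layer 1
    nu = np.asarray(c, dtype=float)      # nu_K := c
    bias_term = 0.0
    relu_term = 0.0
    for j in range(K - 1, 1, -1):        # backward pass over hidden layers
        bias_term += bs[j - 1] @ nu      # accumulate nu_{j+1}^T b_j
        nu_hat = Ws[j - 1].T @ nu        # nuhat_j := W_j^T nu_{j+1}
        l, u = bounds[j]
        d = np.zeros_like(nu_hat)        # diagonal of D_j, eq. (20)
        d[l >= 0] = 1.0                  # I+_j : always-active units
        mixed = (l < 0) & (u > 0)        # I_j  : unstable units
        d[mixed] = u[mixed] / (u[mixed] - l[mixed])
        nu = d * nu_hat                  # nu_j := D_j nuhat_j
        relu_term -= l[mixed] @ np.maximum(nu[mixed], 0.0)
    bias_term += bs[0] @ nu              # nu_2^T b_1
    nu_hat1 = Ws[0].T @ nu               # nuhat_1 = W_1^T nu_2
    nu_x, nu_a = nu_hat1[:len(x)], nu_hat1[len(x):]   # state/action split
    return (nu_a @ a_bar + delta * np.abs(nu_a).sum()  # support of B_inf
            + nu_x @ x + relu_term + bias_term)
```

This single backward pass costs roughly as much as one backpropagation through the Q-network, which is what makes dual filtering cheap relative to solving the max-Q problem exactly.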
D EXPERIMENTAL DETAILS

Environment  | State dimension | Action dimension | Action ranges
Pendulum     | 3               | 1                | [-2, 2], [-1, 1], [-0.66, 0.66]
Hopper       | 11              | 3                | [-1, 1], [-0.5, 0.5], [-0.25, 0.25]
Walker2D     | 17              | 6                | [-1, 1], [-0.5, 0.5], [-0.25, 0.25]
HalfCheetah  | 17              | 6                | [-1, 1], [-0.5, 0.5], [-0.25, 0.25]
Ant          | 111             | 8                | [-1, 1], [-0.5, 0.5], [-0.25, 0.25], [-0.1, 0.1]
Humanoid     | 376             | 17               | [-0.4, 0.4], [-0.25, 0.25], [-0.1, 0.1]

Table 6: Benchmark environments. Various action bounds are tested, from the default range (the largest one listed for each environment) down to smaller ones. For high-dimensional environments such as Walker2D, HalfCheetah, Ant, and Humanoid, we only test on action ranges smaller than the default, due to the long computation time of MIP; a smaller action bound results in a MIP that solves faster.

Hyperparameters for CAQL and NAF            | Value(s)
Discount factor                             | 0.99
Exploration policy                          | N(0, σ = 1)
Exploration noise (σ) decay                 | 0.9995, 0.9999
Exploration noise (σ) minimum               | 0.01
Soft target update rate (τ)                 | 0.001
Replay memory size                          | 10^5
Mini-batch size                             | 64
Q-function learning rates                   | 0.001, 0.0005, 0.0002, 0.0001
Action function learning rates (CAQL only)  | 0.001, 0.0005, 0.0002, 0.0001
Tolerance decay (dynamic tolerance)         | 0.995, 0.999, 0.9995
Penalty λ (CAQL with hinge loss)            | 0.1, 1.0, 10.0
Neural network optimizer                    | Adam

Table 7: Hyperparameter settings for CAQL (+MIP, GA, CEM) and NAF. We sweep over the Q-function learning rates, action function learning rates, and exploration noise decays.

Hyperparameters for DDPG, TD3, SAC          | Value(s)
Discount factor                             | 0.99
Exploration policy (DDPG and TD3)           | N(0, σ = 1)
Exploration noise (σ) decay (DDPG and TD3)  | 0.9995
Exploration noise (σ) minimum (DDPG and TD3)| 0.01
Temperature (SAC)                           | 0.99995, 0.99999
Soft target update rate (τ)                 | 0.001
Replay memory size                          | 10^5
Mini-batch size                             | 64
Critic learning rates                       | 0.001, 0.0005, 0.0002, 0.0001
Actor learning rates                        | 0.001, 0.0005, 0.0002, 0.0001
Neural network optimizer                    | Adam

Table 8: Hyperparameter settings for DDPG, TD3, and SAC. We sweep over the critic learning rates, actor learning rates, temperature, and exploration noise decays.

We use a two-hidden-layer neural network with ReLU activations (32 units in the first layer, 16 units in the second) for both the Q-function and the action function. The input to the Q-function is the concatenation of the state representation and the action variables; the Q-function has a single linear output unit (no ReLU). The input to the action function is the state representation alone; its output layer has $d$ linear units (no ReLU), where $d$ is the action dimension of the benchmark environment. We use SCIP 6.0.0 (Gleixner et al., 2018) as the MIP solver, with a time limit of 60 seconds and an optimality gap limit of $10^{-4}$ in all experiments. For GA and CEM, a maximum of 20 iterations and a convergence threshold of $10^{-6}$ are used in all experiments unless stated otherwise.
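For illustration, here is a minimal PyTorch sketch of the two networks described above (the paper does not specify a framework; module and variable names are ours):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(x, a): two hidden ReLU layers (32, 16), single linear output."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),                 # no ReLU on the output
        )

    def forward(self, x, a):
        # The Q-function input is the concatenated (state, action) vector.
        return self.net(torch.cat([x, a], dim=-1))

class ActionNetwork(nn.Module):
    """pi_w(x): same hidden sizes, d-dimensional linear output."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, action_dim),        # no ReLU on the output
        )

    def forward(self, x):
        return self.net(x)
```

Keeping the Q-network small and piecewise-linear (ReLU only) is what keeps the MIP formulation of the max-Q problem tractable.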
E ADDITIONAL EXPERIMENTAL RESULTS

E.1 OPTIMIZER SCALABILITY

Table 9 reports the average elapsed time of the various optimizers when computing max-Q values in the experimental setup described in Appendix D. MIP is more robust to the action dimension than GA and CEM, but its latency depends on the state of the neural network weights: it takes longer with dense NN weights and can be substantially quicker with sparse ones. Figure 3 shows the average elapsed time of MIP over training steps for several benchmarks. We observe that MIP is very slow at the beginning of training but quickly becomes faster; this trend holds for most benchmarks except Humanoid. We speculate that the Q-function's NN weights are dense early in training and gradually become more structured (e.g., sparser), so the max-Q problem becomes easier for MIP.

Env. [Action range]     | GA                          | CEM                         | MIP
Hopper [-1, 1]          | Med(κ): 0.093, SD(κ): 0.015 | Med(κ): 0.375, SD(κ): 0.059 | Med(κ): 83.515, SD(κ): 209.172
Walker2D [-0.5, 0.5]    | Med(κ): 0.093, SD(κ): 0.015 | Med(κ): 0.406, SD(κ): 0.075 | Med(κ): 104.968, SD(κ): 178.797
HalfCheetah [-0.5, 0.5] | Med(κ): 0.093, SD(κ): 0.015 | Med(κ): 0.296, SD(κ): 0.072 | Med(κ): 263.453, SD(κ): 88.269
Ant [-0.25, 0.25]       | Med(κ): 0.109, SD(κ): 0.018 | Med(κ): 0.343, SD(κ): 0.054 | Med(κ): 87.640, SD(κ): 160.921
Humanoid [-0.25, 0.25]  | Med(κ): 0.140, SD(κ): 0.022 | Med(κ): 0.640, SD(κ): 0.113 | Med(κ): 71.171, SD(κ): 45.763

Table 9: Median and standard deviation of the average elapsed time κ (in msec) of the various solvers on the max-Q problem.

Figure 3: Average elapsed time (in msec) for MIP max-Q computation with constant and dynamic optimality gap, versus training steps (×10³), on HalfCheetah [-0.5, 0.5], Ant [-0.25, 0.25], and Humanoid [-0.25, 0.25].

E.2 ADDITIONAL RESULTS

Figure 4: Mean cumulative reward of the best hyperparameter configuration over 10 random seeds, comparing MIP, GA, CEM, NAF, DDPG, TD3, and SAC on Pendulum, Hopper, Walker2D, HalfCheetah, Ant, and Humanoid under multiple action ranges per environment. The shaded area is (a) the full or (b) the half standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.
Figure 5: Mean cumulative reward over all 320 configurations (32 hyperparameter combinations × 10 random seeds), comparing MIP, GA, CEM, NAF, DDPG, TD3, and SAC on Pendulum, Hopper, Walker2D, HalfCheetah, Ant, and Humanoid under multiple action ranges per environment. The shaded area is (a) the full or (b) the half standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Figure 6: Mean cumulative reward of the best hyperparameter configuration over 10 random seeds, comparing GA, DDPG, TD3, and SAC on Hopper, Walker2D, HalfCheetah, and Ant (action ranges [-0.25, 0.25] and [-1, 1]) and Humanoid (action ranges [-0.1, 0.1] and [-0.4, 0.4]), for longer training runs (up to 2×10⁶ steps). The shaded area is (a) the full or (b) the half standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is 1000 steps.
E.3 ABLATION ANALYSIS

Figure 7: Ablation of max-Q speed-up options for CAQL-GA (legend: GA_L2, GA_DFC_L2, GA_DFC_0_25_L2, GA_DFC_0_5_L2, DUAL_L2) on Walker2D [-1, 1], Hopper [-1, 1], Humanoid [-0.4, 0.4], HalfCheetah [-1, 1], Ant [-0.5, 0.5], and Pendulum [-2, 2]. Mean cumulative reward of the best hyperparameter configuration over 10 random seeds. Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Figure 8: Same comparison as Figure 7, showing instead the mean cumulative reward over all 320 configurations (32 hyperparameter combinations × 10 random seeds). Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Figure 9: Ablation of tolerance settings for CAQL-GA (legend: GA_200_L2_TOL_MIN, GA_200_L2_TOL_100, GA_200_L2_TOL_DYN_100, GA_200_L2_TOL_DYN_1_0, GA_200_L2_TOL_DYN_0_1) on the same six environments as Figure 7. Mean cumulative reward of the best hyperparameter configuration over 10 random seeds. Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Figure 10: Same comparison as Figure 9, showing instead the mean cumulative reward over all 320 configurations (32 hyperparameter combinations × 10 random seeds). Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Figure 11: Comparison of CAQL-MIP with and without the dynamic optimality gap (legend: CONSTANT, DYNAMIC) on HalfCheetah [-0.5, 0.5], Ant [-0.1, 0.1], Ant [-0.25, 0.25], Humanoid [-0.1, 0.1], and Humanoid [-0.25, 0.25], in terms of mean return over 10 random seeds. Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.
Figure 12: Comparison of the ℓ2 and hinge losses for CAQL-GA (legend: GA_L2, GA_HINGE) on Walker2D [-1, 1], Hopper [-1, 1], Humanoid [-0.4, 0.4], HalfCheetah [-1, 1], Ant [-0.5, 0.5], and Pendulum [-2, 2]. Mean cumulative reward of the best hyperparameter configuration over 10 random seeds. Shaded area is standard deviation. Data points are averaged over a sliding window of size 6. The length of an episode is limited to 200 steps.

Env. [Action range]   | ℓ2 CAQL-GA         | Hinge CAQL-GA
Pendulum [-2, 2]      | -244.488 ± 497.467 | -232.307 ± 476.410
Hopper [-1, 1]        | 253.459 ± 158.578  | 130.846 ± 92.298
Walker2D [-1, 1]      | 207.916 ± 100.669  | 173.160 ± 106.138
HalfCheetah [-1, 1]   | 628.440 ± 234.809  | 532.160 ± 192.558
Ant [-0.5, 0.5]       | 256.804 ± 90.335   | 199.667 ± 84.116
Humanoid [-0.4, 0.4]  | 443.735 ± 223.856  | 297.097 ± 151.533

Table 10: Mean ± standard deviation of (95th-percentile) final returns over all 320 configurations (32 hyperparameter combinations × 10 random seeds). The full training curves are given in Figure 12.