Published as a conference paper at ICLR 2020

GENDICE: GENERALIZED OFFLINE ESTIMATION OF STATIONARY VALUES

Ruiyi Zhang1, Bo Dai2, Lihong Li2, Dale Schuurmans2
1Duke University, 2Google Research, Brain Team
(Equal contribution. Work done while interning at Google.)

ABSTRACT

An important problem that arises in reinforcement learning and Monte Carlo methods is estimating quantities defined by the stationary distribution of a Markov chain. In many real-world applications, access to the underlying transition operator is limited to a fixed set of data that has already been collected, without additional interaction with the environment being available. We show that consistent estimation remains possible in this challenging scenario, and that effective estimation can still be achieved in important applications. Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization. The resulting algorithm, GenDICE, is straightforward and effective. We prove its consistency under general conditions, provide an error analysis, and demonstrate strong empirical performance on benchmark problems, including off-line PageRank and off-policy policy evaluation.

1 INTRODUCTION

Estimation of quantities defined by the stationary distribution of a Markov chain lies at the heart of many scientific and engineering problems. Famously, the steady-state distribution of a random walk on the World Wide Web provides the foundation of the PageRank algorithm (Langville & Meyer, 2004). In many areas of machine learning, Markov chain Monte Carlo (MCMC) methods are used to conduct approximate Bayesian inference by considering Markov chains whose equilibrium distribution is a desired posterior (Andrieu et al., 2002). An example from engineering is queueing theory, where the queue lengths and waiting time under the limiting distribution have been extensively studied (Gross et al., 2018). As we will also see below, stationary distribution quantities are of fundamental importance in reinforcement learning (RL) (e.g., Tsitsiklis & Van Roy, 1997).

Classical algorithms for estimating stationary distribution quantities rely on the ability to sample next states from the current state by directly interacting with the environment (as in on-line RL or MCMC), or even require the transition probability distribution to be given explicitly (as in PageRank). Unfortunately, these classical approaches are inapplicable when direct access to the environment is not available, which is often the case in practice. There are many practical scenarios where a collection of sampled trajectories is available, having been collected off-line by an external mechanism that chose states and recorded the subsequent next states. Given such data, we still wish to estimate a stationary quantity. One important example is off-policy policy evaluation in RL, where we wish to estimate the value of a policy different from that used to collect experience. Another example is off-line PageRank (OPR), where we seek to estimate the relative importance of webpages given a sample of the web graph. Motivated by the importance of these off-line scenarios, and by the inapplicability of classical methods, we study the problem of off-line estimation of stationary values via a stationary distribution corrector.
Instead of having access to the transition probabilities or a next-state sampler, we assume only access to a fixed sample of state transitions, where states have been sampled from an unknown distribution and next-states are sampled according to the Markov chain's transition operator. The off-line setting is indeed more challenging than its more traditional on-line counterpart, given that one must infer an asymptotic quantity from finite data. Nevertheless, we develop techniques that still allow consistent estimation under general conditions, and provide effective estimates in practice. The main contributions of this work are:

• We formalize the problem of off-line estimation of stationary quantities, which captures a wide range of practical applications.
• We propose a novel stationary distribution estimator, GenDICE, for this task. The resulting algorithm is based on a new dual embedding formulation for divergence minimization, with a carefully designed mechanism that explicitly eliminates degenerate solutions.
• We theoretically establish consistency and other statistical properties of GenDICE, and empirically demonstrate that it achieves significant improvements on several behavior-agnostic off-policy evaluation benchmarks and an off-line version of PageRank.

The methods we develop in this paper fundamentally extend recent work in off-policy policy evaluation (Liu et al., 2018; Nachum et al., 2019) by introducing a new formulation that leads to a more general, and as we will show, more effective estimation method.

2 BACKGROUND

We first introduce off-line PageRank (OPR) and off-policy policy evaluation (OPE) as two motivating domains, where the goal is to estimate stationary quantities given only off-line access to a set of sampled transitions from an environment.

Off-line PageRank (OPR) The celebrated PageRank algorithm (Page et al., 1999) defines the ranking of a web page in terms of its asymptotic visitation probability under a random walk on the (augmented) directed graph specified by the hyperlinks. If we denote the World Wide Web by a directed graph G = (V, E) with vertices (web pages) v ∈ V and edges (hyperlinks) (v, u) ∈ E, PageRank considers the random walk defined by the Markov transition operator v → u:

P(u|v) = (1 − η) · 1_{(v,u)∈E} / |v| + η / |V|,   (1)

where |v| denotes the out-degree of vertex v and η ∈ [0, 1) is the probability of teleporting to any page uniformly. Define d_t(v) := P(s_t = v | s_0 ∼ µ_0, ∀i < t: s_{i+1} ∼ P(·|s_i)), where µ_0 is the initial distribution over vertices; then the original PageRank algorithm explicitly iterates for the limit

d(v) := lim_{t→∞} d_t(v)  if γ = 1;   d(v) := (1 − γ) Σ_{t=0}^{∞} γ^t d_t(v)  if γ ∈ (0, 1).   (2)

The classical version of this problem is solved by tabular methods that simulate Equation 1. However, we are interested in a more scalable off-line version of the problem where the transition model is not explicitly given. Instead, consider estimating the rank of a particular web page v* from a large web graph, given only a sample D = {(v, u)_i}_{i=1}^{N} from a random walk on G as specified above. We would still like to estimate d(v*) based on this data. First, note that if one knew the distribution p by which any vertex v appeared in D, the target quantity could be re-expressed by a simple importance ratio

d(v*) = E_{v∼p}[ (d(v) / p(v)) · 1{v = v*} ].
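To make the random walk in Equation 1 and the importance-ratio identity above concrete, here is a minimal NumPy sketch (our illustration, not code from the paper): it builds the transition matrix for a small made-up graph, runs the power iteration behind the γ = 1 case of Equation 2, and verifies that d(v*) equals E_{v∼p}[(d(v)/p(v)) 1{v = v*}] for an arbitrary, assumed-known sampling distribution p.

```python
import numpy as np

# Toy directed graph as an adjacency list over 4 vertices (made up for illustration).
edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, eta = 4, 0.15

# Transition operator of Equation 1: follow a uniform out-link w.p. 1 - eta,
# teleport to a uniformly random page w.p. eta.
P = np.full((n, n), eta / n)
for v, outs in edges.items():
    for u in outs:
        P[v, u] += (1.0 - eta) / len(outs)

# Power iteration for the gamma = 1 case of Equation 2: d_{t+1}(u) = sum_v d_t(v) P(u|v).
d = np.full(n, 1.0 / n)
for _ in range(1000):
    d = d @ P

# Importance-ratio re-expression of d(v*) under a known sampling distribution p.
v_star, p = 2, np.array([0.4, 0.3, 0.2, 0.1])
lhs = d[v_star]
rhs = np.sum(p * (d / p) * (np.arange(n) == v_star))  # E_{v~p}[ d(v)/p(v) * 1{v = v*} ]
print(lhs, rhs)  # the two quantities agree
```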
Therefore, if one had the correction ratio function τ(v) = d(v)/p(v), an estimate of d(v*) can easily be recovered via d(v*) ≈ p̂(v*) τ(v*), where p̂(v*) is the empirical probability of v* estimated from D. The main attack on the problem we investigate is to recover a good estimate of the ratio function τ.

Policy Evaluation An important generalization of this stationary value estimation problem arises in RL in the form of policy evaluation. Consider a Markov Decision Process (MDP) M = ⟨S, A, P, R, γ, µ_0⟩ (Puterman, 2014), where S is a state space, A is an action space, P(s′|s, a) denotes the transition dynamics, R is a reward function, γ ∈ (0, 1] is a discount factor, and µ_0 is the initial state distribution. Given a policy, which chooses actions in any state s according to the probability distribution π(·|s), a trajectory β = (s_0, a_0, r_0, s_1, a_1, r_1, ...) is generated by first sampling the initial state s_0 ∼ µ_0, and then, for t ≥ 0, a_t ∼ π(·|s_t), r_t ∼ R(s_t, a_t), and s_{t+1} ∼ P(·|s_t, a_t). The value of a policy π is the expected per-step reward, defined as

Average:  R(π) := lim_{T→∞} (1/(T+1)) E[ Σ_{t=0}^{T} r_t ],   Discounted:  R_γ(π) := (1 − γ) E[ Σ_{t=0}^{∞} γ^t r_t ].   (3)

In the above, the expectation is taken with respect to the randomness in the state-action transitions, s′ ∼ P(·|s, a), a′ ∼ π(·|s′), and the reward R(s_t, a_t). Without loss of generality, we assume the limit exists for the average case, and hence R(π) is finite.

Behavior-agnostic Off-Policy Evaluation (OPE) An important setting of policy evaluation that often arises in practice is to estimate R_γ(π) or R(π) given a fixed dataset D = {(s, a, r, s′)_i}_{i=1}^{N}, with (s, a) ∼ p(s, a) and s′ ∼ P(·|s, a), where p(s, a) is an unknown distribution induced by multiple unknown behavior policies. This problem is different from the classical form of OPE, where it is assumed that a known behavior policy π_b is used to collect transitions. In the behavior-agnostic scenario, however, typical importance sampling (IS) estimators (e.g., Precup et al., 2000) do not apply. Even if one can assume D consists of trajectories where the behavior policy can be estimated from data, it is known that straightforward IS estimators suffer a variance exponential in the trajectory length, known as the curse of horizon (Jiang & Li, 2016; Liu et al., 2018).

Let d^π_t(s, a) := P(s_t = s, a_t = a | s_0 ∼ µ_0, ∀i < t: a_i ∼ π(·|s_i), s_{i+1} ∼ P(·|s_i, a_i)). The stationary distribution can then be defined as

µ^π_γ(s, a) := lim_{T→∞} (1/(T+1)) Σ_{t=0}^{T} d^π_t(s, a) = lim_{t→∞} d^π_t(s, a),  if γ = 1;
µ^π_γ(s, a) := (1 − γ) Σ_{t=0}^{∞} γ^t d^π_t(s, a),  if γ ∈ (0, 1).   (4)

With this definition, R(π) and R_γ(π) can be equivalently re-expressed as

R_γ(π) = E_{µ^π_γ}[R(s, a)] = E_p[ (µ^π_γ(s, a) / p(s, a)) · R(s, a) ].   (5)

Here we see once again that if we had the correction ratio function τ(s, a) = µ^π_γ(s, a) / p(s, a), a straightforward estimate of R_γ(π) could be recovered via R_γ(π) ≈ E_p̂[τ(s, a) R(s, a)], where p̂(s, a) is an empirical estimate of p(s, a). In this way, the behavior-agnostic OPE problem can be reduced to estimating the correction ratio function τ, as above. We note that Liu et al. (2018) and Nachum et al. (2019) also exploit Equation 5 to reduce OPE to stationary distribution correction, but these prior works are distinct from the current proposal in different ways. First, the inverse propensity score (IPS) method of Liu et al. (2018) assumes the transitions are sampled from a single behavior policy, which must be known beforehand; hence that approach is not applicable in the behavior-agnostic OPE setting.
Second, the recent DualDICE algorithm (Nachum et al., 2019) is also a behavior-agnostic OPE estimator, but its derivation relies on a change-of-variable trick that is only valid for γ < 1. This previous formulation becomes unstable when γ → 1, as shown in Section 6 and Appendix E. The behavior-agnostic OPE estimator we derive below in Section 3 is applicable both when γ = 1 and when γ ∈ (0, 1). This generality is why we name the new estimator GenDICE, for GENeralized stationary DIstribution Correction Estimation.

3 GENDICE

As noted, there are important estimation problems in the Markov chain and MDP settings that can be recast as estimating a stationary distribution correction ratio. We first outline the conditions that characterize the correction ratio function τ, upon which we construct the objective for the GenDICE estimator, and design an efficient algorithm for optimization. We will develop our approach for the more general MDP setting, with the understanding that all the methods and results can be easily specialized to the Markov chain setting.

3.1 ESTIMATING STATIONARY DISTRIBUTION CORRECTION

The stationary distribution µ^π_γ defined in Equation 4 can also be characterized via

µ(s′, a′) = (1 − γ) µ_0(s′) π(a′|s′) + γ (T µ)(s′, a′),  ∀(s′, a′) ∈ S × A,   (6)

where (T µ)(s′, a′) := ∫ π(a′|s′) P(s′|s, a) µ(s, a) ds da. At first glance, this equation shares a superficial similarity to the Bellman equation, but there is a fundamental difference. The Bellman operator recursively integrates out future (s′, a′) pairs to characterize the value of a current pair (s, a), whereas the distribution operator T defined in Equation 6 operates in the reverse temporal direction. When γ < 1, Equation 6 always has a fixed-point solution. For γ = 1, in the discrete case, the fixed point exists as long as T is ergodic; in the continuous case, the conditions for fixed-point existence become more complicated (Meyn & Tweedie, 2012) and beyond the scope of this paper. The development below is based on a divergence D and the following default assumption.

Assumption 1 (Markov chain regularity) For the given target policy π, the resulting state-action transition operator T has a unique stationary distribution µ that satisfies D(T µ ‖ µ) = 0.

In the behavior-agnostic setting we consider, one does not have direct access to P for element-wise evaluation or sampling, but instead is given a fixed set of samples drawn from P(s′|s, a) p(s, a) with respect to some distribution p(s, a) over S × A. Define T^p_{γ,µ_0} to be a mixture of µ_0 π and T_p; i.e., let

T^p_{γ,µ_0}((s′, a′), (s, a)) := (1 − γ) µ_0(s′) π(a′|s′) + γ T_p((s′, a′), (s, a)),   (7)

where T_p((s′, a′), (s, a)) := π(a′|s′) P(s′|s, a) p(s, a). Obviously, conditioning on (s, a, s′) one could easily sample a′ ∼ π(·|s′) to form (s, a, s′, a′) ∼ T_p((s′, a′), (s, a)); similarly, a sample (s′, a′) ∼ µ_0(s′) π(a′|s′) could be formed from an initial state s′ ∼ µ_0. Mixing such samples with probability γ and 1 − γ, respectively, yields a sample (s, a, s′, a′) ∼ T^p_{γ,µ_0}((s′, a′), (s, a)). Based on these observations, the stationary condition for the ratio from Equation 6 can be re-expressed in terms of T^p_{γ,µ_0} as

p(s′, a′) τ(s′, a′) = (1 − γ) µ_0(s′) π(a′|s′) + γ ∫ π(a′|s′) P(s′|s, a) p(s, a) τ(s, a) ds da =: (T^p_{γ,µ_0} τ)(s′, a′),   (8)

where τ(s, a) := µ(s, a) / p(s, a) is the correction ratio function we seek to estimate. One natural approach to estimating τ is to match the LHS and RHS of Equation 8 with respect to some divergence D(·‖·) over the empirical samples.
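Before turning this matching into an explicit optimization (next), note that samples from T^p_{γ,µ_0} are straightforward to construct from off-line data exactly as described above. The sketch below is our illustration, not an interface from the paper: the dataset, initial-state sampler, and policy are assumed to be available as plain Python callables.

```python
import random

def sample_T_gamma(dataset, sample_init_state, policy, gamma):
    """Draw one sample from the mixture T^p_{gamma, mu0} described above.

    dataset           : list of logged (s, a, s_next) transitions, i.e. samples from P(s'|s,a) p(s,a)
    sample_init_state : callable returning s0 ~ mu0
    policy            : callable s -> action, sampling a ~ pi(.|s)
    gamma             : discount factor in (0, 1]
    """
    if random.random() < gamma:
        # Transition branch Tp: take a logged (s, a, s') and complete it with a' ~ pi(.|s').
        s, a, s_next = random.choice(dataset)
        return ("transition", s, a, s_next, policy(s_next))
    # Initial-state branch mu0 * pi: a pair (s', a') with s' ~ mu0 and a' ~ pi(.|s').
    s0 = sample_init_state()
    return ("initial", None, None, s0, policy(s0))
```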
That is, we consider estimating τ by solving the optimization problem

min_{τ ≥ 0} D(T^p_{γ,µ_0} τ ‖ p · τ).   (9)

Although this forms the basis of our approach, there are two severe issues with this naive formulation that first need to be rectified:

i) Degenerate solutions: When γ = 1, the operator T^p_{γ=1,µ_0} is invariant to constant rescaling of τ: if τ satisfies Equation 8, then so does cτ for any c ≥ 0. Therefore, simply minimizing the divergence D(T^p_{γ=1,µ_0} τ ‖ p · τ) cannot provide a desirable estimate of τ. In fact, in this case even the trivial solution τ(s, a) ≡ 0 cannot be eliminated.

ii) Intractable objective: The divergence D(T^p_{γ,µ_0} τ ‖ p · τ) involves the computation of T^p_{γ,µ_0} τ, which in general involves an intractable integral. Thus, evaluation of the exact objective is intractable, and neglects the assumption that we only have access to samples from T^p_{γ,µ_0} and are not able to evaluate it at arbitrary points.

We address each of these two issues in a principled manner.

3.2 ELIMINATING DEGENERATE SOLUTIONS

To avoid degenerate solutions when γ = 1, we ensure that the solution is a proper density ratio; that is, the property τ ∈ Ξ := {τ(·) ≥ 0, E_p[τ] = 1} must be true of any τ that is a ratio of some density to p. This provides an additional constraint that we add to the optimization formulation

min_{τ ≥ 0} D(T^p_{γ,µ_0} τ ‖ p · τ),  s.t.  E_p[τ] = 1.   (10)

With this additional constraint, it is obvious that the trivial solution τ(s, a) ≡ 0 is eliminated as an infeasible point of Equation 10, along with the other degenerate solutions τ(s, a) = c τ*(s, a) with c ≠ 1. Unfortunately, exactly solving an optimization with expectation constraints is very complicated in general (Lan & Zhou, 2016), particularly given a nonlinear parameterization for τ. The penalty method (Luenberger & Ye, 2015) provides a much simpler alternative, where a sequence of regularized problems is solved,

min_{τ ≥ 0} J(τ) := D(T^p_{γ,µ_0} τ ‖ p · τ) + (λ/2) (E_p[τ] − 1)^2,   (11)

with λ increasing. The drawback of the penalty method is that it generally requires λ → ∞ to ensure strict feasibility, which is still impractical, especially with stochastic gradient descent: an infinite λ may induce unbounded variance in the gradient estimator, and thus divergence in optimization. However, by exploiting the special structure of the solution sets to Equation 11, we can show that, remarkably, it is unnecessary to increase λ.

Theorem 1 For γ ∈ (0, 1] and any λ > 0, the solution to Equation 11 is given by τ*(s, a) = µ(s, a) / p(s, a).

The detailed proof for Theorem 1 is given in Appendix A.1. By Theorem 1, we can estimate the desired correction ratio function τ* by solving only one optimization with an arbitrary λ > 0.

3.3 EXPLOITING DUAL EMBEDDING

The optimization in Equation 11 involves the integrals T^p_{γ,µ_0} τ and E_p[τ] inside nonlinear loss functions, hence appears difficult to solve. Moreover, obtaining unbiased gradients with a naive approach requires double sampling (Baird, 1995). Instead, we bypass both difficulties by applying a dual embedding technique (Dai et al., 2016; 2018). In particular, we assume the divergence D is in the form of an f-divergence (Nowozin et al., 2016),

D_φ(T^p_{γ,µ_0} τ ‖ p · τ) := ∫ p(s, a) τ(s, a) φ( (T^p_{γ,µ_0} τ)(s, a) / (p(s, a) τ(s, a)) ) ds da,

where φ(·): R_+ → R is a convex, lower-semicontinuous function with φ(1) = 0. Plugging this into J(τ) in Equation 11, we can easily check the convexity of the objective.

Theorem 2 For an f-divergence with valid φ defining D_φ, the objective J(τ) is convex w.r.t. τ.

The detailed proof is provided in Appendix A.2.
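The degeneracy discussed in Section 3.1 and the role of the penalty in Equation 11 (Theorem 1) can be sanity-checked on a small tabular chain. The sketch below is our illustration, with a made-up 3-state transition matrix and a squared-residual surrogate in place of the f-divergence: for γ = 1, the stationarity condition of Equation 8 is satisfied by every rescaling cτ* (including the trivial c = 0), while only c = 1 also drives the normalization penalty to zero.

```python
import numpy as np

# Made-up 3-state Markov chain (rows sum to 1) and offline sampling distribution p.
P_pi = np.array([[0.1, 0.6, 0.3],
                 [0.4, 0.2, 0.4],
                 [0.5, 0.3, 0.2]])
p = np.array([0.5, 0.3, 0.2])

# Stationary distribution mu of P_pi (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def residual_and_penalty(tau, lam=1.0):
    # gamma = 1 stationarity condition (Equation 8): p*tau should equal P_pi^T (p*tau).
    lhs = p * tau
    rhs = P_pi.T @ (p * tau)
    stationarity = np.sum((lhs - rhs) ** 2)           # surrogate for D(T^p tau || p tau)
    penalty = 0.5 * lam * (np.sum(p * tau) - 1.0) ** 2
    return stationarity, penalty

tau_true = mu / p
for c in [1.0, 0.5, 2.0, 0.0]:
    print(c, residual_and_penalty(c * tau_true))
# The stationarity residual is ~0 for every scaling c (including the trivial c = 0),
# but only c = 1 also zeroes the normalization penalty, in line with Theorem 1.
```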
Recall that a suitable convex function can be represented as φ(x) = max_f {x·f − φ*(f)}, where φ* is the Fenchel conjugate of φ(·). In particular, we have the representation (1/2) x^2 = max_u {u·x − (1/2) u^2}, which allows us to re-express the objective as

J(τ) = ∫ p(s′, a′) τ(s′, a′) max_f { [(T^p_{γ,µ_0} τ)(s′, a′) / (p(s′, a′) τ(s′, a′))] f − φ*(f) } ds′ da′ + λ max_u { u (E_p[τ] − 1) − u^2/2 }.   (12)

Applying the interchangeability principle (Shapiro et al., 2014; Dai et al., 2016), one can replace the inner max in the first term over a scalar f with a max over a function f(·, ·): S × A → R, yielding

min_{τ ≥ 0} max_{f: S×A → R, u ∈ R} J(τ, u, f) = (1 − γ) E_{µ_0 π}[f(s, a)] + γ E_{T_p}[τ(s, a) f(s′, a′)] − E_p[τ(s, a) φ*(f(s, a))] + λ ( E_p[u τ(s, a) − u] − u^2/2 ).   (13)

This yields the main optimization formulation, which avoids the aforementioned difficulties and is well-suited for practical optimization as discussed in Section 3.4.

Remark (Other divergences): In addition to the f-divergence, the proposed estimator in Equation 11 is compatible with other divergences, such as integral probability metrics (IPMs) (Müller, 1997; Sriperumbudur et al., 2009), while retaining consistency. Based on the definition of the IPM, these divergences directly lead to min-max optimizations similar to Equation 13, with the identity function as φ(·) and different feasible sets for the dual functions. Specifically, maximum mean discrepancy (MMD) (Smola et al., 2006) requires ‖f‖_{H_k} ≤ 1, where H_k denotes the RKHS with kernel k; the Dudley metric (Dudley, 2002) requires ‖f‖_{BL} ≤ 1, where ‖f‖_{BL} := ‖f‖_∞ + ‖∇f‖_2; and the Wasserstein distance (Arjovsky et al., 2017) requires ‖∇f‖_2 ≤ 1. These additional requirements on the dual function might incur some extra difficulty in practice. For example, with the Wasserstein distance and the Dudley metric, we might need to include an extra gradient penalty (Gulrajani et al., 2017), which requires additional computation to take the gradient through a gradient. Meanwhile, the consistency of the surrogate loss under such regularization is not clear. For MMD, we can obtain a closed-form solution for the dual function, which saves the cost of the inner optimization (Gretton et al., 2012), but with the tradeoff of requiring two independent samples in each outer optimization update. Moreover, MMD relies on the condition that the dual function lies in some RKHS, which introduces additional kernel parameters to be tuned and in practice may not be sufficiently flexible compared to neural networks.

3.4 A PRACTICAL ALGORITHM

We have derived a consistent stationary distribution correction estimator in the form of the min-max saddle-point optimization in Equation 13. Here, we present a practical instantiation of GenDICE with a concrete objective and parametrization. We choose the χ²-divergence, which is an f-divergence with φ(x) = (x − 1)^2 and φ*(y) = y + y^2/4. The objective becomes

J_{χ²}(τ, u, f) = (1 − γ) E_{µ_0 π}[f(s, a)] + γ E_{T_p}[τ(s, a) f(s′, a′)] − E_p[ τ(s, a) (f(s, a) + (1/4) f^2(s, a)) ] + λ ( E_p[u τ(s, a) − u] − u^2/2 ).   (14)

There are two major reasons for adopting the χ²-divergence:

i) In the behavior-agnostic OPE problem, we mainly use the ratio correction function for estimating Ê_p[τ̂(s, a) R(s, a)], which is an expectation. Recall that the error between the estimate and the ground truth can then be bounded by the total variation distance, which is a lower bound of the χ²-divergence.
ii) For the alternative divergences, the conjugate of the KL-divergence involves exp(·), which may lead to instability in optimization, while the IPM variants introduce extra constraints on the dual function, which may be difficult to optimize. The conjugate function of the χ²-divergence enjoys suitable numerical properties and provides squared regularization. We provide an empirical ablation study that investigates the alternative divergences in Section 6.3.

To parameterize the correction ratio τ and the dual function f we use neural networks, τ(s, a) = nn_{w_τ}(s, a) and f(s, a) = nn_{w_f}(s, a), where w_τ and w_f denote the parameters of τ and f, respectively. Since the optimization requires τ to be non-negative, we add an extra positive neuron, such as exp(·), log(1 + exp(·)), or (·)², at the final layer of nn_{w_τ}(s, a). We empirically compare these different positive neurons in Section 6.3. For these representations, an unbiased gradient estimator ∇_{(w_τ, u, w_f)} J(τ, u, f) can be obtained straightforwardly, as shown in Appendix B. This allows us to apply stochastic gradient descent to solve the saddle-point problem in Equation 14 in a scalable manner, as illustrated in Algorithm 1.

4 THEORETICAL ANALYSIS

We provide a theoretical analysis for the proposed GenDICE algorithm, following a learning setting and assumptions similar to those of Nachum et al. (2019).

Assumption 2 The target stationary correction ratio is bounded: ‖τ*‖_∞ ≤ C < ∞.

The main result is summarized in the following theorem. A formal statement, together with the proof, is given in Appendix C.

Theorem 3 (Informal) Under mild conditions, with learnable F and H, the error in the objective between the GenDICE estimate τ̂ and the solution τ*(s, a) = µ(s, a)/p(s, a) is bounded by

E[J(τ̂) − J(τ*)] = Õ( ε_approx(F, H) + 1/√N + ε_opt ),

where E[·] is taken w.r.t. the randomness in D and in the optimization algorithms, ε_opt is the optimization error, and ε_approx(F, H) is the approximation error induced by using (F, H) for the parametrization of (τ, f).

The theorem shows that the suboptimality of GenDICE's solution, measured in terms of the objective function value, can be decomposed into three terms: (1) the approximation error ε_approx, which is controlled by the representation flexibility of the function classes; (2) the estimation error due to sample randomness, which decays at the order of 1/√N; and (3) the optimization error, which arises from the suboptimality of the solution found by the optimization algorithm. As discussed in Appendix C, in special cases this suboptimality can be bounded from below by a divergence between τ̂ and τ*, and therefore directly bounds the error in the estimated policy value. There is also a tradeoff between these three error terms. With more flexible function classes (e.g., neural networks) for F and H, the approximation error ε_approx becomes smaller. However, this may increase the estimation error (through the constant in front of 1/√N) and the optimization error (by creating a harder optimization problem). On the other hand, if F and H are linearly parameterized, the estimation and optimization errors tend to be smaller and can often be upper-bounded explicitly (Appendix C.3). However, the corresponding approximation error will be larger.
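Before turning to related work, the practical objective of Equation 14 is easy to state as code. The sketch below is our illustration of the empirical χ²-instantiated loss on a single minibatch, written for plain arrays rather than any particular deep-learning library; the argument names are ours, and in a real implementation τ and f would be network outputs differentiated by automatic differentiation, with ascent on (f, u) and descent on τ, as in Algorithm 1.

```python
import numpy as np

def gendice_chi2_loss(tau_s, f_s, f_next, f_init, u, gamma, lam):
    """Empirical version of Equation 14 on one minibatch.

    tau_s  : tau(s, a)  evaluated on logged pairs (s, a) ~ p            (shape [B])
    f_s    : f(s, a)    on the same logged pairs                        (shape [B])
    f_next : f(s', a')  with s' from the logged transition, a' ~ pi     (shape [B])
    f_init : f(s0, a0)  with s0 ~ mu0, a0 ~ pi                          (shape [B])
    u      : scalar dual variable for the normalization constraint
    gamma  : discount factor in (0, 1]
    lam    : penalty weight lambda > 0
    """
    term_init = (1.0 - gamma) * np.mean(f_init)
    term_next = gamma * np.mean(tau_s * f_next)
    # chi^2 conjugate: phi*(f) = f + f^2 / 4
    term_conj = np.mean(tau_s * (f_s + 0.25 * f_s ** 2))
    term_norm = lam * (np.mean(u * tau_s - u) - 0.5 * u ** 2)
    # Minimized over tau, maximized over (f, u).
    return term_init + term_next - term_conj + term_norm
```

The final OPE estimate is then the τ̂-weighted average reward Ê_p[τ̂(s, a) R(s, a)] computed on the off-line data.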
5 RELATED WORK

Off-policy Policy Evaluation Off-policy policy evaluation with importance sampling (IS) has been explored in the contextual bandit (Strehl et al., 2010; Dudík et al., 2011; Wang et al., 2017) and episodic RL settings (Murphy et al., 2001; Precup et al., 2001), achieving many empirical successes (e.g., Strehl et al., 2010; Dudík et al., 2011; Bottou et al., 2013). Unfortunately, IS-based methods suffer from exponential variance in long-horizon problems, known as the curse of horizon (Liu et al., 2018). A few variance-reduction techniques have been introduced, but they still cannot eliminate this fundamental issue (Jiang & Li, 2016; Thomas & Brunskill, 2016; Guo et al., 2017). By rewriting the accumulated reward as an expectation w.r.t. a stationary distribution, Liu et al. (2018) and Gelada & Bellemare (2019) recast OPE as estimating a correction ratio function, which significantly alleviates the variance. However, these methods still require the off-policy data to be collected by a single and known behavior policy, which restricts their practical applicability. The only published algorithm in the literature, to the best of our knowledge, that solves behavior-agnostic off-policy evaluation is DualDICE (Nachum et al., 2019). However, DualDICE was developed for discounted problems, and its results become unstable when the discount factor approaches 1 (see below). By contrast, GenDICE can cope with the more challenging problem of undiscounted reward estimation in the general behavior-agnostic setting. Note that standard model-based methods (Sutton & Barto, 1998), which estimate the transition and reward models directly and then calculate the expected reward based on the learned model, are also applicable to the behavior-agnostic setting considered here. Unfortunately, model-based methods typically rely heavily on modeling assumptions about rewards and transition dynamics. In practice, these assumptions do not always hold, and the evaluation results can become unreliable.

Markov Chain Monte Carlo Classical MCMC (Brooks et al., 2011; Gelman et al., 2013) aims at sampling from µ^π by iteratively simulating from the transition operator. It requires continuous interaction with the transition operator and a heavy computational cost to update many particles. Amortized SVGD (Wang & Liu, 2016) and adversarial MCMC (Song et al., 2017; Li et al., 2019) alleviate this issue by combining MCMC with neural networks, but they still interact with the transition operator directly, i.e., in an on-policy setting. The major difference in GenDICE is the learning setting: we only access an off-policy dataset, and cannot sample from the transition operator. The proposed GenDICE leverages stationary density ratio estimation for approximating the stationary quantities, which distinguishes it from classical methods.

Density Ratio Estimation Density ratio estimation is a fundamental tool in machine learning, and much related work exists. Classical density ratio estimation includes moment matching (Gretton et al., 2008), probabilistic classification (Bickel et al., 2007), and ratio matching (Nguyen et al., 2008; Sugiyama et al., 2008; Kanamori et al., 2009). These classical methods focus on estimating the ratio between two distributions with samples from both of them, while GenDICE estimates the density ratio to the stationary distribution of a transition operator, from which even a single sample is difficult to obtain.
PageRank Yao & Schuurmans (2013) developed a reverse-time RL framework for PageRank via solving a reverse Bellman equation, which is less sensitive to graph topology and shows faster adaptation with graph change. However, Yao & Schuurmans (2013) still consider the on-line setting, which is different from our OPR setting.

6 EXPERIMENTS

In this section, we evaluate GenDICE on OPE and OPR problems. For OPE, we use one or multiple behavior policies to collect a fixed number of trajectories at some fixed trajectory length. This data is used to recover a correction ratio function for a target policy π, which is then used to estimate the reward of π in two different settings: i) average reward; and ii) discounted reward. In both settings, we compare with a model-based approach and step-wise weighted IS (Precup et al., 2000). We also compare to Liu et al. (2018) (referred to as IPS here) in the Taxi domain with a learned behavior policy, using the released implementation of IPS from https://github.com/zt95/infinite-horizon-off-policy-estimation. We specifically compare to DualDICE (Nachum et al., 2019) in the discounted reward setting, which is a direct and current state-of-the-art baseline. For OPR, the main comparison is with the model-based method, where the transition operator is empirically estimated and the stationary distribution is recovered via an exact solver. We validate GenDICE in both tabular and continuous cases, and perform an ablation study to further demonstrate its effectiveness. All results are based on 20 random seeds, with mean and standard deviation plotted. Our code is publicly available at https://github.com/zhangry868/GenDICE.

6.1 TABULAR CASE

Offline PageRank on Graphs One direct application of GenDICE is off-line PageRank (OPR). We test GenDICE on a Barabási-Albert (BA) graph (synthetic) and two real-world graphs, Cora and Citeseer. Details of the graphs are given in Appendix D. We use the log KL-divergence between the estimated stationary distribution and the ground truth as the evaluation metric, with the ground truth computed by an exact solver based on the exact transition operator of the graphs. We compare GenDICE with the model-based method in terms of sample efficiency. From the results in Figure 1, GenDICE outperforms the model-based method when limited data is given. Even with 20k samples for a BA graph with 100 nodes, where the transition matrix has 10k entries, GenDICE still shows better performance in the offline setting. This is reasonable since GenDICE directly estimates the stationary distribution vector or ratio, while the model-based method needs to learn an entire transition matrix that has many more parameters.

Figure 1: Stationary distribution estimation on BA and real-world graphs. Each plot shows the log KL-divergence of GenDICE and the model-based method versus the number of samples.

Off-Policy Evaluation with Taxi We use a similar Taxi domain as in Liu et al. (2018), where a grid size of 5×5 yields 2000 states in total (25×16×5, corresponding to 25 taxi locations, 16 passenger appearance statuses, and 5 taxi statuses).
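For concreteness, the 2000-state count above corresponds to flattening the three factors into a single index; the sketch below shows one such encoding (the ordering of the factors is our assumption for illustration, not necessarily that of the released implementation).

```python
def taxi_state_index(taxi_loc, passenger_status, taxi_status):
    """Flatten (taxi location, passenger appearance status, taxi status)
    into a single index in [0, 25 * 16 * 5) = [0, 2000)."""
    assert 0 <= taxi_loc < 25 and 0 <= passenger_status < 16 and 0 <= taxi_status < 5
    return (taxi_loc * 16 + passenger_status) * 5 + taxi_status

print(taxi_state_index(24, 15, 4))  # 1999, the largest index
```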
We set the target policy to a final policy π obtained after running tabular Q-learning for 1000 iterations, and set another policy π_+ obtained after 950 iterations as the base policy. The behavior policy is a mixture controlled by α, i.e., π_b = (1 − α)π + απ_+. For the model-based method, we use a tabular representation for the reward and transition functions, whose entries are estimated from the behavior data. For IS and IPS, we fit a policy via behavior cloning to estimate the policy ratio. In this specific setting, our method achieves better results compared to IS, IPS, and the model-based method. Interestingly, with longer horizons, IS cannot improve as much as the other methods even with more data, while GenDICE consistently improves and achieves much better results than the baselines. DualDICE only works with γ < 1. GenDICE is more stable than DualDICE when γ becomes larger (close to 1), while still showing competitive performance for smaller discount factors γ.

Figure 2: Results on the Taxi domain. The plots show the log MSE of the tabular estimators across different trajectory lengths, discount factors, and behavior policies (x-axis).

6.2 CONTINUOUS CASE

We further test our method for OPE on three control tasks: a discrete-control task, Cartpole, and two continuous-control tasks, Reacher and HalfCheetah. In these tasks, observations (or states) are continuous, so we use neural network function approximators and stochastic optimization. Since DualDICE (Nachum et al., 2019) has shown state-of-the-art performance on discounted OPE, we mainly compare with it in the discounted reward case. We also compare to IS with a policy learned via behavior cloning, and to a neural model-based method, similar to the tabular case but with neural networks as the function approximators. All neural networks are feed-forward with two hidden layers of dimension 64 and tanh activations. More details can be found in Appendix D. Due to limited space, we put the discrete-control results in Appendix E and focus on the more challenging continuous-control tasks. Here, the good performance of IS and the model-based method in Section 6.1 quickly deteriorates as the environment becomes complex, i.e., with a continuous action space. Note that GenDICE is able to maintain good performance in this scenario, even when using function approximation and stochastic optimization. This is reasonable because of the difficulty of fitting the coupled policy-environment dynamics with a continuous action space. Here we also empirically validate GenDICE with off-policy data collected by multiple policies. As illustrated in Figure 3, all methods perform better with longer trajectory lengths or more trajectories. When α becomes larger, i.e., the behavior policies are closer to the target policy, all methods perform better, as expected. GenDICE demonstrates good performance in both the average-reward and discounted-reward cases across different settings. The right two figures in each row show the log MSE curve versus optimization steps, where GenDICE achieves the smallest loss. In the discounted reward case, GenDICE shows significantly better and more stable performance than the strong baseline, DualDICE.
Figure 4 also shows better performance of GenDICE than all baselines in the more challenging HalfCheetah domain.

Figure 3: Results on Reacher. The left three plots in the first row show the log MSE of the estimated average per-step reward over different numbers of trajectories, truncated lengths, and behavior policies (M1 and M2 denote off-policy datasets collected by multiple behavior policies with α ∈ {0.0, 0.33} and α ∈ {0.0, 0.33, 0.66}, respectively). The right two plots show the loss curves over optimization steps. Each plot in the second row shows the average reward case.

Figure 4: Results on HalfCheetah. Plots from left to right show the log MSE of the estimated average per-step reward over different truncated lengths, numbers of trajectories, and behavior policies in the discounted and average reward cases.

6.3 ABLATION STUDY

Finally, we conduct an ablation study on GenDICE to study its robustness and implementation sensitivities. We investigate the effects of the learning rate, activation function, discount factor, and the specifically designed ratio constraint. We further demonstrate the effect of the choice of divergence and the penalty weight.

Effects of the Learning Rate Since we use neural networks as function approximators and stochastic optimization, it is necessary to examine sensitivity to the learning rate; we test the values {0.0001, 0.0003, 0.001, 0.003}, with results in Figure 5. When α = 0.33, the OPE task is relatively easy, and GenDICE obtains good results under all learning-rate settings. However, when α = 0.0, the estimation becomes more difficult, and GenDICE only obtains reasonable results with the larger learning rates. Generally, this ablation study shows that the proposed method is not sensitive to the learning rate and is easy to train.

Activation Function of the Ratio Estimator We further investigate the effect of the activation function on the last layer, which ensures the non-negative output required for the ratio. To better understand which activation function leads to stable training of the neural correction estimator, we empirically compare i) (·)²; ii) log(1 + exp(·)); and iii) exp(·). In practice, we use (·)², since it achieves low variance and better performance in most cases, as shown in Figure 5.

Effects of the Discount Factor We vary γ ∈ {0.95, 0.99, 0.995, 0.999, 1.0} to probe the sensitivity of GenDICE. Specifically, we compare to DualDICE, and find that GenDICE is stable, while DualDICE becomes unstable when γ becomes large, as shown in Figure 6. GenDICE is also more general than DualDICE, as it can be applied to both the average and discounted reward cases.

Effects of the Ratio Constraint In Section 3, we highlighted the importance of the ratio constraint. Here we investigate the trivial-solution issue that arises without the constraint. The results in Figure 6 demonstrate the necessity of adding the constraint penalty, since a trivial solution prevents an accurate corrector from being recovered (green line in the left two figures).

Figure 5: Results of the ablation study with different learning rates and activation functions. The plots show the log MSE of the estimated average per-step reward over training and different behavior policies.
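As a companion to the activation-function ablation above, the following sketch lists the three positive output transforms compared in Figure 5, applied to the raw, unconstrained output of the ratio network; the function names and test values are ours.

```python
import numpy as np

# Three ways to map an unconstrained network output z to a non-negative ratio tau >= 0,
# as compared in the ablation: (.)^2, softplus, and exp.
def square(z):
    return z ** 2

def softplus(z):
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)  # numerically stable log(1 + e^z)

def exponential(z):
    return np.exp(z)

z = np.linspace(-3.0, 3.0, 7)
for name, fn in [("square", square), ("softplus", softplus), ("exp", exponential)]:
    print(name, np.round(fn(z), 3))
```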
Figure 6: Results of the ablation study with the constraint penalty and discount factors. The left two figures show the effect of the ratio constraint on estimating the average per-step reward. The right three figures show the log MSE for the average per-step reward over training and different discount factors γ.

Figure 7: Results of the ablation study with (a) different divergences and (b) the weight of the penalty λ. The plots show the log KL-divergence of OPR on the Barabási-Albert graph.

Effects of the Choice of Divergence We empirically test GenDICE with several alternative divergences, e.g., Wasserstein-1 distance, Jensen-Shannon divergence, KL-divergence, Hellinger divergence, and MMD. To avoid the effects of other factors in the estimator, e.g., function parametrization, we focus on the offline PageRank task on the BA graph with 100 nodes and 10k offline samples. All experiments are evaluated with 20 random trials. To ensure the dual function is 1-Lipschitz, we add a gradient penalty. Besides, we use a learned Gaussian kernel in MMD, similar to Li et al. (2017). As we can see in Figure 7(a), the GenDICE estimator is compatible with many different divergences. Most of the divergences, with appropriate extra techniques to handle the difficulties in optimization and careful tuning of extra parameters, can achieve similar performance, consistent with phenomena observed in the variants of GANs (Lucic et al., 2018). However, the KL-divergence is an outlier, performing noticeably worse, which might be caused by the ill-behaved exp(·) in its conjugate function. The χ²-divergence and JS-divergence are better, achieving good performance with fewer parameters to be tuned.

Effects of the Penalty Weight The results for different penalty weights λ are illustrated in Figure 7(b). We vary λ ∈ [0.1, 5] with the χ²-divergence. Within a large range of λ, the performance of the proposed GenDICE is quite consistent, which justifies Theorem 1. The penalty term is multiplied by λ; therefore, as λ increases, the variance of the stochastic gradient estimator also increases, which explains the increased variance for large λ in Figure 7(b). In practice, λ = 1 is a reasonable choice for general cases.

7 CONCLUSION

In this paper, we proposed a novel algorithm, GenDICE, for general stationary distribution correction estimation, which can handle both the discounted and average stationary distributions given multiple behavior-agnostic samples. Empirical results on off-policy evaluation and offline PageRank show the superiority of the proposed method over existing state-of-the-art methods.

ACKNOWLEDGMENTS

The authors would like to thank Ofir Nachum, the rest of the Google Brain team, and the anonymous reviewers for helpful discussions and feedback.

REFERENCES

Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2002.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Francis R. Bach. Breaking the curse of dimensionality with convex neural networks. CoRR, abs/1412.8690, 2014.

Leemon Baird.
Residual algorithms: reinforcement learning with function approximation. In Proc. Intl. Conf. Machine Learning, pp. 30 37. Morgan Kaufmann, 1995. Steffen Bickel, Michael Br uckner, and Tobias Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th international conference on Machine learning, pp. 81 88. ACM, 2007. L eon Bottou, Jonas Peters, Joaquin Qui nonero-Candela, Denis Xavier Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207 3260, 2013. Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov chain monte carlo. CRC press, 2011. Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. Co RR, abs/1607.04579, 2016. Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1133 1142, 2018. Miroslav Dud ık, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1097 1104, 2011. R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002. Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3647 3655, 2019. Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013. A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723 773, 2012. Arthur Gretton, Alexander J. Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Sch olkopf. Dataset shift in machine learning. In J. Qui nonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence (eds.), Covariate Shift and Local Learning by Distribution Matching, pp. 131 160. MIT Press, 2008. Donald Gross, Carl M. Harris, John F. Shortle, and James M. Thompson. Fundamentals of Queueing Theory. Wiley Series in Probability and Statistics. Wiley, 5th edition, 2018. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767 5777, 2017. Published as a conference paper at ICLR 2020 Zhaohan Guo, Philip S Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2492 2501, 2017. D. Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik Chervonenkis dimension. J. Combinatorial Theory (A), 69(2):217 232, 1995. Was University of CA at Santa Cruz TR UCSC-CRL-91-41. Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 652 661, 2016. Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ar Xiv preprint ar Xiv:1902.00618, 2019. 
Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391 1445, 2009. Guanghui Lan and Zhiqiang Zhou. Algorithms for stochastic optimization with expectation constraints. ar Xiv preprint ar Xiv:1604.03887, 2016. Amy N. Langville and Carl D. Meyer. Deeper inside Page Rank. Internet Mathematics, 2004. Alessandro Lazaric, Mohammad Ghavamzadeh, and R emi Munos. Finite-sample analysis of leastsquares policy iteration. Journal of Machine Learning Research, 13(Oct):3041 3074, 2012. Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. Mmd gan: Towards deeper understanding of moment matching network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 2203 2213. Curran Associates, Inc., 2017. Chunyuan Li, Ke Bai, Jianqiao Li, Guoyin Wang, Changyou Chen, and Lawrence Carin. Adversarial learning of a sampler based on an unnormalized distribution. ar Xiv preprint ar Xiv:1901.00612, 2019. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Qihang Lin, Mingrui Liu, Hassan Rafique, and Tianbao Yang. Solving weakly-convex-weaklyconcave saddle-point problems as weakly-monotone variational inequality. ar Xiv preprint ar Xiv:1810.10207, 2018. Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. Co RR, abs/1906.00331, 2019. Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinitehorizon off-policy estimation. In Advances in Neural Information Processing Systems 31 (NIPS), pp. 5361 5371, 2018. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700 709, 2018. David G. Luenberger and Yinyu Ye. Linear and Nonlinear Programming. Springer Publishing Company, Incorporated, 2015. ISBN 3319188410, 9783319188416. Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Published as a conference paper at ICLR 2020 A. M uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429 443, 1997. Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 2001. Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dual DICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, 2019. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574 1609, January 2009. ISSN 1052-6234. X.L. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems 20, pp. 
1089 1096. MIT Press, Cambridge, MA, 2008. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in neural information processing systems, pp. 271 279, 2016. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Info Lab, 1999. David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012. Doina Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proc. Intl. Conf. Machine Learning, pp. 759 766. Morgan Kaufmann, San Francisco, CA, 2000. Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with funtion approximation. In Proceedings of the 18th Conference on Machine Learning (ICML), pp. 417 424, 2001. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014. ISBN 1107057132, 9781107057135. Alexander Shapiro, Darinka Dentcheva, et al. Lectures on stochastic programming: modeling and theory, volume 16. SIAM, 2014. A. J. Smola, A. Gretton, and K. Borgwardt. Maximum mean discrepancy. In Proc. Annual Conf. Computational Learning Theory, 2006. submitted. Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. In Advances in Neural Information Processing Systems, pp. 5140 5150, 2017. B. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schoelkopf, and G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. Technical Report ar Xiv:0901.2698v4, ar Xiv, 2009. Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pp. 2217 2225, 2010. M. Sugiyama, S. Nakajima, H. Kashima, P. von B unau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pp. 1433 1440, Cambridge, MA, 2008. Richard S Sutton, Csaba Szepesv ari, Alborz Geramifard, and Michael P Bowling. Dynastyle planning with linear function approximation and prioritized sweeping. ar Xiv preprint ar Xiv:1206.3285, 2012. Published as a conference paper at ICLR 2020 R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 2139 2148, 2016. John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674 690, 1997. Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. ar Xiv preprint ar Xiv:1611.01722, 2016. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In ICML, 2017. Hengshuai Yao and Dale Schuurmans. Reinforcement ranking. ar Xiv preprint ar Xiv:1303.5988, 2013. B. Yu. Rates of convergence of empirical processes for mixing sequences. Annals of Probability, 22(1):94 116, 1994. 
A PROPERTIES OF GENDICE For notation simplicity, we denote x = (s, a) Ω:= S A and Pπ (x |x) := π (a |s ) P (s |s, a). Also define f p,2 := f, f p = R f (x)2 p (x) dx. We make the following assumption to ensure the existence of the stationary distribution. Our discussion is all based on this assumption. Assumption 1 Under the target policy, the resulted state-action transition operator T has a unique stationary distribution in terms of the divergence D ( || ). If the total variation divergence is selected, the Assumption 1 requires the transition operator should be ergodic, as discussed in Meyn & Tweedie (2012). A.1 CONSISTENCY OF THE ESTIMATOR Theorem 1 For arbitrary λ > 0, the solution to the optimization Eqn (11) is u(s,a) p(s,a) for γ (0, 1]. Proof For γ (0, 1), there is not degenerate solutions to D T p γ,µ0 τ ||p τ . The optimal solution is a density ratio. Therefore, the extra penalty Ep(x) [τ (x)] 1 2 does not affect the optimality for λ > 0. When γ = 1, for λ > 0, recall both D T p γ,µ0 τ ||p τ and Ep(x) [τ (x)] 1 2 are nonnegative, and the the density ratio µ(x) p(x) leads to zero for both terms. Then, the density ratio is a solution to J (τ). For any other non-negative function τ (x) 0, if it is the optimal solution to J (τ), then, we have D T p γ,µ0 τ ||p τ = 0 p (x ) τ (x ) = T p γ,µ0 τ (x ) = Z Pπ (x |x) τ (x) dx,(15) Ep(x) [τ (x)] 1 2 = 0 Ep(x) [τ (x)] = 1. (16) We denote µ (x) = p (x) τ (x), which is clearly a density function. Then, the optimal conditions in Equation 15 imply µ (x ) = Z Pπ (x |x) µ (x) dx, or equivalently, µ is the stationary distribution of T . We have thus shown the optimal τ (x) = µ(x) p(x) is the target density ratio. A.2 CONVEXITY OF THE OBJECTIVE Proof Since the φ is convex, we consider the Fenchel dual representation of the f-divergence Dφ T p γ,µ0 τ ||p τ , i.e., Dφ T p γ,µ0 τ ||p τ = max f Ω R ℓ(τ, f) := (1 γ) Eµ0π [f (x)] + γETp(x,x ) [τ (x) f (x )] Ep(x) [τ (x) φ (f (x))] . (17) It is obviously ℓ(τ, f) is convex in τ for each f, then, Dφ T p γ,µ0 τ ||p τ is convex. The term λ (Ep (τ) 1)2 is also convex, which concludes the proof. Published as a conference paper at ICLR 2020 Algorithm 1 Gen DICE (with function approximators) Inputs: Convex function φ and its Fenchel conjugate φ , off-policy data D = {(s(i), a(i), r(i), s (i))}N i=1, initial state s0 µ0, target policy π, distribution corrector nnwτ ( , ), nnwf ( , ), constraint scalar u, learning rates ητ, ηf, ηu, number of iterations K, batch size B. for t = 1, . . . , K do Sample batch {(s(i), a(i), r(i), s (i))}B i=1 from D. Sample batch {s(i) 0 }B i=1 from µ0. Sample actions a (i) π(s (i)), for i = 1, . . . , B. Sample actions a(i) 0 π(s(i) 0 ), for i = 1, . . . , B. Compute empirical loss ˆJχ2 (τ, u, f) = (1 γ) Eµ0π [f (s, a)] + γETp [τ (s, a) f (s , a )] Ep τ (s, a) f (s, a) + 1 4f 2 (s, a) + λ Ep [uτ (s, a) u] u2 Update wτ wτ ητ θτ ˆJχ2. Update wf wf + ηf θf ˆJχ2. Update u u + ηu u ˆJχ2. end for Return nnwτ . B ALGORITHM DETAILS We provide the unbiased gradient estimator for wτ ,u,wf J (τ, u, f) in Eqn (14) below: wτ Jχ2 (τ, u, f) = γETp [ wτ τ (s, a) f (s , a )] Ep wτ τ (s, a) f (s, a) + 1 4f 2 (s, a) +λu Ep [ wτ τ (s, a)], (18) u Jχ2 (τ, u, f) = λ (Ep [τ (s, a) 1] u) , (19) wf Jχ2 (τ, u, f) = (1 γ) Eµ0π wf f (s, a) + γETp τ (s, a) wf f (s , a ) τ (s, a) 1 + 1 2f (s, a) wf f (s, a) . Then, we have the psuedo code which applies SGD for solving Eqn (14). C PROOF OF THEOREM 3 For convenience, we repeat here the notation defined in the main text. 
The saddle-point reformulation of the objective function of Gen DICE is: J (τ, u, f) := (1 γ) Eµ0π [f (x )] + γETp(x,x ) [τ (x) f (x )] Ep(x) [τ (x) φ (f (x))] + λ Ep(x) [uτ (x) u] 1 To avoid the numerical infinity in Dφ ( || ), we induced the bounded version as J (τ) := max f C,u J (τ, u, f) = DC φ T p γ,µ0 τ ||p τ + λ 2 Ep(x) [τ (x)] 1 2 , in which DC φ ( || ) is still a valid divergence, and therefore the optimal solution τ is still the sta- tionary density ratio µ(x) p(x) . We denote the b J (τ, µ, f) as the empirical surrogate of J (τ, µ, f) on samples D = (x, x )N i=1 T p γ,µ0 (x, x ) with the optimal solution in (H, F, R) as ˆτ H, bu , bf F . Furthermore, denote τ H = arg min τ H J (τ) , τ = arg min τ S A R J (τ) Published as a conference paper at ICLR 2020 with optimal (f , u ), and L (τ) = max f F,u R J (τ, u, f) , b L (τ) = max f F,u R b J (τ, u, f) . We apply some optimization algorithm for b J (τ, u, f) over space (H, F, R), leading to the output ˆτ, bu, bf . Under Assumption 2, we need only consider τ C, then, the corresponding dual u = Ep (τ) 1 u U := {|u| (C + 1)}. We choose the φ ( ) is a κ-Lipschitz continuous, then, the J (τ, u, f) is a CPπ,κ,λ = max γ Pπ p, + (1 γ) µ0π p p, + κ C, (C + 1) λ + 1 Lipschitz continuous function w.r.t. (f, u) with the norm (f, u) p,1 := R |f (x)| p (x) dx+|u|, and Cφ,C,λ := C + λ (C + 1) + maxt { C,C} ( φ (t)) -Lipschitz continuous function w.r.t. τ with the norm τ p,1 := R |τ (x)| p (x) dx. We consider the error between ˆτ and τ using standard arguments (Shalev-Shwartz & Ben-David, 2014; Bach, 2014), i.e., d (ˆτ, τ ) := J (ˆτ) J (τ ) . The discrepancy d (τ, τ ) 0 and d (τ, τ ) = 0 if and only if p τ is stationary distribution of T in the weak sense of Dφ ( || ). Remark: In some special cases, the suboptimality also implies the distance between ˆτ and τ . Specifically,for γ = 1, if the transition operator Pπ can be represented as Pπ = QΛQ 1 where Q denotes the (countable) eigenfunctions and Λ denotes the diagonal matrix with eigenvalues, the largest of which is 1. We consider φ ( ) as identity and f F := n span (Q) , f p,2 1 o , then the d (τ, τ ) will bounded from below by a metric between τ and τ . Particularly, we have Dφ T p γ,µ0 τ ||p τ = max f F ETp(x,x ) [τ (x) f (x )] Ep(x) [τ (x) f (x)] = τ Pπ τ p,2 . Rewrite τ = ατ + ζ, where ζ span Q\τ , then Dφ T p γ,µ0 τ ||p τ = ατ αPπ τ + ζ Pπ ζ p,2 = ζ Pπ ζ p,2 . Recall the optimality of τ , i.e., Dφ T p γ,µ0 τ ||p τ = 0, we have d (τ, τ ) = J (τ) ζ Pπ ζ p,2 := (τ τ ) p,2,(Pπ I) . C.1 ERROR DECOMPOSITION We start with the following error decomposition: d (ˆτ, τ ) := J (ˆτ) J (τ ) = J (ˆτ) J (ˆτ H) | {z } ϵ1 + J (ˆτ H) J (τ ) | {z } ϵ2 For ϵ1, we have ϵ1 = J (ˆτ) L (ˆτ) + L (ˆτ) L (ˆτ H) + L (ˆτ H) J (ˆτ H) . We consider the terms one-by-one. By definition, we have J (ˆτ) L (ˆτ) = max f,µ J (ˆτ, u, f) max f F,µ J (ˆτ, u, f) CPπ,κ,λ sup f1,u1 U inf f2 F,u2 U (f1, u1) (f2, u2) p,1 | {z } ϵapprox(F) which is induced by introducing F for dual approximation. For the third term L (ˆτ H) J (ˆτ H), we have L (ˆτ H) J (ˆτ H) = max f F,u U J (ˆτ H, u, f) max f,u U J (ˆτ H, u, f) 0. Published as a conference paper at ICLR 2020 For the term L (ˆτ) L (ˆτ H), L (ˆτ) L (ˆτ H) = L (ˆτ) b L (ˆτ) + b L (ˆτ) b L (ˆτ H) | {z } ˆϵopt +b L (ˆτ H) L (ˆτ H) L (τ) b L (τ) + ˆϵopt max f F,u U J (τ, u, f) max f F,u U b J (τ, u, f) + ˆϵopt 2 sup τ H sup f F,u U J (τ, u, f) b J (τ, u, f) + ˆϵopt = 2 ϵest + ˆϵopt, (22) where we define ϵest := supτ H,f F,u U J (τ, u, f) b J (τ, u, f) . 
Therefore, we can now bound $\epsilon_1$ as
$$\epsilon_1 \le C_{\mathcal{P}^\pi,\kappa,\lambda}\,\epsilon_{approx}(F) + 2\epsilon_{est} + \hat\epsilon_{opt}.$$
For $\epsilon_2$, we have
$$\epsilon_2 = J(\hat\tau_H) - J(\tau^*_H) + J(\tau^*_H) - J(\tau^*) = J(\hat\tau_H) - L(\hat\tau_H) + L(\hat\tau_H) - L(\tau^*_H) + L(\tau^*_H) - J(\tau^*_H) + J(\tau^*_H) - J(\tau^*).$$
We consider the terms from right to left. For the term $J(\tau^*_H) - J(\tau^*)$, we have
$$J(\tau^*_H) - J(\tau^*) = J\big(\tau^*_H, \hat u^*, \hat f^*_F\big) - J\big(\tau^*, \hat u^*, \hat f^*_F\big) + \underbrace{J\big(\tau^*, \hat u^*, \hat f^*_F\big) - J(\tau^*, u^*, f^*)}_{\le 0} \le C_{\phi^*,C,\lambda}\underbrace{\sup_{\tau_1}\inf_{\tau_2\in H}\|\tau_1 - \tau_2\|_{p,1}}_{\epsilon_{approx}(H)},$$
which is the error induced by restricting the function space to $H$. The second term is nonpositive, due to the optimality of $(u^*, f^*)$; the final inequality follows from the fact that $J(\tau, u, f)$ is $C_{\phi^*,C,\lambda}$-Lipschitz w.r.t. $\tau$.
For the term $L(\tau^*_H) - J(\tau^*_H)$, by definition,
$$L(\tau^*_H) - J(\tau^*_H) = \max_{f\in F, u\in U} J(\tau^*_H, u, f) - \max_{f, u\in U} J(\tau^*_H, u, f) \le 0.$$
For the term $L(\hat\tau_H) - L(\tau^*_H)$, we have
$$L(\hat\tau_H) - L(\tau^*_H) = L(\hat\tau_H) - \hat L(\hat\tau_H) + \underbrace{\hat L(\hat\tau_H) - \hat L(\tau^*_H)}_{\le 0} + \hat L(\tau^*_H) - L(\tau^*_H) \le 2\sup_{\tau\in H}\big|L(\tau) - \hat L(\tau)\big| \le 2\sup_{\tau\in H, f\in F, u\in U}\big|J(\tau, u, f) - \hat J(\tau, u, f)\big| = 2\epsilon_{est},$$
where the second term is nonpositive thanks to the optimality of $\hat\tau_H$. Finally, for the term $J(\hat\tau_H) - L(\hat\tau_H)$, using the same argument as in Equation 21, we have
$$J(\hat\tau_H) - L(\hat\tau_H) \le C_{\mathcal{P}^\pi,\kappa,\lambda}\,\epsilon_{approx}(F).$$
Therefore, we can bound $\epsilon_2$ by
$$\epsilon_2 \le C_{\phi^*,C,\lambda}\,\epsilon_{approx}(H) + C_{\mathcal{P}^\pi,\kappa,\lambda}\,\epsilon_{approx}(F) + 2\epsilon_{est}.$$
In sum, we have
$$d(\hat\tau, \tau^*) \le 4\epsilon_{est} + \hat\epsilon_{opt} + 2C_{\mathcal{P}^\pi,\kappa,\lambda}\,\epsilon_{approx}(F) + C_{\phi^*,C,\lambda}\,\epsilon_{approx}(H).$$
In the following sections, we bound $\epsilon_{est}$ and $\hat\epsilon_{opt}$.

C.2 STATISTICAL ERROR

In this section, we analyze the statistical error
$$\epsilon_{est} := \sup_{\tau\in H, f\in F, u\in U}\big|J(\tau, u, f) - \hat J(\tau, u, f)\big|.$$
We mainly focus on the batch RL setting with i.i.d. samples $D = \{(x_i, x'_i)\}_{i=1}^N \sim \mathcal{T}p(x, x')$, which has been studied by previous authors (e.g., Sutton et al., 2012; Nachum et al., 2019). However, as discussed in the literature (Antos et al., 2008; Lazaric et al., 2012; Dai et al., 2018; Nachum et al., 2019), using the blocking technique of Yu (1994), the statistical error provided here can be generalized to β-mixing samples in a single sample path. We omit this generalization for the sake of expositional simplicity.

To bound $\epsilon_{est}$, we follow arguments similar to those of Dai et al. (2018); Nachum et al. (2019) via the covering number. For completeness, the definition is given below. Pollard's tail inequality bounds the maximum deviation via the covering number of a function class:

Lemma 4 (Pollard (2012))  Let $G$ be a permissible class of $Z \to [-M, M]$ functions and let $\{Z_i\}_{i=1}^N$ be i.i.d. samples from some distribution. Then, for any given $\epsilon > 0$,
$$\mathbb{P}\left(\sup_{g\in G}\Big|\frac{1}{N}\sum_{i=1}^N g(Z_i) - \mathbb{E}[g(Z)]\Big| > \epsilon\right) \le 8\,\mathbb{E}\!\left[N_1\Big(\frac{\epsilon}{8}, G, \{Z_i\}_{i=1}^N\Big)\right]\exp\!\left(\frac{-N\epsilon^2}{512M^2}\right).$$

The covering number can then be bounded in terms of the function class's pseudo-dimension:

Lemma 5 (Haussler (1995), Corollary 3)  For any set $X$, any points $x_{1:N} \in X^N$, any class $F$ of functions on $X$ taking values in $[0, M]$ with pseudo-dimension $D_F < \infty$, and any $\epsilon > 0$,
$$N_1(\epsilon, F, x_{1:N}) \le e\,(D_F + 1)\left(\frac{2eM}{\epsilon}\right)^{D_F}.$$

The statistical error $\epsilon_{est}$ can be bounded using these lemmas.

Lemma 6 (Stochastic error)  Under Assumption 2, if $\phi^*$ is $\kappa$-Lipschitz continuous and the pseudo-dimensions of $H$ and $F$ are finite, then with probability at least $1 - \delta$,
$$\epsilon_{est} \le \sqrt{\frac{C_2\big(\log N + \log\frac{1}{\delta}\big)}{N}},$$
where $C_2$ is a constant depending on $(C, \gamma, \lambda, \kappa, D_H, D_F)$ specified in the proof.

Proof  The proof works by verifying the conditions in Lemma 4 and computing the covering number. Denote
$$h_{\tau,u,f}(x, x') = (1-\gamma) f(x') + \gamma\,\tau(x) f(x') - \tau(x)\,\phi^*(f(x)) + \lambda u\,\tau(x) - \lambda u - \frac{\lambda}{2}u^2;$$
we will apply Lemma 4 with $Z = \Omega\times\Omega$, $Z_i = (x_i, x'_i)$, and $G = \{h_{\tau,u,f} : \tau\in H, f\in F, u\in U\}$. We first check the boundedness of $h_{\tau,u,f}(x, x')$. Based on Assumption 2, we only consider $\tau \in H$ and $u \in U$ bounded by $C$ and $C+1$, respectively; we also rectify $f$ so that $\|f\|_\infty \le C$.
Then, we can bound $\|h_{\tau,u,f}\|_\infty$:
$$\|h_{\tau,u,f}\|_\infty \le (1 + \|\tau\|_\infty)\|f\|_\infty + \|\tau\|_\infty\max_{t\in[-C,C]}\phi^*(t) + \lambda C(\|\tau\|_\infty + 1) + \lambda C^2 \le (C+1)^2 + C\,C_{\phi^*} + \lambda C(2C+1) =: M,$$
where $C_{\phi^*} = \max_{t\in[-C,C]}\phi^*(t)$. Thus, by Lemma 4, we have
$$\mathbb{P}\left(\sup_{\tau\in H, f\in F, u\in U}\big|\hat J(\tau,u,f) - J(\tau,u,f)\big| > \epsilon\right) = \mathbb{P}\left(\sup_{\tau\in H, f\in F, u\in U}\Big|\frac{1}{N}\sum_{i=1}^N h_{\tau,u,f}(Z_i) - \mathbb{E}[h_{\tau,u,f}]\Big| > \epsilon\right) \le 8\,\mathbb{E}\!\left[N_1\Big(\frac{\epsilon}{8}, G, \{Z_i\}_{i=1}^N\Big)\right]\exp\!\left(\frac{-N\epsilon^2}{512M^2}\right). \quad (23)$$
Next, we check the covering number of $G$. First, we bound the distance in $G$:
$$\frac{1}{N}\sum_{i=1}^N\big|h_{\tau_1,u_1,f_1}(Z_i) - h_{\tau_2,u_2,f_2}(Z_i)\big| \le \big(C + C_{\phi^*} + \lambda(C+1)\big)\frac{1}{N}\sum_{i=1}^N|\tau_1(x_i) - \tau_2(x_i)| + (1 + \gamma C)\frac{1}{N}\sum_{i=1}^N|f_1(x'_i) - f_2(x'_i)| + \kappa C\,\frac{1}{N}\sum_{i=1}^N|f_1(x_i) - f_2(x_i)| + \lambda(2C+1)|u_1 - u_2|,$$
which leads to
$$N_1\big((C_{\phi^*} + (3\lambda + 2 + \gamma + \kappa)(C+1))\,\epsilon',\, G,\, \{Z_i\}_{i=1}^N\big) \le N_1\big(\epsilon', H, (x_i)_{i=1}^N\big)\,N_1\big(\epsilon', F, (x'_i)_{i=1}^N\big)\,N_1\big(\epsilon', F, (x_i)_{i=1}^N\big)\,N_1(\epsilon', U).$$
For the set $U = [-C-1, C+1]$, we have $N_1(\epsilon', U) \le \frac{2C+2}{\epsilon'}$. Denoting the pseudo-dimensions of $H$ and $F$ by $D_H$ and $D_F$, respectively, we have
$$N_1\big((C_{\phi^*} + (3\lambda + 2 + \gamma + \kappa)(C+1))\,\epsilon',\, G,\, \{Z_i\}_{i=1}^N\big) \le e^3(D_H+1)(D_F+1)^2\,\frac{2C+2}{\epsilon'}\left(\frac{2eC}{\epsilon'}\right)^{D_H}\left(\frac{2eC}{\epsilon'}\right)^{2D_F},$$
which implies
$$N_1\Big(\frac{\epsilon}{8}, G, \{Z_i\}_{i=1}^N\Big) \le C_1\Big(\frac{1}{\epsilon}\Big)^{D_1},$$
where $D_1 = D_H + D_F + 1$ and $C_1 = 2C\,e^2(D_H+1)(D_F+1)^2\big(32\,(C_{\phi^*} + (3\lambda + 2 + \gamma + \kappa)(C+1))\,eC\big)^{D_1}$. Combining this result with Equation 23, we obtain the bound on the statistical error:
$$\mathbb{P}\left(\sup_{\tau\in H, f\in F, u\in U}\big|\hat J(\tau,u,f) - J(\tau,u,f)\big| > \epsilon\right) \le 8C_1\Big(\frac{1}{\epsilon}\Big)^{D_1}\exp\!\left(\frac{-N\epsilon^2}{512M^2}\right).$$
Setting $\epsilon = \sqrt{\frac{C_2(\log N + \log\frac{1}{\delta})}{N}}$ with $C_2 = \max\big\{(8C_1)^{2/D_1}, 512M^2 D_1, 512M^2, 1\big\}$, we obtain the claimed bound with probability at least $1 - \delta$.

C.3 OPTIMIZATION ERROR

In this section, we investigate the optimization error
$$\hat\epsilon_{opt} := \hat L(\hat\tau) - \hat L(\hat\tau_H).$$
Since our estimator $\min_{\tau\in H}\max_{f\in F, u\in U}\hat J(\tau, u, f)$ is compatible with different parametrizations of $(H, F)$ and different optimization algorithms, the optimization error differs from case to case. For general neural network parametrizations of $(\tau, f)$, although there has been recent progress on convergence to a stationary point or a local minimum (Lin et al., 2018; Jin et al., 2019; Lin et al., 2019), quantifying the optimization error remains a largely open problem that is beyond the scope of this paper. Here, we mainly discuss the convergence rate for tabular, linear, and kernel parametrizations of $(\tau, f)$.

In particular, we consider the linear parametrization $\tau(x) = w_\tau^\top\psi_\tau(x)$, with $w_\tau$ such that $w_\tau^\top\psi_\tau(x) \ge 0$, and $f(x) = w_f^\top\psi_f(x)$. With such a parametrization, $\hat J(\tau, u, f)$ is still convex-concave w.r.t. $(w_\tau, w_f, u)$. We can bound $\hat\epsilon_{opt}$ by the primal-dual gap $\epsilon_{gap}$:
$$\hat\epsilon_{opt} = \hat L(\hat\tau) - \hat L(\hat\tau_H) \le \max_{f\in F, u\in U}\hat J(\hat\tau, u, f) - \hat J\big(\hat\tau_H, \hat u^*, \hat f^*_F\big) + \hat J\big(\hat\tau_H, \hat u^*, \hat f^*_F\big) - \min_{\tau\in H}\hat J\big(\tau, \hat u, \hat f\big) = \underbrace{\max_{f\in F, u\in U}\hat J(\hat\tau, u, f) - \min_{\tau\in H}\hat J\big(\tau, \hat u, \hat f\big)}_{\epsilon_{gap}}.$$
With vanilla SGD, we have $\mathbb{E}[\epsilon_{gap}] = O\big(\tfrac{1}{\sqrt{T}}\big)$, where $T$ is the number of optimization steps (Nemirovski et al., 2009). Therefore, $\epsilon_{opt} := \mathbb{E}[\hat\epsilon_{opt}] = O\big(\tfrac{1}{\sqrt{T}}\big)$, where $\mathbb{E}[\cdot]$ is taken w.r.t. the randomness in SGD.
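To make the convex-concave case concrete, the following is a minimal NumPy sketch of primal-dual SGD under the linear parametrization, using single-sample estimates of the gradient estimators from Appendix B. The feature maps `psi_tau` and `psi_f`, the data format, and the step-size schedule are illustrative assumptions, not the exact configuration analyzed in the proof.

```python
import numpy as np

# Saddle-point SGD for the empirical chi^2 objective with linear features
# tau(x) = w_tau . psi_tau(x), f(x) = w_f . psi_f(x). `data` is assumed to be
# a list of (x0, x, x_next) triples with x0 ~ mu_0*pi and (x, x_next) ~ T p.

def saddle_point_sgd(data, psi_tau, psi_f, d_tau, d_f,
                     gamma=0.99, lam=1.0, lr=0.01, num_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    w_tau, w_f, u = np.zeros(d_tau), np.zeros(d_f), 0.0
    for t in range(num_steps):
        x0, x, x_next = data[rng.integers(len(data))]
        pt, pf, pf_next, pf0 = psi_tau(x), psi_f(x), psi_f(x_next), psi_f(x0)
        tau = max(w_tau @ pt, 0.0)            # keep the ratio non-negative
        f, f_next, f0 = w_f @ pf, w_f @ pf_next, w_f @ pf0

        # Single-sample stochastic gradients of J_chi2 (cf. Appendix B).
        g_tau = (gamma * f_next - (f + 0.25 * f ** 2) + lam * u) * pt
        g_f = ((1 - gamma) * pf0 + gamma * tau * pf_next
               - tau * (1 + 0.5 * f) * pf)
        g_u = lam * ((tau - 1.0) - u)

        step = lr / np.sqrt(t + 1)            # decaying step size, O(1/sqrt(T)) gap
        w_tau -= step * g_tau                 # descend in the primal variable
        w_f += step * g_f                     # ascend in the dual variables
        u += step * g_u
    return w_tau, w_f, u
```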
C.4 COMPLETE ERROR ANALYSIS

We are now ready to state the main theorem in a precise way:

Theorem 3  Suppose Assumptions 1 and 2 hold, so that the stationary distribution $\mu$ exists, i.e., $\max_{f\in F}\mathbb{E}_{\mathcal{T}\circ\mu}[f] - \mathbb{E}_\mu[\phi^*(f)] = 0$. If $\phi^*(\cdot)$ is $\kappa$-Lipschitz continuous, $\|f\|_\infty \le C < \infty$ for all $f \in F$, and the pseudo-dimensions of $H$ and $F$ are finite, then the error between the GenDICE estimate $\hat\tau$ and $\tau^*(x) = \frac{\mu(x)}{p(x)}$ is bounded by
$$\mathbb{E}[J(\hat\tau) - J(\tau^*)] = \tilde O\left(\epsilon_{approx}(F, H) + \sqrt{\frac{\log N}{N}} + \epsilon_{opt}\right),$$
where $\mathbb{E}[\cdot]$ is taken w.r.t. the randomness in the sample $D$ and in the optimization algorithm, $\epsilon_{opt}$ is the optimization error, and $\epsilon_{approx}(F, H)$ is the approximation error induced by the parametrization of $(\tau, f)$ with $(H, F)$.

Proof  We have the total error
$$\mathbb{E}[J(\hat\tau) - J(\tau^*)] \le 4\,\mathbb{E}[\epsilon_{est}] + \mathbb{E}[\hat\epsilon_{opt}] + \epsilon_{approx}(F, H), \quad (25)$$
where $\epsilon_{approx}(F, H) := 2C_{\mathcal{P}^\pi,\kappa,\lambda}\,\epsilon_{approx}(F) + C_{\phi^*,C,\lambda}\,\epsilon_{approx}(H)$. For $\epsilon_{opt}$, we can apply the results for SGD in Appendix C.3. We can bound $\mathbb{E}[\epsilon_{est}]$ by Lemma 6: specifically,
$$\mathbb{E}[\epsilon_{est}] \le (1-\delta)\sqrt{\frac{C_2\big(\log N + \log\frac{1}{\delta}\big)}{N}} + 2\delta M,$$
and by setting $\delta = \frac{1}{N}$ we obtain $\mathbb{E}[\epsilon_{est}] = \tilde O\big(\sqrt{\tfrac{\log N}{N}}\big)$. Plugging these bounds into Equation 25, we reach the conclusion.

D EXPERIMENTAL SETTINGS

D.1 TABULAR CASE

For the Taxi domain, we follow the same protocol as used in Liu et al. (2018). The behavior and target policies are also taken from Liu et al. (2018) (referred to in their work as the behavior policy for α = 0). We use a similar Taxi domain, where a grid size of 5 × 5 yields 2000 states in total (25 × 16 × 5, corresponding to 25 taxi locations, 16 passenger appearance statuses, and 5 taxi statuses). We set our target policy as the final policy π obtained after running Q-learning (Sutton & Barto, 1998) for 1000 iterations, and set another policy π+ obtained after 950 iterations as our base policy. The behavior policy is a mixture policy controlled by α as π_b = (1 − α)π + απ+, i.e., the larger α is, the closer the behavior policy is to the target policy. In this setting, we solve for the optimal stationary ratio τ exactly using matrix operations. Since Liu et al. (2018) perform a similar exact solve for |S| variables µ(s), for better comparison we also perform our exact solve with respect to |S| variables τ(s). Specifically, the final objective of importance sampling will require knowledge of the importance weights µ(a|s)/p(a|s).

For offline PageRank, the graph statistics are given in Table 1, and the degree statistics and graph visualizations are shown in Figure 8.

Table 1: Statistics of different graphs.
Dataset       Number of Nodes   Number of Edges
BA (Small)    100               400
BA (Large)    500               2000
Cora          2708              5429
Citeseer      3327              4731

The Barabási–Albert (BA) graph begins with an initial connected network of m0 nodes. Each new node is connected to m ≤ m0 existing nodes with a probability proportional to the number of links the existing nodes already have. Intuitively, heavily linked nodes ("hubs") tend to quickly accumulate even more links, while nodes with only a few links are unlikely to be chosen as the destination for a new link; new nodes therefore prefer to attach themselves to already heavily linked nodes. The two real-world graphs are built upon real-world citation networks. In our experiments, the weights of the BA graph are randomly drawn from a standard Gaussian distribution, with normalization to ensure the transition matrix is valid. The off-line data is collected by a random walker on the graph, recording the current state and next state along a single trajectory. In the experiments, we vary the number of off-policy samples to validate the effectiveness of GenDICE when only limited off-line samples are provided.

Figure 8: Degree statistics and visualization of different graphs.
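The following is a minimal sketch of the offline PageRank data pipeline just described: build a BA graph, attach Gaussian edge weights normalized into a row-stochastic transition matrix, and collect (state, next state) pairs from a single random walk. The graph sizes, the uniform initial state, and the use of absolute values to make the Gaussian weights nonnegative are illustrative assumptions.

```python
import networkx as nx
import numpy as np

def build_ba_transition_matrix(n_nodes=100, m=4, seed=0):
    """BA graph with Gaussian edge weights, row-normalized into a transition matrix."""
    rng = np.random.default_rng(seed)
    graph = nx.barabasi_albert_graph(n_nodes, m, seed=seed)
    adj = nx.to_numpy_array(graph)
    # Gaussian weights on existing edges, made nonnegative so that rows can be
    # normalized into valid conditional distributions.
    weights = np.abs(rng.standard_normal(adj.shape)) * adj
    weights += np.eye(n_nodes) * 1e-6            # tiny self-loop mass keeps rows normalizable
    return weights / weights.sum(axis=1, keepdims=True)

def collect_random_walk(P, num_samples=10_000, seed=0):
    """Off-line data: (state, next_state) pairs along one random-walk trajectory."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    s = rng.integers(n)                          # initial state drawn uniformly here
    data = []
    for _ in range(num_samples):
        s_next = rng.choice(n, p=P[s])
        data.append((s, s_next))
        s = s_next                               # continue the same trajectory
    return data

P = build_ba_transition_matrix()
offline_data = collect_random_walk(P)
```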
D.2 CONTINUOUS CASE

We use the Cartpole, Reacher, and HalfCheetah tasks as given by OpenAI Gym. For importance sampling, we learn a neural network policy via behavior cloning and use its probabilities for computing the importance weights π(a|s)/π_b(a|s). All neural networks are feed-forward with two hidden layers of dimension 64 and tanh activations.

Discrete Control Tasks  We modify the Cartpole task to be infinite horizon: we use the same dynamics as in the original task but change the reward to be −1 if the original task returns a termination (when the pole falls below some threshold) and 1 otherwise. We train a policy on this task with standard Deep Q-Learning (Mnih et al., 2013) until convergence. We then define the target policy π as a weighted combination of this pre-trained policy (weight 0.7) and a uniformly random policy (weight 0.3). The behavior policy π_b for a specific 0 ≤ α ≤ 1 is taken to be a weighted combination of the pre-trained policy (weight 0.55 + 0.15α) and a uniformly random policy (weight 0.45 − 0.15α). We train each stationary distribution correction estimation method using the Adam optimizer with batches of size 2048 and a learning rate chosen by a hyperparameter search over {0.0001, 0.0003, 0.001, 0.003}; the best value found was 0.0003.

Continuous Control Tasks  For the Reacher task, we train a deterministic policy until convergence via DDPG (Lillicrap et al., 2015). We define the target policy π as a Gaussian with mean given by the pre-trained policy and standard deviation 0.1. The behavior policy π_b for a specific 0 ≤ α ≤ 1 is taken to be a Gaussian with mean given by the pre-trained policy and standard deviation 0.4 − 0.3α. We train each stationary distribution correction estimation method using the Adam optimizer with batches of size 2048 and a learning rate chosen by a hyperparameter search over {0.0001, 0.0003, 0.001, 0.003}; the best value found was 0.003. For the HalfCheetah task, we also train a deterministic policy until convergence via DDPG (Lillicrap et al., 2015). We define the target policy π as a Gaussian with mean given by the pre-trained policy and standard deviation 0.1. The behavior policy π_b for a specific 0 ≤ α ≤ 1 is taken to be a Gaussian with mean given by the pre-trained policy and standard deviation 0.2 − 0.1α. We train each stationary distribution correction estimation method using the Adam optimizer with batches of size 2048 and a learning rate chosen by a hyperparameter search over {0.0001, 0.0003, 0.001, 0.003}; the best value found was 0.003.
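As a sketch of the continuous-control policy construction just described, the snippet below builds Gaussian target and behavior policies around a pre-trained deterministic actor. The `pretrained_policy` callable and the action bounds are assumptions for illustration.

```python
import numpy as np

def make_gaussian_policy(pretrained_policy, std, low=-1.0, high=1.0, seed=0):
    """Gaussian policy centered at a pre-trained deterministic policy (e.g., a DDPG actor)."""
    rng = np.random.default_rng(seed)
    def policy(state):
        mean = np.asarray(pretrained_policy(state))
        action = mean + std * rng.standard_normal(mean.shape)
        return np.clip(action, low, high)       # illustrative action bounds
    return policy

def reacher_policies(pretrained_policy, alpha):
    """Target: std 0.1; behavior: std 0.4 - 0.3 * alpha, for alpha in [0, 1]."""
    target = make_gaussian_policy(pretrained_policy, std=0.1)
    behavior = make_gaussian_policy(pretrained_policy, std=0.4 - 0.3 * alpha)
    return target, behavior
```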
E ADDITIONAL EXPERIMENTS

E.1 OPE FOR DISCRETE CONTROL

On the discrete control task, we modify the Cartpole task to be infinite horizon: the original dynamics are used, but with a modified reward function: the agent receives a reward of −1 if the environment returns a termination (i.e., the pole falls below some threshold) and 1 otherwise. As shown in Figure 3, our method is competitive with IS and Model-Based in terms of the estimated average reward, but eventually outperforms both in terms of log MSE loss. In particular, it is relatively difficult to fit a single policy to data collected by multiple policies, which explains the poor performance of IS.

Figure 9: Results on Cartpole. Each plot in the first row shows the estimated average step reward over training and different behavior policies (higher α corresponds to a behavior policy closer to the target policy; the same applies in other figures); M1: α = [0.0, 0.33]; M2: α = [0.0, 0.33, 0.66].

E.2 ADDITIONAL RESULTS ON CONTINUOUS CONTROL

In this section, we show more results on the continuous control tasks, i.e., HalfCheetah and Reacher. Figure 10 shows the log MSE over training steps, where GenDICE outperforms the other baselines under different behavior policies. Figure 11 further illustrates how our method beats the other baselines and accurately estimates the reward of the target policy, and Figure 12 shows that GenDICE gives a better reward estimate of the target policy. In these figures, the left three panels show the performance with an off-policy dataset collected by a single behavior policy, ordered from more difficult to easier tasks; the right two panels show the results where the off-policy dataset is collected by multiple behavior policies.

Figure 13 shows the ablation study results in terms of estimated rewards. The left two panels show the effect of different learning rates. When α = 0.33, i.e., the OPE task is relatively easy, GenDICE obtains good results under all learning-rate settings; however, when α = 0.0, i.e., the estimation becomes more difficult, only GenDICE with a larger learning rate obtains a reasonable estimate. Interestingly, performance improves with larger learning rates, and when the learning rate is 0.001 with α = 0.0, the variance is very high, indicating that the estimate is accurate only in some runs. The right three panels show different activation functions under different behavior policies. The square and softplus functions work well, while the exponential function shows poor performance under some settings. In practice, we use the square function because of its lower variance and better performance in most cases.

Figure 10: Results on Reacher. Each plot in the first row shows the estimated average step reward over training and different behavior policies (higher α corresponds to a behavior policy closer to the target policy; the same applies in other figures).

Figure 11: Results on Reacher. Each plot in the first row shows the estimated average step reward over training and different behavior policies (higher α corresponds to a behavior policy closer to the target policy; the same applies in other figures).

Figure 12: Results on HalfCheetah. Each plot in the first row shows the estimated average step reward over training and different behavior policies (higher α corresponds to a behavior policy closer to the target policy). Compared methods: Model-Based, Importance Sampling, DualDICE, GenDICE (ours), and Oracle.

Figure 13: Results of the ablation study with different learning rates and activation functions. The plots show the estimated average step reward over training and different behavior policies.

E.3 COMPARISON WITH THE SELF-NORMALIZATION TRICK

The self-normalization trick used in Liu et al. (2018) encodes the normalization constraint directly in τ, whereas GenDICE handles the constraint through a principled optimization technique. Moreover, the self-normalization trick has several theoretical disadvantages, in both statistical and computational respects:

i) It will generally not produce an unbiased solution. Although $\frac{1}{|D|}\sum_{(s,a)\in D}\tau(s,a)$ is an unbiased estimator of $\mathbb{E}[\tau]$, the plug-in estimator $\frac{\tau(s,a)}{\frac{1}{|D|}\sum_{(s,a)\in D}\tau(s,a)}$ is biased for $\frac{\tau(s,a)}{\mathbb{E}[\tau]}$.

ii) It induces additional computational cost.
Specifically, the self-normalized ratio takes the form $\frac{\tau(s,a)}{\frac{1}{|D|}\sum_{(s,a)\in D}\tau(s,a)}$, which requires going through all samples in the training set $D$ even to estimate a single stochastic gradient, and is thus prohibitive for large datasets.

Empirically, self-normalization was the most natural and the first idea we tried during this project, and we have some empirical results for this method in the OPR setting. Despite its additional computational cost, it performs worse than the regularization technique used in the current version of GenDICE. Table 2 shows a comparison between self-normalization and regularization on OPR with the χ²-divergence, on a BA graph with 100 nodes, 10,000 off-line samples, and 20 trials; both algorithms are stopped under the same running-time budget.

Table 2: Comparison between regularization and self-normalization.
Method                log KL-divergence
self-normalization    −4.26 ± 0.157
regularization        −4.74 ± 0.163
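The snippet below illustrates the computational point (ii): a single stochastic update with the self-normalized ratio needs the normalizer averaged over the entire dataset, whereas the penalty used by GenDICE can be estimated from the current minibatch alone. The array sizes and the stand-in values for τ are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
tau_values = np.abs(rng.standard_normal(100_000)) + 0.5   # stand-in for tau(s, a) over D
batch_idx = rng.integers(len(tau_values), size=256)       # one minibatch

# Self-normalized ratio: the normalizer is a full pass over D at every step.
normalizer = tau_values.mean()                             # O(|D|) per update
self_normalized_batch = tau_values[batch_idx] / normalizer

# GenDICE-style penalty lambda * (E_p[tau] - 1)^2: estimated from the minibatch.
# (In Algorithm 1 the square is handled through the auxiliary dual variable u,
# which keeps the stochastic gradient unbiased.)
lam = 1.0
batch_mean = tau_values[batch_idx].mean()                  # O(B) per update
penalty = lam * (batch_mean - 1.0) ** 2
```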