# Batch Stationary Distribution Estimation

Junfeng Wen * 1, Bo Dai * 2, Lihong Li 2, Dale Schuurmans 1 2

*Equal contribution. 1Department of Computing Science, University of Alberta, Edmonton, Canada. 2Google Brain. Correspondence to: Junfeng Wen. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We consider the problem of approximating the stationary distribution of an ergodic Markov chain given a set of sampled transitions. Classical simulation-based approaches assume access to the underlying process so that trajectories of sufficient length can be gathered to approximate stationary sampling. Instead, we consider an alternative setting where a fixed set of transitions has been collected beforehand, by a separate, possibly unknown procedure. The goal is still to estimate properties of the stationary distribution, but without additional access to the underlying system. We propose a consistent estimator that is based on recovering a correction ratio function over the given data. In particular, we develop a variational power method (VPM) that provides provably consistent estimates under general conditions. In addition to unifying a number of existing approaches from different subfields, we also find that VPM yields significantly better estimates across a range of problems, including queueing, stochastic differential equations, post-processing MCMC, and off-policy evaluation.

1. Introduction

Markov chains are a pervasive modeling tool in applied mathematics, of particular importance in stochastic modeling and machine learning. A key property of an ergodic Markov chain is the existence of a unique stationary distribution, i.e., the long-run distribution of states that remains invariant under the transition kernel. In this paper, we consider a less well studied but still important version of the stationary distribution estimation problem, where one has access to a set of sampled transitions from a given Markov chain, but does not know the mechanism by which the probe points were chosen, nor is able to gather additional data from the underlying process. Nevertheless, one would still like to estimate target properties of the stationary distribution, such as the expected value of a random variable of interest.

This setting is inspired by many practical scenarios where sampling from the Markov process is costly or unavailable, but data has already been collected and is available for analysis. A simple example is a queueing system consisting of a service desk that serves customers in a queue. The queue length changes stochastically as customers arrive or leave after being served. The long-term distribution of queue length (i.e., the stationary distribution of the underlying Markov chain) is the object of central interest for managing such a service (Haviv, 2009; Serfozo, 2009). In practice, however, queue lengths are physical quantities that can only be measured for moderate periods, perhaps on separate occasions, but rarely for sufficient time to ensure the (stochastic) queue length has reached the stationary distribution. Since the measurement process itself is expensive, it is essential to make reasonable inferences about the stationary distribution from the collected data alone. We investigate methods for estimating properties of the stationary distribution solely from a batch of previously collected data.
The key idea is to first estimate a correction ratio function over the given data, which can then be used to estimate expectations of interest with respect to the stationary distribution. To illustrate, consider an ergodic Markov chain with state space X, transition kernel T, and a unique stationary distribution µ that satisfies

$$\mu(x') = \int T(x'|x)\,\mu(x)\,dx =: (T\mu)(x'). \tag{1}$$

Assume we are given a fixed sample of state transitions, D = {(x_i, x'_i)}_{i=1}^n ~ T(x'|x) p(x), such that each x has been sampled according to an unknown probe distribution p, while each x' has been sampled according to the true underlying transition kernel, x'|x ~ T(x'|x). Below we investigate procedures for estimating the point-wise ratios, τ̂(x_i) ≈ µ(x_i)/p(x_i), such that the weighted empirical distribution µ̂(x) = (1/n) Σ_{i=1}^n τ̂(x_i) I{x = x_i} can be used to approximate µ directly, or further used to estimate the expected value of some target function(s) of x with respect to µ. Crucially, the approach we propose does not require knowledge of the probe distribution p, nor does it require additional access to samples drawn from the transition kernel T, yet we will be able to establish consistency of the estimation strategy under general conditions. In addition to developing the fundamental approach, we demonstrate its applicability and efficacy in a range of important scenarios beyond queueing, including:

Stochastic differential equations (SDEs)  SDEs are an essential modeling tool in many fields, such as statistical physics (Kadanoff, 2000), finance (Oksendal, 2013) and molecular dynamics (Liu, 2001). An autonomous SDE describes the instantaneous change of a random variable X by

$$dX = f(X)\,dt + \sigma(X)\,dW, \tag{2}$$

where f(X) is a drift term, σ(X) a diffusion term, and W the Wiener process. Given data D = {(x_i, x'_i)}_{i=1}^n such that x ~ p(x) is drawn from an unknown probe distribution and x' is the next state after a small time step according to (2), we consider the problem of estimating quantities of the stationary distribution µ when one exists.

Off-policy evaluation (OPE)  Another important application is behavior-agnostic off-policy evaluation (Nachum et al., 2019) in reinforcement learning (RL). Consider a Markov decision process (MDP) specified by M = ⟨S, A, P, R⟩, such that S and A are the state and action spaces, P is the transition function, and R is the reward function (Puterman, 2014). Given a policy π that maps s ∈ S to a distribution over A, a random trajectory can be generated starting from an initial state s_0: (s_0, a_0, r_0, s_1, a_1, r_1, ...), where a_t ~ π(·|s_t), s_{t+1} ~ P(·|s_t, a_t) and r_t ~ R(s_t, a_t). The value of a policy is defined to be its long-term average per-step reward:

$$\rho(\pi) := \lim_{T\to\infty} \mathbb{E}\Big[\tfrac{1}{T}\textstyle\sum_{t=0}^{T-1} r_t\Big] = \mathbb{E}_{(s,a)\sim d_\pi}[R(s,a)],$$

where d_π denotes the limiting distribution over states S of the Markov process induced by π, and (s, a) ~ d_π is shorthand for s ~ d_π, a ~ π(·|s). In behavior-agnostic off-policy evaluation, one is given a target policy π and a set of transitions D = {(s_i, a_i, r_i, s'_i)}_{i=1}^n ~ P(s'|s,a) p(s,a), potentially generated by multiple behavior policies. From such data, an estimate for ρ(π) can be formed in terms of a stationary ratio estimator:

$$\rho(\pi) = \mathbb{E}_{(s,a)\sim p}\Big[\tfrac{d_\pi(s)\,\pi(a|s)}{p(s,a)}\, r(s,a)\Big] \approx \tfrac{1}{n}\textstyle\sum_{i=1}^n \hat\tau(s_i, a_i)\, r_i. \tag{3}$$

We refer the interested readers to Section 5.4 and Appendix C for further discussion.
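To make the role of the correction ratio concrete, the following minimal sketch (not the paper's code; all names and numbers are illustrative placeholders) shows how ratio estimates τ̂(x_i) over a batch drawn from the probe distribution p can be turned into a weighted estimate of an expectation under µ, e.g. the reward estimator in (3):

```python
import numpy as np

def stationary_expectation(f_values, tau_hat, self_normalize=True):
    """Estimate E_mu[f] from values f(x_i) and ratio estimates tau_hat(x_i), with x_i ~ p."""
    f_values = np.asarray(f_values, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    if self_normalize:
        # Self-normalized variant: robust to imperfect normalization of tau_hat.
        return np.sum(tau_hat * f_values) / np.sum(tau_hat)
    # Plain estimator (1/n) * sum_i tau_hat(x_i) f(x_i), cf. Eq. (3) with f = reward.
    return np.mean(tau_hat * f_values)

# Toy usage with made-up numbers: three probe samples and hypothetical ratio estimates.
rewards = [1.0, 0.0, 2.0]   # f(x_i), e.g. observed rewards r(s_i, a_i)
tau_hat = [0.5, 1.5, 1.0]   # stand-in for learned ratio estimates
print(stationary_expectation(rewards, tau_hat))
```

The self-normalized form divides by the empirical mean of the weights instead of assuming E_p[τ̂] = 1 exactly, which is a common stabilization when the learned ratio is only approximately normalized.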
For the remainder of the paper, we outline four main contributions. First, we generalize the classical power iteration method to obtain an algorithm, the Variational Power Method (VPM), that can work with arbitrary parametrizations in a functional space, allowing for a flexible yet practical approach. Second, we prove the consistency and convergence of VPM. Third, we illustrate how a diverse set of stationary distribution estimation problems, including those above, can be addressed by VPM in a unified manner. Finally, we demonstrate empirically that VPM significantly improves estimation quality in a range of applications, including queueing, sampling, SDEs and OPE.

2. Variational Power Method

To develop our approach, first recall the definition of T and µ in (1). We make the following assumption about T and µ throughout the paper.

Assumption 1 (ergodicity)  The transition operator T has a unique stationary distribution, denoted µ.

Conditions under which this assumption holds are mild, and have been extensively discussed in standard textbooks (Meyn et al., 2009; Levin and Peres, 2017). Next, to understand the role of the probe distribution p, note that we can always rewrite the stationary distribution as µ = τp (i.e., µ(x) = p(x) τ(x), hence τ(x) = µ(x)/p(x)), provided the following assumption holds.

Assumption 2 (absolute continuity)  The stationary distribution µ is absolutely continuous w.r.t. p. That is, there exists C < ∞ such that ‖τ‖_∞ ≤ C.

Assumption 2 follows previous work (Liu and Lee, 2017; Nachum et al., 2019), and is common in density ratio estimation (Sugiyama et al., 2008; Gretton et al., 2009) and off-policy evaluation (Wang et al., 2017; Xie et al., 2019). Combining these two assumptions, definition (1) yields

$$\int T(x'|x)\,\mu(x)\,dx = \int T(x'|x)\,p(x)\,\frac{\mu(x)}{p(x)}\,dx = \int T_p(x, x')\,\tau(x)\,dx,$$

where T_p(x, x') := T(x'|x) p(x), which implies

$$p(x')\,\tau(x') = \int T_p(x, x')\,\tau(x)\,dx =: (T_p \tau)(x'). \tag{4}$$

This development reveals how, under the two stated assumptions, there is sufficient information to determine the unique ratio function τ that ensures τp = µ in principle. Given such a function τ, we can then base inferences about µ solely on data sampled from p and τ.

2.1. Variational Power Iteration

To develop a practical algorithm for recovering τ from the constraint (4), in function space, we first consider the classical power method for recovering the µ that satisfies (1). From (1) it can be seen that the stationary distribution µ is an eigenfunction of T. Moreover, it is the principal eigenfunction, corresponding to the largest eigenvalue λ1 = 1. In the simpler case of finite X, the vector µ is the principal (right) eigenvector of the transposed transition matrix. A standard approach to computing µ is then the power method:

$$\mu_{t+1} = T\mu_t, \tag{5}$$

whose iterates converge to µ at a rate linear in |λ2|, where λ2 is the second largest eigenvalue of T. For ergodic Markov chains, one has |λ2| < 1 (Meyn et al., 2009, Chap. 20).

Our initial aim is to extend this power iteration approach to the constraint (4) without restricting the domain X to be finite. This can be naturally achieved by the update

$$\tau_{t+1} = \frac{T_p \tau_t}{p}, \tag{6}$$

where the division is element-wise. Clearly the fixed point of (6) corresponds to the solution of (4) under the two assumptions stated above. Furthermore, just as for µ_t in (5), τ_t in (6) also converges to τ at a linear rate for finite X. Unfortunately, the update (6) cannot be used directly in a practical algorithm for two important reasons. First, we do not have a point-wise evaluator for T_p, but only samples from T_p. Second, the operator T_p is applied to a function τ_t, which typically involves an intractable integral over X in general. To overcome these issues, we propose a variational method that considers a series of reformulated problems whose optimal solutions correspond to the updates (6).
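Before turning to the variational reformulation, the following sketch (illustrative only, not the paper's code) runs the idealized ratio update (6) on a small finite chain where T and p are known, and checks that p·τ_t converges to µ; the batch setting below must reproduce this behavior without ever evaluating T_p or p point-wise:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ergodic chain: row-stochastic T[x, x'] = T(x' | x).
n = 5
T = rng.random((n, n)) + 0.1
T /= T.sum(axis=1, keepdims=True)

# Probe distribution p (arbitrary, strictly positive).
p = rng.random(n) + 0.1
p /= p.sum()

# Idealized ratio power iteration (6): tau_{t+1} = (T_p tau_t) / p, with T_p(x, x') = T(x'|x) p(x).
tau = np.ones(n)                      # tau_0 = 1, already normalized: E_p[tau_0] = 1
for _ in range(200):
    tau = (T.T @ (p * tau)) / p       # (T_p tau)(x') = sum_x T(x'|x) p(x) tau(x)

# Ground-truth stationary distribution via eigen-decomposition, for comparison.
evals, evecs = np.linalg.eig(T.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

print(np.max(np.abs(p * tau - mu)))   # p(x) * tau(x) should match mu(x)
```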
To begin to develop a practical variational approach, first note that (6) operates directly on the density ratio, which implies the density ratio estimation techniques of Nguyen et al. (2008) and Sugiyama et al. (2012) can be applied. Let φ be a lower semicontinuous, convex function satisfying φ(1) = 0, and consider the induced f-divergence in its variational form,

$$D_\phi(q\,\|\,p) = \sup_{u}\; \mathbb{E}_{q}[u] - \mathbb{E}_{p}[\phi^*(u)], \tag{7}$$

where φ*(x) = sup_{y∈R} xy − φ(y) is the conjugate function of φ. The key property of this formulation is that for any distributions p and q, the inner optimum u in (7) satisfies ∂φ*(u) = q/p (Nguyen et al., 2008); that is, the optimum in (7) can be used to directly recover the distribution ratio. To apply this construction to our setting, first consider solving a problem of the following form in the dual space:

$$u_{t+1} = \arg\min_{u}\; \mathbb{E}_{p(x')}\big[\phi^*(u(x'))\big] - \mathbb{E}_{T_p(x,x')}\big[\partial\phi^*(u_t(x))\,u(x')\big] \tag{8}$$

$$\qquad\; = \arg\min_{u}\; \mathbb{E}_{p(x')}\big[\phi^*(u(x'))\big] - \mathbb{E}_{T_p(x,x')}\big[\tau_t(x)\,u(x')\big], \tag{9}$$

where to achieve (9) we have applied the inductive assumption that τ_t = ∂φ*(u_t). Then, by the optimality property of u_{t+1}, we know that the solution u_{t+1} must satisfy

$$\partial\phi^*(u_{t+1}) = \frac{T_p \tau_t}{p} = \tau_{t+1}, \tag{10}$$

hence the updated ratio τ_{t+1} in (6) can be directly recovered from the dual solution u_{t+1}, while also retaining the inductive property that τ_{t+1} = ∂φ*(u_{t+1}) for the next iteration. These developments can be further simplified by considering the specific choice φ(τ) = (τ − 1)², which leads to φ*(u) = u + u²/4 and τ = ∂φ*(u) = 1 + u/2, so that

$$\tau_{t+1} = \arg\min_{\tau}\; \tfrac{1}{2}\,\mathbb{E}_{p(x')}\big[\tau(x')^2\big] - \mathbb{E}_{T_p(x,x')}\big[\tau_t(x)\,\tau(x')\big]. \tag{11}$$

Crucially, this variational update (11) determines the same update as (6), but overcomes the two aforementioned difficulties. First, it bypasses the direct evaluation of T_p and p, and allows these to be replaced by unbiased estimates of expectations extracted from the data. Second, it similarly bypasses the intractability of the operator application T_p τ_t in the functional space, replacing this with an expectation of τ_t that can also be directly estimated from the data. We now discuss some practical refinements of the approach.

2.2. Normalization

For τ_t to be a proper ratio µ_t/p, it should be normalized w.r.t. p, i.e., E_p[τ_t] = 1. To address this issue, we explicitly ensure normalization by considering a constrained optimization in place of (11):

$$\tau_{t+1} = \arg\min_{\tau}\; \tfrac{1}{2}\,\mathbb{E}_{p(x')}\big[\tau(x')^2\big] - \mathbb{E}_{T_p(x,x')}\big[\tau_t(x)\,\tau(x')\big], \quad \text{s.t. } \mathbb{E}_{p(x)}[\tau(x)] = 1. \tag{12}$$

We can tackle this by solving its Lagrangian. To avoid instability, we add a regularization term:

$$\min_{\tau}\max_{v\in\mathbb{R}}\; J(\tau, v) = \tfrac{1}{2}\,\mathbb{E}_{p(x')}\big[\tau(x')^2\big] - \mathbb{E}_{T_p(x,x')}\big[\tau_t(x)\,\tau(x')\big] + v\,\big(\mathbb{E}_p[\tau] - 1\big) - \tfrac{\lambda}{2} v^2, \tag{13}$$

where λ > 0 is a regularization parameter. Crucially, the dual variable v is a scalar, making this problem much simpler than dual embedding (Dai et al., 2017), where the dual variables form a parameterized function that introduces approximation error. The problem (13) is a straightforward convex-concave objective with respect to (τ, v) that can be optimized by stochastic gradient descent. The following theorem shows that under certain conditions, the normalization will be maintained for any λ > 0.

Theorem 1 (Normalization of solution)  If E_p[τ_t] = 1, then for any λ > 0, the estimator (13) has the same solution as (12), hence E_p[τ_{t+1}] = 1.
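As a sanity check on the variational update, the sketch below (illustrative only; the hidden ground-truth chain is used solely to generate a batch and is never touched by the update) solves the empirical version of (11) with a tabular τ and verifies that its closed-form minimizer coincides with the sample analogue of the power update (6). The normalization constraint is omitted here because, when E_p[τ_t] = 1, the unconstrained minimizer is already approximately normalized, mirroring Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite-state illustration of one variational power step, Eq. (11), from samples only.
n_states, n_samples = 5, 20_000

# Hidden ground truth used only to *generate* the batch; the update below never touches it.
T = rng.random((n_states, n_states)) + 0.1
T /= T.sum(axis=1, keepdims=True)
p = rng.random(n_states) + 0.1
p /= p.sum()

x = rng.choice(n_states, size=n_samples, p=p)                  # x_i ~ p
x_next = np.array([rng.choice(n_states, p=T[s]) for s in x])   # x'_i ~ T(.|x_i)

tau_t = np.ones(n_states)                                      # current ratio iterate

# The empirical objective (1/2) E_p[tau(x)^2] - E_{T_p}[tau_t(x) tau(x')] is separable
# per state for a tabular tau, so its minimizer is the sample analogue of (T_p tau_t)/p.
p_hat = np.bincount(x, minlength=n_states) / n_samples
b_hat = np.bincount(x_next, weights=tau_t[x], minlength=n_states) / n_samples
tau_next = b_hat / p_hat

# Compare with the idealized update (6) computed from the hidden T and p.
tau_exact = (T.T @ (p * tau_t)) / p
print(np.max(np.abs(tau_next - tau_exact)))   # should be small (statistical error only)
```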
Algorithm 1  Variational Power Method
1: Input: Transition data D = {(x_i, x'_i)}_{i=1}^n, learning rates η_θ, η_v, number of power steps T, number of inner optimization steps M, batch size B
2: Initialize θ
3: for t = 1 . . . T do
4:   Update and fix the reference network τ_t = τ_θ
5:   for m = 1 . . . M do
6:     Sample transition data {(x_i, x'_i)}_{i=1}^B
7:     Compute gradients ∇_θ J and ∇_v J from (16)
8:     θ ← θ − η_θ ∇_θ J    (gradient descent)
9:     v ← v + η_v ∇_v J    (gradient ascent)
10:   end for
11: end for
12: Return τ_θ

Hence, we can begin with any τ_0 satisfying E_p[τ_0] = 1 (e.g., τ_0 = arg min_τ (E_p[τ] − 1)²), and the theorem ensures that the normalization of τ_{t+1} will be inductively maintained using any fixed λ > 0. The proof is given in Appendix A.

2.3. Damped Iteration

The next difficulty to be addressed arises from the fact that, in practice, we need to optimize the variational objective based on sampled data, which induces approximation error since we are replacing the true operator T_p by a stochastic estimate T̂_p such that E[T̂_p] = T_p. Without proper adjustment, such estimation errors can accumulate over the power iterations and lead to inaccurate results. To control the error due to sampling, we introduce a damped version of the update (Ryu and Boyd, 2016): instead of performing the stochastic update τ_{t+1} = (T̂_p τ_t)/p, we perform a damped update given by

$$\tau_{t+1} = (1 - \eta_{t+1})\,\tau_t + \eta_{t+1}\,\frac{\hat T_p \tau_t}{p}, \tag{14}$$

where η_t ∈ (0, 1) is a stepsize parameter. Intuitively, the update error introduced by the stochasticity of T̂_p is now controlled by the stepsize η_t. The choice of stepsize and the convergence of the algorithm are discussed in Section 3. The damped iteration can be conveniently implemented with minor modifications to the previous objective. We only need to replace the sample from T_p in (13) by a weighted sample:

$$\min_{\tau}\max_{v\in\mathbb{R}}\; J(\tau, v) = \tfrac{1}{2}\,\mathbb{E}_{p(x')}\big[\tau(x')^2\big] - (1 - \eta_{t+1})\,\mathbb{E}_{p(x')}\big[\tau_t(x')\,\tau(x')\big] - \eta_{t+1}\,\mathbb{E}_{T_p(x,x')}\big[\tau_t(x)\,\tau(x')\big] + v\,\big(\mathbb{E}_p[\tau] - 1\big) - \tfrac{\lambda}{2} v^2. \tag{15}$$

2.4. A Practical Algorithm

A practical version of VPM is described in Algorithm 1. It solves (15) using a parameterized τ_θ : X → R expressed as a neural network with parameters θ. Given the constraint τ ≥ 0, we add a softplus activation log(1 + exp(·)) to the final layer to ensure positivity. The expectations with respect to p and T_p are directly estimated from sampled data. When optimizing θ by stochastic gradient methods, we maintain a copy of the previous network τ_t as the reference network to compute the second and third terms of (15). The gradients of J(τ_θ, v) with respect to θ and v are given by

$$\nabla_\theta J(\tau_\theta, v) = \mathbb{E}_p[\tau_\theta \nabla_\theta \tau_\theta] - (1 - \eta_{t+1})\,\mathbb{E}_p[\tau_t \nabla_\theta \tau_\theta] - \eta_{t+1}\,\mathbb{E}_{T_p}[\tau_t(x)\,\nabla_\theta \tau_\theta(x')] + v\,\mathbb{E}_p[\nabla_\theta \tau_\theta], \tag{16}$$

$$\nabla_v J(\tau_\theta, v) = \mathbb{E}_p[\tau_\theta] - 1 - \lambda v.$$

After convergence of θ in each iteration, the reference network is updated by setting τ_{t+1} = τ_θ. Note that one may apply other gradient-based optimizers instead of SGD.

3. Convergence Analysis

We now demonstrate that the final algorithm obtains sufficient control over error accumulation to achieve consistency. For notational brevity, we discuss the result for the simpler form (5) instead of the ratio form (6); the argument easily extends to the ratio form. Starting from the plain stochastic update µ_t = T̂µ_{t−1}, the damped update can be expressed by

$$\mu_t = (1 - \eta_t)\,\mu_{t-1} + \eta_t\,\hat T \mu_{t-1} = (1 - \eta_t)\,\mu_{t-1} + \eta_t\, T \mu_{t-1} + \eta_t\,\xi_t,$$

where ξ_t is the error due to stochasticity in T̂. The following theorem establishes the convergence properties of the damped iteration.

Theorem 2 (Informal)  Under mild conditions, after t iterations with step-size η_t ∝ 1/√t,

$$\mathbb{E}\big[\|\mu_R - T\mu_R\|^2\big] \;\le\; \frac{C_1 + C_2 \ln t}{\sqrt{t}}$$

for some constants C_1, C_2 > 0, where the expectation is taken over the distribution of iterates (µ_R)_{R=1}^t. In other words, E[‖µ_R − Tµ_R‖²] → 0, and consequently µ_R converges to µ for ergodic T.

The precise version of the theorem statement, together with a complete proof, is given in Appendix B.
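To illustrate the role of damping, here is a small finite-state simulation (illustrative only; the additive noise is a stand-in for the error of a sample-based operator estimate, not the paper's actual estimator) of the damped update analysed in Theorem 2, with step-size η_t = 1/√t. With a constant step size η ≡ 1, the same noise would keep the residual roughly at the noise level instead of letting it shrink:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 5
T = rng.random((n, n)) + 0.1
T /= T.sum(axis=1, keepdims=True)     # row-stochastic transition matrix

def noisy_apply(mu, noise=0.05):
    """Unbiased but noisy estimate of T^T mu (zero-mean perturbation)."""
    return T.T @ mu + noise * rng.standard_normal(n)

mu = np.ones(n) / n
for t in range(1, 2001):
    eta = 1.0 / np.sqrt(t)            # decaying step size, as in Theorem 2
    mu = (1.0 - eta) * mu + eta * noisy_apply(mu)

print(np.linalg.norm(mu - T.T @ mu))  # residual ||mu_t - T mu_t||, the quantity bounded in Theorem 2
mu_hat = np.clip(mu, 0.0, None)
mu_hat /= mu_hat.sum()                # renormalize to read off a distribution estimate
print(mu_hat)
```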
Note that the optimization quality depends on the number of samples, the approximation error of the parametric family, and the optimization algorithm. There is a complex tradeoff between these factors (Bottou and Bousquet, 2008). On one hand, with more data, the statistical error is reduced, but the computational cost of the optimization increases. On the other hand, a more flexible parametrization, such as a neural network, reduces the approximation error, but adds to the difficulty of optimization as the problem might no longer be convex. In addition, as the complexity of the parameterized family increases, the statistical error also increases. Representing τ in a reproducing kernel Hilbert space (RKHS) is a particularly interesting case, because the problem (13) becomes convex, hence the optimization error of the empirical surrogate is reduced to zero. Nguyen et al. (2008, Theorem 2) show that, under mild conditions, the statistical error can be bounded in terms of Hellinger distance at a rate governed by β, the exponent of the bracketing entropy of the RKHS, while the approximation error will depend on the RKHS (Bach, 2014).

4. Related Work

The algorithm we have developed reduces distribution estimation to density ratio estimation, which has been extensively studied in numerous contexts. One example is learning under covariate shift (Shimodaira, 2000), where the ratio can be estimated by different techniques (Gretton et al., 2009; Nguyen et al., 2008; Sugiyama et al., 2008; Sugiyama and Kawanabe, 2012). These previous works differ from the current setting in that they require data to be sampled from both the target and proposal distributions. By contrast, we consider a substantially more challenging problem, where only data sampled from the proposal is available, and the target distribution is given only implicitly by (1) through the transition kernel T. A more relevant approach is Stein importance sampling (Liu and Lee, 2017), where the ratio is estimated by minimizing the kernelized Stein discrepancy (Liu et al., 2016). However, it requires additional gradient information about the target potential, whereas our method only requires sampled transitions. Moreover, the method of Liu and Lee (2017) is computationally expensive and does not extrapolate to new examples.

The algorithm we develop in this paper is inspired by the classic power method for finding principal eigenvectors. Many existing works have focused on the finite-dimensional setting (Balsubramani et al., 2013; Hardt and Price, 2014; Yang et al., 2017), while Kim et al. (2005) and Xie et al. (2015) have extended the power method to the infinite-dimensional case using an RKHS. Not only do these algorithms require access to the transition kernel T, but they also require tractable operator multiplications. In contrast, our method avoids direct interaction with the operator T, and can use flexible parametrizations (such as neural networks) to learn the density ratio without per-step renormalization.

Another important class of methods for estimating or sampling from stationary distributions is based on simulation. A prominent example is Markov chain Monte Carlo (MCMC), which is widely used in many statistical inference scenarios (Andrieu et al., 2003; Koller and Friedman, 2009; Welling and Teh, 2011). Existing MCMC methods (e.g., Neal et al., 2011; Hoffman and Gelman, 2014) require repeated, and often many, interactions with the transition operator T to acquire a single sample from the stationary distribution. Instead, VPM can be applied when only a fixed sample is available.
Interestingly, this suggests that VPM can be used to post-process samples generated from typical MCMC methods to possibly make more effective use of the data. We demonstrate this possibility empirically in Section 5. Unlike VPM, other post-processing methods (Oates et al., 2017) require additional information about the target distribution (Robert and Casella, 2004). Recent advances have also shown that learning parametric samplers can be beneficial (Song et al., 2017; Li et al., 2019), but these require the potential function. In contrast, VPM directly learns the stationary density ratio solely from transition data.

One important application of VPM is off-policy RL (Precup et al., 2001). In particular, in off-policy evaluation (OPE), one aims to evaluate a target policy's performance given data collected from a different behavior policy. This problem matches our proposed framework, as the collected data naturally consists of transitions from a Markov chain, and one is interested in estimating quantities computed from the stationary distribution of a different policy. (See Appendix C for a detailed description of how the VPM algorithm can be applied to OPE, even when γ = 1.) Standard importance weighting is known to have high variance, and various techniques have been proposed to reduce variance (Precup et al., 2001; Jiang and Li, 2016; Rubinstein and Kroese, 2016; Thomas and Brunskill, 2016; Guo et al., 2017). However, these methods still exhibit exponential variance in the trajectory length (Li et al., 2015; Jiang and Li, 2016). More related to the present paper is the recent work on off-policy RL that avoids this exponential blowup of variance: it is sufficient to adjust observed rewards according to the ratio between the target and behavior stationary distributions (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019). Unfortunately, these methods require knowledge of the behavior policy, p(a|s), in addition to the transition data, which is not always available in practice. In this paper, we focus on the behavior-agnostic scenario where p(a|s) is unknown. Although the recent work of Nachum et al. (2019) considers the same scenario, their approach is only applicable when the discount factor γ < 1, whereas the method in this paper can handle any γ ∈ [0, 1].

5. Experimental Evaluation

In this section, we demonstrate the advantages of VPM in four representative applications. Due to space limits, experiment details are provided in Appendix D.

Figure 1. Log KL divergence between the estimate and the truth, as a function of (a) the number of samples and (b) the finish probability.

5.1. Queueing

In this subsection, we use VPM to estimate the stationary distribution of queue length. Following standard Kendall's notation in queueing theory (Haviv, 2009; Serfozo, 2009), we analyze the discrete-time Geo/Geo/1 queue, which is commonly used in the literature (Atencia and Moreno, 2004; Li and Tian, 2008; Wang et al., 2014). Here the customer inter-arrival time and service time are geometrically distributed, with one service desk. The probe distribution p(x) is a uniform distribution over the states in a predefined range [0, B). An observed transition (x, x') records the queue length before and after one time step. The queue has a closed-form stationary distribution that we can compare to (Serfozo, 2009, Sec. 1.11).
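For concreteness, the following sketch (hypothetical placeholder data, not the paper's experiment) shows how ratio estimates over queue lengths are turned into a weighted empirical pmf and scored against a reference distribution with the KL divergence reported in Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(3)
B = 10                                 # probe range [0, B)
x = rng.integers(0, B, size=1000)      # probe states x_i ~ Uniform{0, ..., B-1}
tau_hat = rng.random(1000) + 0.5       # stand-in for learned ratios tau_hat(x_i)

# Weighted empirical pmf: mu_hat(x) proportional to sum_i tau_hat(x_i) * 1{x_i = x}.
weights = np.bincount(x, weights=tau_hat, minlength=B)
mu_hat = weights / weights.sum()

# KL(true || estimate) against a reference pmf (an arbitrary placeholder here).
true_pmf = 0.6 * 0.4 ** np.arange(B)
true_pmf /= true_pmf.sum()
kl = np.sum(true_pmf * np.log(true_pmf / np.maximum(mu_hat, 1e-12)))
print(np.log(kl))                      # Fig. 1 reports the log KL divergence
```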
Fig. 1 provides the log KL divergence between the estimated and true stationary distributions. We compare VPM to a model-based approach, which estimates the transition matrix T̂(x'|x) from the same set of data, then simulates a long trajectory using T̂. It can be seen that our method is more effective across different sample sizes and queue configurations.

5.2. Solving SDEs

We next apply VPM to solve a class of SDEs known as the Ornstein–Uhlenbeck process (OUP), which finds many applications in biology (Butler and King, 2004), financial mathematics and the physical sciences (Oksendal, 2013). The process is described by the equation

$$dX = \theta\,(m - X)\,dt + \sigma\,dW,$$

where m is the asymptotic mean, σ > 0 is the deviation, θ > 0 determines the strength, and W is the Wiener process. The OUP has a closed-form solution, which converges to the stationary distribution, a normal distribution N(m, σ²/(2θ)), as t → ∞. This allows us to conveniently calculate the Maximum Mean Discrepancy (MMD) between the adjusted sample and a true sample. We compare our method with the Euler–Maruyama (EM) method (Gardiner, 2009), which is a standard simulation-based method for solving SDEs. VPM uses samples from the EM steps to train the ratio network, and the learned ratio is used to compute a weighted MMD. The results are shown in Fig. 2, with different configurations of the parameters (m, σ, θ). It can be seen that VPM consistently improves over the EM method in terms of the log MMD to a true sample from the normal distribution. The EM method only uses the most recent data, which can be wasteful since past data can carry additional information about the system dynamics. In addition, we perform experiments based on real-world phylogeny studies, where the OUP is widely used to model the evolution of various organism traits. The results for two configurations (Beaulieu et al., 2012; Santana et al., 2012, Tab. 3 & 1, resp.) are shown in Fig. 2d. Notably, VPM improves over the EM method by correcting the sample with the learned ratio.

Figure 2. Log MMD versus number of EM steps across different settings, varying (a) the mean m, (b) the deviation σ and (c) the strength θ, with default (m, σ, θ) = (2, 2, 2). (d) is based on the real-world phylogeny studies (Beaulieu et al., 2012; Santana et al., 2012) with (m, σ, θ) = (0.618, 1.584, 3.85) and (0.661, 0.710, 8.837), respectively.

5.3. Post-processing MCMC

In this experiment, we demonstrate how VPM can post-process MCMC to use transition data more effectively in order to learn the target distributions. We use four common potential functions, as shown in the first column of Fig. 3 (Neal, 2003; Rezende and Mohamed, 2015; Li et al., 2018). A point is sampled from the uniform distribution p(x) = Unif(x; [−6, 6]²), then transitioned through an HMC operator (Neal et al., 2011). The transitioned pairs are used as the training set D. We compare VPM to a model-based method that explicitly learns a transition model T̂(x'|x), parametrized as a neural network that outputs a Gaussian mean (with a fixed standard deviation of 0.1). Then, we apply T̂ to a hold-out set drawn from p(x) sufficiently many times, and use the final instances as limiting samples (second column of Fig. 3). As for VPM, since p is uniform, the estimated τ̂ is proportional to the true stationary distribution. To obtain limiting samples (third column of Fig. 3), we resample from a hold-out set drawn from p(x) with probability proportional to τ̂.

The results are shown in Fig. 3. Note that the model-based method quickly collapses all training data into high-probability regions as stationary distributions, which is an inevitable tendency of the restricted parametrization of T̂. Our learned ratio faithfully reconstructs the target density, as shown in the right-most column of Fig. 3. The resampled data of VPM are much more accurate and diverse than those of the model-based method.

Figure 3. Columns show (a) the potentials, samples from the model-based method (2nd column), samples from VPM (3rd column), and (d) the estimated density. Rows (from top to bottom) correspond to the data sets: 2gauss, funnel, kidney, banana.
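The resampling step just described is simple to implement. The sketch below (hypothetical placeholders for the hold-out set and the learned ratio, not the paper's code) resamples hold-out points drawn from the uniform probe with probability proportional to τ̂, so the resampled points approximately follow the target distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
holdout = rng.uniform(-6.0, 6.0, size=(5000, 2))      # hold-out points from p = Unif([-6, 6]^2)
tau_hat = np.exp(-0.5 * np.sum(holdout**2, axis=1))   # stand-in for the learned ratio network

# Resample with probability proportional to the ratio estimates.
probs = tau_hat / tau_hat.sum()
idx = rng.choice(len(holdout), size=2000, replace=True, p=probs)
resampled = holdout[idx]                               # approximate sample from the target density
print(resampled.mean(axis=0))
```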
These experiments show that VPM can indeed effectively use a fixed set of data to recover the stationary distribution without additional information. To compare the results quantitatively, Fig. 4 shows the MMD of the estimated sample to a true sample. Since there is no easy way to sample from the potential function directly, the true sample consists of data obtained after 2k HMC steps with a rejection sampler. After each MCMC step, VPM takes the transition pairs as input and adjusts the sample importance according to the learned ratio. As we can see, after each MCMC step, VPM is able to post-process the data and further reduce the MMD by applying the ratio. The improvement is consistent across MCMC steps and datasets.

Figure 4. MMD before and after ratio correction using VPM.

5.4. Off-Policy Evaluation

Finally, we apply our method to the behavior-agnostic off-policy evaluation problem outlined in Section 1, in which only the transition data and the target policy are given, while the behavior policy is unknown. Concretely, given a sample D = {(s_i, a_i, r_i, s'_i)}_{i=1}^n from the behavior policy, we compose each transition in D with a target action a' ~ π(·|s'). Denoting x = (s, a) and x' = (s', a'), the data set can be expressed as D = {(x_i, x'_i)}_{i=1}^n. Applying the proposed VPM with the induced transition kernel T(x'|x), we can estimate µ(s, a)/p(s, a); hence the average accumulated reward can be obtained via (3). Additional derivation and discussion can be found in Appendix C.

We conduct experiments on the (discrete) Taxi environment as in Liu et al. (2018), and on challenging (continuous) environments including Reacher, Half Cheetah and Ant. Taxi is a gridworld environment in which the agent navigates to pick up and drop off passengers at specific locations. The target and behavior policies are set as in Liu et al. (2018). For the continuous environments, the Reacher agent tries to reach a specified location by swinging a robotic arm, while the Half Cheetah/Ant agents are complex robots that try to move forward as much as possible. The target policy is a pre-trained PPO or A2C neural network, which produces a Gaussian action distribution N(m_t, σ_t). The behavior policy is the same as the target policy but uses a larger action variance σ_b = (1 − α)σ_t + 2ασ_t, with α ∈ (0, 1]. We collect T trajectories of n steps each, using the behavior policy.
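The data construction above reduces OPE to the generic ratio-estimation problem. The sketch below (illustrative only; `target_policy_sample` and the toy arrays are hypothetical placeholders) builds the state-action pairs fed to VPM and forms the reward estimate of Eq. (3) from learned ratios:

```python
import numpy as np

def build_vpm_pairs(states, actions, next_states, target_policy_sample, rng):
    """Return ((s, a), (s', a')) pairs with a' drawn from the target policy."""
    next_actions = np.array([target_policy_sample(s_next, rng) for s_next in next_states])
    x = np.concatenate([states, actions], axis=1)
    x_next = np.concatenate([next_states, next_actions], axis=1)
    return x, x_next

def ope_estimate(rewards, tau_hat):
    """Average per-step reward under the target policy, cf. Eq. (3)."""
    return np.mean(np.asarray(tau_hat, dtype=float) * np.asarray(rewards, dtype=float))

# Toy usage with a Gaussian target policy (purely illustrative shapes and values).
rng = np.random.default_rng(5)
S, A = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
S_next, R = rng.standard_normal((4, 3)), rng.standard_normal(4)
pi = lambda s, rng: rng.normal(loc=s[:1], scale=0.5)   # a' ~ N(s[0], 0.5^2), shape (1,)
x, x_next = build_vpm_pairs(S, A, S_next, pi, rng)
print(ope_estimate(R, tau_hat=np.ones(4)))
```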
We compare VPM to a model-based method that estimates both the transition T and reward R functions. Using behavior cloning, we also compare to the trajectory-wise and step-wise weighted importance sampling methods (WIST, WISS) (Precup et al., 2001), as well as to Liu et al. (2018) with their public code for the Taxi environment.

The results are shown in Fig. 5. The x-axes correspond to different configurations and the y-axes are the log Mean Square Error (MSE) to the true average target-policy reward, estimated from abundant on-policy data collected from the target policy. As we can see, VPM outperforms all baselines significantly across different settings, including the number of trajectories, the trajectory length and the behavior policies. The method of Liu et al. (2018) can suffer from not knowing the behavior policy, as seen in the Taxi environment. The weighted importance sampling methods (WIST, WISS) also require access to the behavior policy.

Figure 5. Log MSE of different methods for various datasets and settings (rows vary the trajectory length n, the number of trajectories T and the behavior parameter α; columns correspond to the environments, including (b) Reacher and (c) Half Cheetah).

5.5. Ablation Study

In this section, we conduct an ablation study to show that VPM is robust to different choices of the parameters. Fig. 6 shows the MMD curves for the MCMC funnel dataset from Section 5.3, using different learning rates, numbers of inner optimization steps and regularization values λ. Other datasets show similar trends.

Learning rates. In all experiments, we use the Adam optimizer (Kingma and Ba, 2014). Fig. 6a shows the convergence behaviour with different learning rates in {0.0003, 0.0006, 0.001, 0.003}. The algorithm can take longer to converge when using a small learning rate. Even though a large learning rate (e.g., 0.003) seems to converge faster, its final solution can be noisy. We can see that VPM works across different learning rates around the default Adam learning rate of 0.001.

Number of inner optimization steps. Recall that in each power iteration, VPM solves an inner optimization, Eq. (15). Fig. 6b shows the effect of different numbers of inner optimization steps M. Larger M can produce a more accurate power iterate and converge faster in terms of the number of power iterations, but the time per iteration will also increase accordingly. If M is too small (e.g., 3), the learning can be unstable and the final ratio network can be inaccurate. Due to the damped update, the error in each power iteration can be controlled effectively and VPM can converge to the optimal ratio as long as M is reasonably large.

Regularization. Finally, we investigate the effect of the regularization parameter λ. Intuitively, λ controls the capability of the dual variable v in Eq. (13). The results are shown in Fig. 6c. Although different λ values can have different convergence speeds, their final solutions all achieve low MMD given sufficient iterations, as suggested by Theorem 1.

Figure 6. Ablation study: MMD versus number of power iterations for the funnel dataset, varying (a) the learning rate, (b) the number of inner steps M and (c) the regularization λ. Default (lr, M, λ) = (0.001, 10, 0.5).
6. Conclusion

We have formally considered the problem of estimating the stationary distribution of an ergodic Markov chain using a fixed set of transition data. We extended a classical power iteration approach to the batch setting, using an equivalent variational reformulation of the update rule to bypass the lack of direct access to the transition operator and the intractable operations in a functional space, yielding a new algorithm, the Variational Power Method (VPM). We characterized the convergence of VPM theoretically, and demonstrated its empirical advantages for improving existing methods on several important problems such as queueing, solving SDEs, post-processing MCMC and behavior-agnostic off-policy evaluation.

References

Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.

Ivan Atencia and Pilar Moreno. The discrete-time Geo/Geo/1 queue with negative customers and disasters. Computers & Operations Research, 31(9):1537–1548, 2004.

Francis R. Bach. Breaking the curse of dimensionality with convex neural networks. CoRR, abs/1412.8690, 2014.

Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, pages 3174–3182, 2013.

Jeremy M. Beaulieu, Dwueng-Chwuan Jhwueng, Carl Boettiger, and Brian C. O'Meara. Modeling stabilizing selection: expanding the Ornstein–Uhlenbeck model of adaptive evolution. Evolution: International Journal of Organic Evolution, 66(8):2369–2383, 2012.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Marguerite A. Butler and Aaron A. King. Phylogenetic comparative analysis: a modeling approach for adaptive evolution. The American Naturalist, 164(6):683–695, 2004.

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pages 1458–1467, 2017.

Crispin Gardiner. Stochastic Methods, volume 4. Springer, Berlin, 2009.

Carles Gelada and Marc G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3647–3655, 2019.

Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.

Zhaohan Guo, Philip S. Thomas, and Emma Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2492–2501, 2017.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1372–1383. JMLR.org, 2017.

Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861–2869, 2014.

Moshe Haviv. Queues: A Course in Queueing Theory. The Hebrew University, Jerusalem, 91905, 2009.
Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661, 2016.

Leo P. Kadanoff. Statistical Physics: Statics, Dynamics and Renormalization. World Scientific Publishing Company, 2000.

Kwang In Kim, Matthias O. Franz, and Bernhard Schölkopf. Iterative kernel principal component analysis for image modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9):1351–1366, 2005.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

David A. Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Chunyuan Li, Ke Bai, Jianqiao Li, Guoyin Wang, Changyou Chen, and Lawrence Carin. Adversarial learning of a sampler based on an unnormalized distribution. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3302–3311, 2019.

Ji-hong Li and Nai-shuo Tian. Analysis of the discrete-time Geo/Geo/1 queue with single working vacation. Quality Technology & Quantitative Management, 5(1):77–89, 2008.

Lihong Li, Remi Munos, and Csaba Szepesvari. Toward minimax off-policy value estimation. In Artificial Intelligence and Statistics, pages 608–616, 2015.

Wenliang Li, Dougal Sutherland, Heiko Strathmann, and Arthur Gretton. Learning deep kernels for exponential family densities. arXiv preprint arXiv:1811.08357, 2018.

Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. ISBN 0387952306.

Qiang Liu and Jason Lee. Black-box importance sampling. In Artificial Intelligence and Statistics, pages 952–961, 2017.

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284, 2016.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5356–5366, 2018.

Sean Meyn, Richard L. Tweedie, and Peter W. Glynn. Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2nd edition, 2009. doi: 10.1017/CBO9780511626630.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X, 9780262018258.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. CoRR, abs/1906.04733, 2019. URL http://arxiv.org/abs/1906.04733.

Radford M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.

Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

X. L. Nguyen, M. Wainwright, and M. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems 20, pages 1089–1096. MIT Press, Cambridge, MA, 2008.

Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
Bernt Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, 2013.

Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pages 417–424, 2001.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, second edition, 2004.

Reuven Y. Rubinstein and Dirk P. Kroese. Simulation and the Monte Carlo Method, volume 10. John Wiley & Sons, 2016.

Ernest K. Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43, 2016.

Sharlene E. Santana, Ian R. Grosse, and Elizabeth R. Dumont. Dietary hardness, loading behavior, and the evolution of skull form in bats. Evolution: International Journal of Organic Evolution, 66(8):2587–2598, 2012.

Richard Serfozo. Basics of Applied Stochastic Processes. Springer Science & Business Media, 2009.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 2000.

Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.

Masashi Sugiyama and Motoaki Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

Fang Wang, Jinting Wang, and Feng Zhang. Equilibrium customer strategies in the Geo/Geo/1 queue with single working vacation. Discrete Dynamics in Nature and Society, 2014, 2014.

Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3589–3597. JMLR.org, 2017.

Max Welling and Yee-Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), pages 681–688, 2011.

Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. arXiv preprint arXiv:1504.03655, 2015.

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pages 9665–9675, 2019.

Lin F. Yang, Vladimir Braverman, Tuo Zhao, and Mengdi Wang. Online factorization and partition of complex networks from random walks. arXiv preprint arXiv:1705.07881, 2017.