# Particle Flow Bayes' Rule

Xinshi Chen\*¹, Hanjun Dai\*², Le Song²³

\*Equal contribution. ¹School of Mathematics, ²School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA. ³Ant Financial, Hangzhou, China. Correspondence to: Xinshi Chen.

**Abstract.** We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists. Its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus generalize across different priors, observations, and to sequential Bayesian inference. We demonstrate the generalization ability of our particle flow Bayes operator in several canonical and high dimensional examples.

## 1. Introduction

In many data analysis tasks, it is important to estimate unknown quantities $x \in \mathbb{R}^d$ from observations $\mathcal{O}_m := \{o_1, \dots, o_m\}$. Given prior knowledge $\pi(x)$ and likelihood functions $p(o_t|x)$, the essence of Bayesian inference is to compute the posterior $p(x|\mathcal{O}_m) \propto \pi(x)\prod_{t=1}^m p(o_t|x)$ by Bayes' rule. For many nontrivial models, the prior might not be conjugate to the likelihood, so the posterior has no closed form. Computing the posterior then involves intractable integration and poses significant challenges. Typically, one resorts to approximate inference methods such as sampling (e.g., MCMC) (Andrieu et al., 2003) or variational inference (Wainwright & Jordan, 2003).

In many real problems, observations arrive sequentially online, and Bayesian inference needs to be performed recursively,

$$\underbrace{p(x|\mathcal{O}_{m+1})}_{\text{updated posterior}} \;\propto\; \underbrace{p(x|\mathcal{O}_m)}_{\text{current posterior}} \; \underbrace{p(o_{m+1}|x)}_{\text{likelihood}}. \tag{1}$$

That is, the estimate of $p(x|\mathcal{O}_{m+1})$ should be computed from the estimate of $p(x|\mathcal{O}_m)$ obtained in the previous iteration and the new observation $o_{m+1}$. This requires algorithms that allow for efficient online inference.
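As a concrete instance of the recursion in Eq. (1), consider the conjugate Gaussian case, where the update is available in closed form. The following minimal sketch (our own illustration, not from the paper; all names are ours) makes the online nature of the problem explicit. For non-conjugate models no such closed form exists, which motivates the operators discussed next.

```python
# Recursive Bayes update (Eq. (1)) in the conjugate Gaussian case:
# prior x ~ N(mu, s), likelihood o|x ~ N(x, s_o). Hypothetical illustration.

def gaussian_bayes_update(mu, s, o, s_o):
    """One step of Eq. (1): combine the N(mu, s) prior with observation o."""
    precision = 1.0 / s + 1.0 / s_o          # posterior precision
    s_new = 1.0 / precision                  # posterior variance
    mu_new = s_new * (mu / s + o / s_o)      # posterior mean
    return mu_new, s_new

mu, s = 0.0, 1.0                             # prior N(0, 1)
for o in [0.8, 1.1, 0.9]:                    # observations arrive online
    mu, s = gaussian_bayes_update(mu, s, o, s_o=3.0)
    print(f"posterior: N({mu:.3f}, {s:.3f})")
```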
In this online setting, both standard MCMC and variational inference become inefficient: the former requires a complete scan of the dataset in each iteration, and the latter requires solving an optimization problem for every new observation. Sequential Monte Carlo (SMC) (Doucet et al., 2001; Balakrishnan & Madigan, 2006) and stochastic approximations, such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) and stochastic variational inference (Hoffman et al., 2012), were developed to improve efficiency. However, SMC suffers from the path degeneracy problem in high dimensions (Daum & Huang, 2003; Snyder et al., 2008); rejuvenation steps have been designed to counter it, but they may violate the online sequential update requirement (Canini et al., 2009; Chopin et al., 2013). Stochastic approximation methods are prescribed algorithms that cannot exploit the structure of the problem for further improvement.

To address these challenges, the seminal work on the Kernel Bayes Rule (KBR) views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS), which can be learned and then directly produces the posterior from the prior after each observation (Fukumizu et al., 2012). In the KBR framework, the posterior is represented as an embedding $\mu_m := \mathbb{E}_{p(x|\mathcal{O}_m)}[\phi(x)]$ using a feature map $\phi(\cdot)$ associated with a kernel function; the kernel Bayes operator $\mathcal{K}(\cdot, o)$ then takes this embedding as input and produces the embedding of the updated posterior,

$$\underbrace{\mu_{m+1}}_{\text{updated embedding}} = \mathcal{K}(\underbrace{\mu_m}_{\text{current embedding}},\, o_{m+1}). \tag{2}$$

Another novel aspect of the KBR method is that it has a training phase and a testing phase: the structure of the problem at hand (e.g., the likelihood) is taken into account during training, and at test time the learned operator $\mathcal{K}$ operates directly on the current posterior embedding $\mu_m$ to produce the output. However, despite this elegant concept, the KBR operator works well only for a limited range of problems due to its strong theoretical assumptions.

In this work, we aim to lift the limitations of the KBR operator by designing a novel continuous particle flow operator $\mathcal{F}$ that realizes the Bayes update; we call it the particle flow Bayes rule (PFBR). In the PFBR framework (Fig. 1), a prior distribution $\pi(x)$, or the current posterior $\pi_m(x) := p(x|\mathcal{O}_m)$, is approximated by a set of $N$ equally weighted particles $\mathcal{X}_m = \{x_m^1, \dots, x_m^N\}$. Given an observation $o_{m+1}$, the flow operator $\mathcal{F}(\mathcal{X}_m, x_m^n, o_{m+1})$ transports each particle $x_m^n$ to a new particle $x_{m+1}^n$ approximating the new posterior $p(x|\mathcal{O}_{m+1})$. That is,

$$\underbrace{x_{m+1}^n}_{\text{updated particle}} = \mathcal{F}(\mathcal{X}_m, \underbrace{x_m^n}_{\text{current particle}},\, o_{m+1}), \tag{3}$$

where $\mathcal{X}_{m+1} = \{x_{m+1}^1, \dots, x_{m+1}^N\}$ serves as samples from the new posterior $p(x|\mathcal{O}_{m+1})$. The PFBR operator $\mathcal{F}$ can then be applied recursively to $\mathcal{X}_{m+1}$ and a new observation $o_{m+2}$ to produce $\mathcal{X}_{m+2}$, and so on.

*Figure 1: PFBR framework: sequential Bayesian inference as a deterministic flow of particles.*

At a high level, we model the PFBR operator $\mathcal{F}$ as a continuous deterministic flow, which propagates the locations of particles and the values of their probability density simultaneously through a dynamical system described by ordinary differential equations (ODEs). A natural question is whether a fixed ODE-based Bayes operator applicable to different prior distributions and likelihood functions exists. We resolve this important theoretical question by making a novel connection between the PFBR operator and the Fokker-Planck equation of Langevin dynamics. The proof of existence also provides a basis for our parameterization of the PFBR operator using Deep Sets (Zaheer et al., 2017).

Similar to KBR, PFBR has a training phase and a testing phase. However, the training procedure is very different: it adopts a meta-learning framework (Andrychowicz et al., 2016), in which multiple related Bayesian inference tasks with different priors and observations are created, and the PFBR operator $\mathcal{F}$ learns from these tasks how to update posteriors given new observations. During the test phase, the learned PFBR is applied directly to new observations, without re-optimization and without storing previous observations (see the usage sketch after the related-work paragraph below). We conduct various experiments to show that the learned PFBR operator generalizes to new Bayesian inference tasks.

**Related work.** There is a recent surge of interest in ODE-based Bayesian inference (Chen et al., 2018; Zhang et al., 2018; Grathwohl et al., 2018; Lei et al., 2017). These works focus on fitting a single target distribution. Consequently, the learned flow cannot generalize directly to a new dataset, a new prior, or to the sequential setting without re-optimization.
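To make the interface of Eq. (3) concrete, the sketch below shows how a trained operator would be applied recursively at test time. `flow_operator` is a stand-in for a learned $\mathcal{F}$ (its internals are described in Section 2); its placeholder update rule and all names here are our own assumptions, not the authors' code.

```python
import numpy as np

def flow_operator(particles, obs):
    """Stand-in for a learned PFBR operator F (Eq. (3)): maps the current
    particle set and one new observation to an updated particle set.
    Placeholder dynamics only; a trained F would be a neural ODE."""
    return particles + 0.1 * (obs - particles)   # hypothetical update rule

N, d = 256, 2
particles = np.random.randn(N, d)                # X_0 ~ prior pi(x)
for obs in np.random.randn(20, d):               # observations arrive online
    particles = flow_operator(particles, obs)    # X_{m+1} = F(X_m, o_{m+1})

# `particles` now approximates p(x | O_20); posterior expectations are
# estimated by averaging test functions over the particle set.
mean_estimate = particles.mean(axis=0)
```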
## 2. Bayesian Inference as Particle Flow

We present the details in this section from four aspects: (1) How do we map sequential Bayesian inference to particle flow? (2) What are the properties of such a particle flow? (3) Does a shared flow-based Bayes operator $\mathcal{F}$ exist? (4) How do we parameterize the flow operator $\mathcal{F}$?

### 2.1. PFBR: Particle Flow Bayes Rule

The mapping from sequential Bayesian inference to particle flow goes as follows. Initially, $N$ particles $\mathcal{X}_0 = \{x_0^1, \dots, x_0^N\}$ are sampled i.i.d. from a prior $\pi(x)$. Given an observation $o_1$, the operator $\mathcal{F}$ transports the particles to $\mathcal{X}_1 = \{x_1^1, \dots, x_1^N\}$ to estimate the posterior $p(x|\mathcal{O}_1) \propto \pi(x)p(o_1|x)$. We define this transformation as the solution of an ODE: for each $n$,

$$\frac{dx}{dt} = f(\mathcal{X}_0, o_1, x(t), t), \;\; t \in (0, T], \;\; x(0) = x_0^n \;\;\Longrightarrow\;\; x_1^n = x(T).$$

The flow velocity $f$ takes the observation $o_1$ as input and determines both the direction and the speed of the change of $x(t)$. In this ODE model, each particle $x_0^n$ sampled from the prior gives an initial value $x(0)$, and the flow velocity $f$ evolves the particle continuously and deterministically. At terminal time $T$, we take the solution $x(T)$ as the transformed particle $x_1^n$ for estimating the posterior.

Applying this ODE-based transformation sequentially as new observations $o_2, o_3, \dots$ arrive, we can define a recursive particle flow Bayes operator, called PFBR, as

$$x_{m+1}^n = \mathcal{F}(\mathcal{X}_m, o_{m+1}, x_m^n) := x_m^n + \int_0^T f(\mathcal{X}_m, o_{m+1}, x(t), t)\, dt. \tag{4}$$

The resulting particle set $\mathcal{X}_{m+1}$ can be used to perform Bayesian inference, such as estimating the mean or quantifying the uncertainty of any test function, by averaging over these particles. For now, we assume $f$ has the form $f(\mathcal{X}, o, x(t), t)$ and is shared across different sequential stages; a rigorous discussion of the existence of a shared flow velocity of this form is given in Section 2.3. Next, we discuss further properties of this continuous particle flow, which will help us study the existence of such an operator for Bayesian inference and design the parameterization and learning of the flow velocity.

### 2.2. Properties of Continuous Deterministic Flow

The continuous transformation of $x(t)$ described by the ODE $dx/dt = f$ defines a deterministic flow for each particle. Let $q(x, t)$ be the probability density of the continuous random variable $x(t)$. The change of this density is also determined by $f$. More specifically, $q$ follows the continuity equation (Batchelor, 2000):

$$\partial q(x, t)/\partial t = -\nabla_x \cdot (q f), \tag{5}$$

where $\nabla_x \cdot$ is the divergence operator. The continuity equation is the mathematical expression of the law of local conservation of mass: mass can neither be created nor destroyed, nor can it teleport from one place to another. Given the continuity equation, one can describe the change of log-density by another differential equation (Theorem 2.1).

**Theorem 2.1.** *If $dx/dt = f$, then the change in log-density follows the differential equation (Chen et al., 2018)*

$$d \log(q(x, t))/dt = -\nabla_x \cdot f. \tag{6}$$

Since for any physical quantity $q(x, t)$ the distinction between the material derivative $d/dt$ and the partial derivative $\partial/\partial t$ is important, we clarify the definition before the proof of this theorem.

**Definition 2.1.** *The material derivative of $q(x, t)$ is defined as*

$$\frac{dq}{dt} = \frac{\partial q}{\partial t} + \nabla_x q \cdot \frac{dx}{dt}. \tag{7}$$

Note that $dq/dt$ is the rate of change of $q$ for a given particle as it moves along its trajectory $x = x(t)$ in the flow, while $\partial q/\partial t$ is the rate of change of $q$ at a particular point $x$ fixed in space.

**Proof of Theorem 2.1.** By the continuity equation, $\partial q/\partial t = -\nabla_x q \cdot f - q\, \nabla_x \cdot f$, so by Definition 2.1, $dq/dt = \partial q/\partial t + \nabla_x q \cdot f = -q\, \nabla_x \cdot f$. By the chain rule, $\frac{d \log q}{dt} = \frac{1}{q}\frac{dq}{dt} = \frac{1}{q}\big({-q}\, \nabla_x \cdot f\big) = -\nabla_x \cdot f$. □
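A minimal numerical sketch of Eqs. (4) and (6): Euler-integrate the particle ODE while accumulating the divergence of $f$ to track the log-density. We hand-pick a linear velocity field so the divergence is known in closed form; this is our own illustration, since in PFBR $f$ is a learned network whose divergence would be obtained by automatic differentiation.

```python
import numpy as np

# Hand-picked linear velocity field f(x) = A x, so div f = trace(A).
A = np.array([[-0.5, 0.2],
              [ 0.0, -0.3]])
div_f = np.trace(A)

T, n_steps = 1.0, 100
dt = T / n_steps

x = np.random.randn(256, 2)                         # particles from the prior N(0, I)
logq = -0.5 * (x ** 2).sum(1) - np.log(2 * np.pi)   # log-density of N(0, I) in 2-D

for _ in range(n_steps):
    x = x + dt * (x @ A.T)       # Euler step for dx/dt = f(x)   (Eq. (4))
    logq = logq - dt * div_f     # d log q / dt = -div f         (Eq. (6))
```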
Theorem 2.1 gives the same result as the Instantaneous Change of Variables Theorem stated by Chen et al. (2018). However, our statement is more precise in its use of material and partial derivatives, and our proof via the continuity equation is simpler and more intuitive. It also exposes the connection to other physics problems such as fluid dynamics and electromagnetism. With Theorem 2.1, we can compute the log-density of the particles by integrating over $(0, T]$ for each $n$:

$$\log q_{m+1}(x_{m+1}^n) = \log q_m(x_m^n) - \int_0^T \nabla_x \cdot f \, dt. \tag{8}$$

### 2.3. Existence of a Flow-based Bayes Rule

Does a unified flow velocity $f$ exist for different Bayesian inference tasks involving different priors and observations? If it does, what is the form of this function? These questions are non-trivial even in the simple Gaussian case. For instance, let the prior $\pi(x) = \mathcal{N}(0, \sigma_x)$ and the likelihood $p(o|x) = \mathcal{N}(x, \sigma)$ both be one-dimensional Gaussians. Given an observation $o = 0$, the posterior is $\mathcal{N}\big(0, \sigma\sigma_x/(\sigma + \sigma_x)\big)$. The ODE $dx/dt = f$ would therefore need to push a zero-mean Gaussian with variance $\sigma_x$ to another zero-mean Gaussian with variance $\sigma\sigma_x/(\sigma + \sigma_x)$, for any $\sigma_x$. It is not clear whether such a unified flow velocity $f$ exists, nor what form it should take. To resolve the existence issue, we first establish a connection between the deterministic flow of Section 2.2 and a stochastic flow: Langevin dynamics. We then leverage the relation between closed-loop and open-loop control to show the existence of a unified $f$.

#### 2.3.1. Connection to Stochastic Flow

Langevin dynamics is a stochastic process

$$dx(t) = \nabla_x \log\big(p(x|\mathcal{O}_m)\, p(o_{m+1}|x)\big)\, dt + \sqrt{2}\, dw(t), \tag{9}$$

where $w(t)$ is a standard Brownian motion. Given a fixed initial location $x(0)$, multiple runs of the Langevin dynamics up to time $t$ result in multiple random locations of $x(t)$, due to the randomness of $w(t)$. This stochastic flow is very different in nature from the deterministic flow in Section 2.2, where a fixed $x(0)$ always ends up at the same $x(t)$. Nonetheless, while Langevin dynamics is a stochastic flow of the continuous random variable $x(t)$, the probability density $q(x, t)$ of $x(t)$ follows a deterministic evolution according to the associated Fokker-Planck equation (Jordan et al., 1998):

$$\partial q/\partial t = -\nabla_x \cdot \big(q\, \nabla_x \log p(x|\mathcal{O}_m)\, p(o_{m+1}|x)\big) + \Delta_x q(x, t), \tag{10}$$

where $\Delta_x = \nabla_x \cdot \nabla_x$ is the Laplace operator. Furthermore, if the so-called potential function $\Psi(x) := -\log p(x|\mathcal{O}_m)\, p(o_{m+1}|x)$ is smooth and $e^{-\Psi} \in L^1(\mathbb{R}^d)$, the Fokker-Planck equation has a unique stationary solution in the form of a Gibbs distribution (Jordan et al., 1998),

$$q(x, \infty) = e^{-\Psi}/Z = p(x|\mathcal{O}_m)\, p(o_{m+1}|x)/Z. \tag{11}$$

Clearly, this stationary solution is the posterior distribution $p(x|\mathcal{O}_{m+1})$. We now rewrite the Fokker-Planck equation (10) in the form of the deterministic flow equation (5) from Section 2.2, and hence identify the corresponding flow velocity.

**Theorem 2.2.** *Assume the deterministic transformation of a continuous random variable $x(t)$ is $dx/dt = f$, where*

$$f = \nabla_x \log p(x|\mathcal{O}_m)\, p(o_{m+1}|x) - \nabla_x \log q(x, t) \tag{12}$$

*and $q(x, t)$ is the probability density of $x(t)$. If the potential function $\Psi$ is smooth and $e^{-\Psi} \in L^1(\mathbb{R}^d)$, then $q(x, t)$ converges to $p(x|\mathcal{O}_{m+1})$ as $t \to \infty$.*

**Proof.** By the continuity equation (5), the probability density $q(x, t)$ of $x(t)$ satisfies $\partial q/\partial t = \nabla_x \cdot \big(q\, (\nabla_x \Psi(x) + \nabla_x \log q(x, t))\big)$. This is the same as the Fokker-Planck equation (10), which can be seen by decomposing the Laplacian as $\Delta_x q = \nabla_x \cdot (q\, \nabla_x \log q)$. Under the stated conditions on the potential $\Psi$, the Fokker-Planck equation has a unique stationary distribution $q(x, \infty)$ equal to the posterior $p(x|\mathcal{O}_{m+1})$, so the deterministic flow with velocity (12) also converges to $p(x|\mathcal{O}_{m+1})$. □
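For the one-dimensional Gaussian example above, the flow (12) can be checked explicitly: if $q(x,t) = \mathcal{N}(0, s(t))$, then $\nabla_x \log q = -x/s(t)$, the velocity is linear in $x$, and the variance obeys $ds/dt = 2 - 2s/\sigma^*$ with $1/\sigma^* = 1/\sigma_x + 1/\sigma$, whose fixed point is exactly the posterior variance $\sigma\sigma_x/(\sigma+\sigma_x)$. The sketch below simulates a particle version of this flow, replacing the intermediate score $\nabla_x \log q$ with an empirical-variance plug-in; this shortcut is our own illustration, not the paper's estimator.

```python
import numpy as np

sigma_x, sigma = 1.0, 3.0                            # prior and likelihood variances
sigma_star = 1.0 / (1.0 / sigma_x + 1.0 / sigma)     # target posterior variance

x = np.random.randn(100_000) * np.sqrt(sigma_x)      # particles ~ N(0, sigma_x)
dt = 0.01
for _ in range(2000):
    s_hat = x.var()                                  # plug-in estimate of Var[x(t)]
    # Eq. (12) for this model: f(x) = -x/sigma_star + x/s_hat
    x = x + dt * (-x / sigma_star + x / s_hat)

print(x.var(), sigma_star)   # empirical variance approaches sigma*sigma_x/(sigma+sigma_x)
```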
The implication of Theorem 2.2 is that we can construct a deterministic flow of particles that attains the posterior, which establishes existence. However, the flow velocity in Eq. (12) depends on the intermediate density $q(x, t)$, which changes over time. This seemingly suggests that $f$ cannot be expressed as a fixed function of $p(x|\mathcal{O}_m)$ and $p(o_{m+1}|x)$. In the next section, we show that this dependence on $q(x, t)$ can be removed using the theory of optimal control for deterministic systems.

#### 2.3.2. Closed-loop to Open-loop Conversion

The question now is whether the term $\nabla_x \log q(x, t)$ in Eq. (12) can be made independent of $q(x, t)$, i.e., whether there is an equivalent form that achieves the same flow. To investigate this, we consider the following deterministic optimal control problem:

$$\min_w \; d\big(q(x, \infty),\; p(x|\mathcal{O}_{m+1})\big) \tag{13}$$

$$\text{s.t.} \quad \frac{dx}{dt} = \nabla_x \log p(x|\mathcal{O}_m)\, p(o_{m+1}|x) - w, \tag{14}$$

where $d$ can be any metric defined on the set of densities over $\mathbb{R}^d$. In Eq. (13), we optimize over $w$, which is usually called the control. By Theorem 2.2, $w = \nabla_x \log q(x, t)$ is clearly an optimal solution. Furthermore, the corresponding flow velocity derived from the Fokker-Planck equation can be regarded as the continuous steepest descent of $\mathrm{KL}\big(q(x, t)\,\|\,p(x|\mathcal{O}_{m+1})\big)$ under the Wasserstein distance (Jordan et al., 1998). We seek an alternative expression for this optimal solution that depends only on $p(x|\mathcal{O}_m)$ and $p(o_{m+1}|x)$. First, we introduce some terminology from the optimal control literature.

**Definition 2.2.** *A control in feedback form $w = w(q(x, t), t)$ is called closed-loop. In contrast, a control of the form $w = w(q(x, 0), t)$ is called open-loop. An open-loop control is determined once the initial state $q(x, 0)$ is observed, whereas a closed-loop control can adapt to the encountered states $q(x, t)$.*

**Theorem 2.3.** *For the optimal control problem in Eqs. (13) and (14), there exists an open-loop control $w^* = w^*(q(x, 0), t)$ such that the induced state $q^*(x, t)$ satisfies $q^*(x, \infty) = p(x|\mathcal{O}_{m+1})$. Moreover, $w^*$ has a fixed expression with respect to $p(x|\mathcal{O}_m)$ and $p(o_{m+1}|x)$ across different $m$.*

**Proof.** By Theorem 2.2, $w(q(x, t), t) := \nabla_x \log q(x, t)$ induces the optimal state $q(x, \infty) = p(x|\mathcal{O}_{m+1})$ and achieves zero loss, $d = 0$. Hence it is an optimal closed-loop control for this problem. Although in general closed-loop controls characterize solutions more strongly, in a deterministic system such as Eq. (14) the optimal closed-loop control and the optimal open-loop control give the same control law and achieve the same optimal loss (Dreyfus, 1964). Hence, there exists an optimal open-loop control $w^* = w^*(q(x, 0), t)$ whose induced state also achieves zero loss, so $q^*(x, \infty) = p(x|\mathcal{O}_{m+1})$. More details on expressing $w^*$ as a fixed function of $q(x, 0)$, $p(x|\mathcal{O}_m)$, $p(o_{m+1}|x)$ and $t$ are given in Appendix A. □

**Conclusion: a unified $f$.** In sequential Bayesian inference, we set $q(x, 0) = p(x|\mathcal{O}_m)$. Theorem 2.3 therefore shows that there exists a fixed, deterministic flow velocity $f$ of the form

$$f = \nabla_x \log p(x|\mathcal{O}_m)\, p(o_{m+1}|x) - w^*\big(p(x|\mathcal{O}_m), t\big), \tag{15}$$

which transforms $p(x|\mathcal{O}_m)$ into $p(x|\mathcal{O}_{m+1})$ and in turn defines a unified particle flow Bayes operator $\mathcal{F}$.
### 2.4. Parameterization

We design a practical parameterization of $f$ based on the expression of the unified flow in Eq. (15).

(i) $p(x|\mathcal{O}_m) \to \mathcal{X}_m$: Since we do not have full access to the density $p(x|\mathcal{O}_m)$ but have samples $\mathcal{X}_m = \{x_m^1, \dots, x_m^N\}$ from it, we use these samples as surrogates for $p(x|\mathcal{O}_m)$. A related example is the feature space embedding of distributions (Smola et al., 2007), $\mu_X(p) := \int_X \phi(x)\, p(x)\, dx \approx \frac{1}{N}\sum_{n=1}^N \phi(x^n)$, $x^n \sim p$. Ideally, if $\mu_X$ is an injective mapping from the space of probability measures over $X$ to the feature space, the resulting embedding can be treated as a sufficient statistic of the density, and any information we need from $p(x|\mathcal{O}_m)$ is preserved. Hence, we represent $p(x|\mathcal{O}_m)$ by $\frac{1}{N}\sum_{n=1}^N \phi(x_m^n)$, where $\phi(\cdot)$ is a nonlinear feature map to be learned. Since we use a neural version of $\phi(\cdot)$, this representation can also be regarded as a Deep Set (Zaheer et al., 2017).

(ii) $p(o_{m+1}|x) \to (o_{m+1}, x(t))$: In both Langevin dynamics and Eq. (15), the only term containing the likelihood is $\nabla_x \log p(o_{m+1}|x)$, so we can use this term as an input to $f$. When the likelihood function is fixed, we can also simply use the observation $o_{m+1}$, which gives similar performance in our experiments.

Overall, we parameterize the flow velocity as

$$f = h\Big(\tfrac{1}{N}\textstyle\sum_{n=1}^N \phi(x_m^n),\; o_{m+1},\; x(t),\; t\Big), \tag{16}$$

where $h$ and $\phi$ are neural networks (the specific architectures we use are given in Appendix C.1). Let $\theta \in \Theta$ denote their parameters, which are independent of $t$. From now on, we write $f = f_\theta(\mathcal{X}_m, o_{m+1}, x(t), t)$; a sketch of this parameterization is given below. In the next section, we propose a meta-learning framework for learning these parameters.
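The following is a minimal PyTorch sketch of Eq. (16), assuming simple MLPs for $h$ and $\phi$; the actual architectures are given in the paper's Appendix C.1, which we do not reproduce here, so all layer sizes below are our own assumptions.

```python
import torch
import torch.nn as nn

class FlowVelocity(nn.Module):
    """Sketch of Eq. (16): f = h( mean_n phi(x_m^n), o, x(t), t )."""
    def __init__(self, d, d_obs, d_emb=64, d_hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, d_hidden), nn.Tanh(),
                                 nn.Linear(d_hidden, d_emb))
        self.h = nn.Sequential(nn.Linear(d_emb + d_obs + d + 1, d_hidden),
                               nn.Tanh(), nn.Linear(d_hidden, d))

    def forward(self, particles, obs, x, t):
        # particles: (N, d) current set X_m; obs: (d_obs,); x: (B, d); t: scalar
        emb = self.phi(particles).mean(dim=0)             # Deep Set embedding of X_m
        B = x.shape[0]
        ctx = torch.cat([emb, obs], dim=0).expand(B, -1)  # shared context per particle
        t_col = torch.full((B, 1), float(t))
        return self.h(torch.cat([ctx, x, t_col], dim=1))  # velocity dx/dt

f_theta = FlowVelocity(d=2, d_obs=2)
x = torch.randn(256, 2)                                   # particles X_m
v = f_theta(x, torch.randn(2), x, 0.0)                    # (256, 2) velocities
```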
## 3. Learning Algorithm

Since we aim to learn a generalizable Bayes operator, we need to create multiple inference tasks as the training set and design the corresponding learning algorithm.

**Multi-task framework.** The training set $\mathcal{D}_{\text{train}}$ contains multiple inference tasks. Each task is a tuple $\mathcal{T} := \big(\pi(x),\, p(\cdot|x),\, \mathcal{O}_M := \{o_1, \dots, o_M\}\big) \in \mathcal{D}_{\text{train}}$, consisting of a prior distribution, a likelihood function, and a sequence of $M$ observations. A task with $M$ sequential observations can also be interpreted as a sequence of $M$ sub-tasks with one observation each: $\tau := \big(p(x|\mathcal{O}_m),\, p(\cdot|x),\, o_{m+1}\big) \in \big(\pi(x),\, p(\cdot|x),\, \mathcal{O}_M\big)$. Therefore, each task is a sequential Bayesian inference problem and each sub-task corresponds to one Bayes update step.

**Cumulative loss function.** For each sub-task we define a loss $\mathrm{KL}\big(q_m(x)\,\|\,p(x|\mathcal{O}_{m+1})\big)$, where $q_m(x)$ is the distribution transported by $\mathcal{F}$ at the $m$-th stage and $p(x|\mathcal{O}_{m+1})$ is the target posterior (see Fig. 1 for an illustration). The loss for the corresponding sequential task is then $\sum_{m=1}^M \mathrm{KL}\big(q_m(x)\,\|\,p(x|\mathcal{O}_m)\big)$, which sums the losses of all intermediate stages. Since its optimality is independent of normalizing constants, it is equivalent to minimizing the negative evidence lower bound (ELBO), estimated empirically using the particles $x_m^n$:

$$L(\mathcal{T}) = \sum_{m=1}^{M} \frac{1}{N} \sum_{n=1}^{N} \big(\log q_m(x_m^n) - \log p(x_m^n, \mathcal{O}_m)\big). \tag{17}$$

In each iteration of the training phase, we randomly sample a task from $\mathcal{D}_{\text{train}}$ and update the PFBR operator $\mathcal{F}$ by the loss gradient.

### 3.1. Training Task Creation

As in other meta-learning problems, the distribution of training tasks essentially determines the learner's performance on test tasks (Dai et al., 2017). Depending on the nature of the Bayesian problem, we propose two approaches to construct the multi-task training set: one is data driven, and the other is based on generative models. The general principle is that the collection of training priors has to be diverse enough, or representative of those that may be seen at test time.

**Data-driven approach.** We can use the posterior distribution obtained from the last Bayesian inference step as the new prior distribution. If the posterior has an analytical (but unnormalized) expression, we use that expression directly. Otherwise, since each Bayesian inference step generates a set of particles $\mathcal{X}_m = \{x_m^1, \dots, x_m^N\}$ corresponding to the posterior $p(x|\mathcal{O}_m)$, we can apply a kernel density estimator to these samples to obtain an empirical density function

$$\hat{\pi}(x; \mathcal{X}_m) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|x - x_m^n\|^2 / (2\sigma^2)}, \tag{18}$$

where $\sigma$ is the kernel bandwidth. This density function, together with a set of samples from it, is then used as the new prior for the next training task. This approach has two attractive properties. First, it does not require any prior knowledge of the model and thus is generally applicable to most problems. Second, it provides a way of breaking a long sequence $(\pi(x), p(\cdot|x), \mathcal{O}_M)$ with large $M$ into multiple tasks with shorter sequences,

$$\big(\pi(x),\, p(\cdot|x),\, o_{1:m}\big) \;\text{ and }\; \big(\hat{\pi}(x; \mathcal{X}_m),\, p(\cdot|x),\, o_{m+1:M}\big), \tag{19}$$

which makes the training procedure more efficient. This approach is particularly suited to sequential Bayesian inference and is used in our later experiments.

**Generative model approach.** Another possibility is to sample priors from a flexible generative model, such as a Dirichlet process Gaussian mixture model (Antoniak, 1974). We leave the experimental evaluation of this approach to future work.

### 3.2. Overall Learning Algorithm

Learning the PFBR operator $\mathcal{F}$ means learning the parameters $\theta$ of $f_\theta$ to minimize the loss over multiple tasks, $L(\mathcal{D}_{\text{train}}) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{\mathcal{T} \in \mathcal{D}_{\text{train}}} L(\mathcal{T})$, where $L(\mathcal{T})$ is the single-task loss defined in Eq. (17). Both the particle location $x_m^n$ and its density value $q_m(x_m^n)$ in $L(\mathcal{T})$ are results of forward propagation determined by $f_\theta$ according to the ODEs in Eq. (4) and Eq. (8). The training procedure is similar to other deep learning optimizations, except that an ODE technique called the adjoint method (Chen et al., 2018) is used to compute the gradient $dL/d\theta$ with very low memory cost. The overall algorithm, in a meta-learning framework, is summarized in Algorithm 1; a code sketch of the training loop follows below.

**Algorithm 1: Overall learning algorithm**
1. $\theta \leftarrow$ random initialization.
2. For itr = 1 to #iterations:
   1. Sample a task $\mathcal{T} = (\pi(x), p(\cdot|x), \mathcal{O}_M)$ from $\mathcal{D}_{\text{train}}$.
   2. Sample initial particles $\mathcal{X}_0 = \{x_0^n\}_{n=1}^N$ i.i.d. from $\pi(x)$.
   3. $\theta \leftarrow \theta - \eta\, \mathrm{Grad}(\theta, \mathcal{X}_0, \pi(x), p(\cdot|x), \mathcal{O}_M)$.
   4. If mod(itr, $k$) = 0, perform a validation step on $\mathcal{D}_{\text{vali}}$.
3. Return the best $\theta$ over validation steps.

*Remark.* See Appendix B for both the derivation and the algorithmic steps of Grad().
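Below is a schematic PyTorch training loop for Algorithm 1, assuming the `FlowVelocity` module sketched in Section 2.4, a fixed number of Euler steps in place of an adaptive ODE solver, and backpropagation through the solver rather than the paper's adjoint method. `sample_task` and `log_joint` are hypothetical stand-ins for problem-specific components.

```python
import torch

def divergence(v, x):
    """div_x v via autograd, one coordinate at a time (fine for small d)."""
    return sum(torch.autograd.grad(v[:, i].sum(), x, create_graph=True)[0][:, i]
               for i in range(x.shape[1]))

def transport(f_theta, X_m, obs, logq, n_steps=20, T=1.0):
    """Euler discretization of Eqs. (4) and (8) for one Bayes update."""
    dt = T / n_steps
    x = X_m
    for k in range(n_steps):
        # detach the set argument so the divergence is taken only w.r.t. x(t)
        v = f_theta(X_m.detach(), obs, x, k * dt)
        logq = logq - dt * divergence(v, x)    # Eq. (8)
        x = x + dt * v                         # Eq. (4)
    return x, logq

def train_step(f_theta, opt, sample_task, log_joint, N=256):
    """One iteration of Algorithm 1; backprop-through-solver variant
    (the paper instead uses the adjoint method of Chen et al. (2018))."""
    prior_sampler, prior_logpdf, observations = sample_task()
    x = prior_sampler(N).requires_grad_(True)  # X_0 ~ pi(x)
    logq = prior_logpdf(x)                     # log q_0(x^n)
    loss = 0.0
    for m, obs in enumerate(observations):
        x, logq = transport(f_theta, x, obs, logq)
        # Eq. (17): per-stage negative ELBO, log q_m(x) - log p(x, O_m)
        loss = loss + (logq - log_joint(x, observations[:m + 1])).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)
```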
### 3.3. Efficient Learning Tricks

We introduce two techniques to improve training efficiency for large-scale problems; their application to an experiment containing millions of data points is demonstrated in Section 4. The two techniques are mini-batch embedding and sequence segmentation.

The loss function $L(\mathcal{T})$ in Eq. (17) contains a summation from $m = 1$ to $M$ and also a hidden inner-loop summation

$$\log p(x_m^n, \mathcal{O}_m) = \log \pi(x_m^n) + \sum\nolimits_{t=1}^{m} \log p(o_t|x_m^n). \tag{20}$$

Thus, the evaluation cost of $L(\mathcal{T})$ is quadratic in the length $M$ of the observation sequence, so we need to reduce this length for large-scale problems.

**Mini-batch embedding.** Previously we treated $o_m$ as a single observation, but we can also view it as a batch of $L$ observations, $o_m = \{o_m^1, \dots, o_m^L\}$. The Bayesian update corresponding to this mini-batch becomes $p(x|\mathcal{O}_{m+1}) \propto p(x|\mathcal{O}_m) \prod_{l=1}^L p(o_m^l|x)$. If we rewrite $p(o_m|x) = \prod_{l=1}^L p(o_m^l|x)$, this is essentially the same as our previous expression. Therefore, we can replace $o_m$ by $\{o_m^l\}_{l=1}^L$ and feed these samples simultaneously, as a set, to the flow $f_\theta$. To reduce the input dimension, we resort to a set embedding $\frac{1}{L}\sum_{l=1}^L g(o_m^l)$, where $g$ is a neural nonlinear function to be learned. Depending on the structure of the model, one can define the embedding as a Deep Set (Zaheer et al., 2017), or, if the posterior is not invariant to the order of the observations (e.g., in hidden Markov models), one needs a set embedding that is not order invariant. In summary, for a mini-batch Bayes update the flow velocity is parameterized as

$$f = h\Big(\tfrac{1}{N}\textstyle\sum_{n=1}^N \phi(x_m^n),\; \tfrac{1}{L}\textstyle\sum_{l=1}^L g(o_{m+1}^l),\; x(t),\; t\Big).$$

**Sequence segmentation.** We use the approach in Eq. (19) to break a long sequence into short ones. More precisely, suppose we have particles $\{x_{m'}^n\}_{n=1}^N$ at the $m'$-th stage. We can cut the sequence at position $m'$ and generate a new task from the second half, whose prior is the empirical density estimate $\hat{\pi}(x; \mathcal{X}_{m'})$ defined in Eq. (18). Then, for all stages $m > m'$, the term in Eq. (20) becomes

$$\log p(x_m^n, \mathcal{O}_m) \approx \log \hat{\pi}(x_m^n; \mathcal{X}_{m'}) + \sum\nolimits_{t=m'+1}^{m} \log p(o_t|x_m^n).$$

We can apply this technique multiple times to split a long sequence into multiple segments.

## 4. Experiments

We conduct experiments on a multivariate Gaussian model, a Gaussian mixture model, a hidden Markov model, and Bayesian logistic regression to demonstrate the generalization ability of PFBR and its accuracy for posterior estimation.

**Evaluation metrics.** For the multivariate Gaussian model and the Gaussian linear dynamical system, we can compute the true posterior, so we evaluate: (i) cross-entropy $-\mathbb{E}_{x \sim p} \log q(x)$; (ii) maximum mean discrepancy (Gretton et al., 2012), $\mathrm{MMD}^2 = \mathbb{E}_{x, x' \sim p;\, y, y' \sim q}\big[k(x, x') - 2k(x, y) + k(y, y')\big]$; and (iii) integral evaluation discrepancy $\|\mathbb{E}_p h(x) - \mathbb{E}_q h(x)\|^2$. We use Monte Carlo estimates of $\mathbb{E}_{x \sim p}[\cdot]$ for the first two metrics; for the integral evaluation, we choose test functions $h$ for which $\mathbb{E}_p h(x)$ has a closed-form expression. For the experiments on a real-world dataset, where the posterior is intractable, we report the commonly used prediction accuracy. A sketch of an MMD estimator of this form is given below.
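The following is a minimal estimator of the squared MMD with an RBF kernel, as in metric (ii). This is our own sketch with an arbitrary bandwidth, since these implementation details are not specified in the main text.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X ~ p, Y ~ q."""
    return (rbf_kernel(X, X, bandwidth).mean()
            - 2 * rbf_kernel(X, Y, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean())

X = np.random.randn(500, 3)            # samples from the true posterior p
Y = np.random.randn(500, 3) + 0.2      # particles approximating q
print(mmd2(X, Y))                      # small value -> distributions are close
```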
**Multivariate Gaussian model.** The prior $x \sim \mathcal{N}(\mu_x, \Sigma_x)$, the observations conditioned on $x$, $o|x \sim \mathcal{N}(x, \Sigma_o)$, and the posterior all follow Gaussian distributions. In our experiment, we use $\mu_x = 0$, $\Sigma_x = I$ and $\Sigma_o = 3I$. We test the learned PFBR on sequences of 100 observations $\mathcal{O}_{100}$, while the training set contains only sequences of 10 observations, ten times shorter than those in the test set. Because we construct a diverse set of prior distributions to train the operator, the learned $\mathcal{F}$ generalizes to computing posteriors over these longer sequences. We compare against KBR (Fukumizu et al., 2012) and one-pass SMC (Balakrishnan & Madigan, 2006). Both PFBR and KBR are learned from the training set and then applied as algorithms on the test set, which consists of 25 sequences of 100 observations $\{\mathcal{O}_{100}^j \sim \mathcal{N}(x^j, \Sigma_o)\}_{j=1}^{25}$, conditioned on 25 different $x^j$ sampled from $\mathcal{N}(0, I)$. We compare the estimates of $p(x|\mathcal{O}_m^j)$ across stages $m = 1$ to $100$; the results are plotted in Fig. 2. Since KBR performs less competitively in this setting, we leave its results to Appendix D. Fig. 2 shows that as the dimension of the model increases, PFBR gains a larger advantage over one-pass SMC.

*Figure 2: Experimental results for the Gaussian model: (a) cross entropy $-\mathbb{E}_{p(x|\mathcal{O}_m)} \log q_m$ and (b) squared MMD with RBF kernel, each for dimensions 3, 5 and 8. We use 256 transported particles as samples from $p(x|\mathcal{O}_m)$ and compare with the true posteriors. Each evaluation is computed over 25 tasks and the shaded area shows the standard error. More results in Appendix D.*

**Gaussian mixture model.** Following the same setting as Welling & Teh (2011), we conduct an experiment on an interesting mixture model, where the observations are drawn as $o \sim \frac{1}{2}\mathcal{N}(x_1, 1.0) + \frac{1}{2}\mathcal{N}(x_1 + x_2, 1.0)$ and the prior is $x_1, x_2 \sim \mathcal{N}(0, 1)$. As in Dai et al. (2016), we set $(x_1, x_2) = (1, -2)$ so that the resulting posterior has two modes, $(1, -2)$ and $(-1, 2)$. During training, multiple sequences are generated, each consisting of 30 observations. Compared to Welling & Teh (2011) and Dai et al. (2016), our experiment is more challenging. First, they fit one posterior for one sequence of observations via optimization, while PFBR operates on sequences NOT observed during training, without re-optimization. Second, they estimate only the final posterior, while we aim to fit every intermediate posterior. Even for a single fixed sequence, it is not easy to fit the posterior and capture both modes, as can be seen from the results reported by Dai et al. (2016). Nevertheless, our learned PFBR operates on test sequences and produces posteriors close to the true posterior (see Fig. 3).

*Figure 3: Visualization of the evolution of the posterior density from left to right. In the end, the PFBR density is closer to the true density than the SMC density.*

**Hidden Markov model (LDS).** Our PFBR method can easily be adapted to hidden Markov models, where we estimate the marginal posterior distribution $p(x_m|\mathcal{O}_m)$. For instance, consider a Gaussian linear dynamical system (LDS): $x_m = A x_{m-1} + \epsilon_m$, $o_m = B x_m + \delta_m$, where $\epsilon_m \sim \mathcal{N}(0, \Sigma_1)$ and $\delta_m \sim \mathcal{N}(0, \Sigma_2)$. The particles can be updated by recursively applying two steps: (i) $\tilde{x}_m^n = A x_{m-1}^n + \epsilon_m$; (ii) $x_m^n = \mathcal{F}(\tilde{\mathcal{X}}_m, \tilde{x}_m^n, o_m)$, where $\tilde{\mathcal{X}}_m := \{\tilde{x}_m^n\}_{n=1}^N$. The second step is a Bayes update from $p(x_m|\mathcal{O}_{m-1})$ to $p(x_m|\mathcal{O}_m)$ given the likelihood $p(o_m|x_m)$; a sketch of this two-step filter appears below. The only tricky part is that the loss function is not directly tractable here, because of the integration $p(x_m|\mathcal{O}_{m-1}) = \int p(x_m|x_{m-1})\, p(x_{m-1}|\mathcal{O}_{m-1})\, dx_{m-1}$. Hence, at each stage $m$ we use the particles $\tilde{\mathcal{X}}_m$ to construct an empirical density estimate $\hat{\pi}(x; \tilde{\mathcal{X}}_m)$ as defined in Eq. (18), and define the per-stage loss as $\sum_{n=1}^N \big(\log q_m(x_m^n) - \log p(o_m|x_m^n) - \log \hat{\pi}(x_m^n; \tilde{\mathcal{X}}_m)\big)$, replacing the intractable density $p(x_m|\mathcal{O}_{m-1})$ by $\hat{\pi}$. Given this loss, the PFBR operator $\mathcal{F}$ can be learned.

In the experiment, we sample a pair of 2-dimensional random matrices $A$ and $B$ for the LDS model. We learn both PFBR and KBR from a training set containing multiple different observation sequences. For KBR, we use an adapted version called the Kernel Monte Carlo Filter (Kanagawa et al., 2013), which is designed for state-space models. We use the Kalman filter (Welch & Bishop, 2006) to compute the true marginal posterior for evaluation. Fig. 4 compares our method with KBR and SMC (importance sampling with resampling) on a test set containing 25 new sequences of ordered observations. The learned PFBR generalizes to test sequences and achieves better and more stable performance.
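A sketch of the two-step PFBR filter for the LDS, assuming a trained operator `F` with the interface of Eq. (3); `F`, `chol_Sigma1` and the rest are hypothetical names for illustration.

```python
import numpy as np

def pfbr_filter(F, A, chol_Sigma1, observations, particles):
    """Two-step PFBR filter for the LDS x_m = A x_{m-1} + eps_m.
    F is a (hypothetical) trained operator with the Eq. (3) interface."""
    means = []
    for o in observations:
        eps = np.random.randn(*particles.shape) @ chol_Sigma1.T
        particles = particles @ A.T + eps      # (i) predict: p(x_m | O_{m-1})
        particles = F(particles, o)            # (ii) Bayes update: p(x_m | O_m)
        means.append(particles.mean(axis=0))   # marginal posterior mean estimate
    return np.array(means)
```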
*Figure 4: Experimental results for the LDS, evaluated on the test set: (a) cross entropy, (b) squared MMD with RBF kernel, and (c) integral estimation error $|\mathbb{E}_p[x^\top A x] - \mathbb{E}_q[x^\top A x]|$ on the test function. Left panels: estimation errors at every stage $m \in [25]$. Right panels: estimation errors for different particle sizes (64 to 1024), first averaged over 25 stages for each task and then over 25 tasks; error bars show the standard error over tasks. We use the same PFBR operator trained with 1024 particles, even though during the testing phase we apply it to particle sets of different sizes. See Appendix D for more results.*

**Comparison to variational SMC.** Auto-encoding SMC (AESMC), Filtering Variational Objectives, and Variational SMC are three recent variational inference approaches that approximate the posterior based on SMC and a neural proposal (Le et al., 2018; Maddison et al., 2017; Naesseth et al., 2018). Since they share similar ideas, we implemented AESMC as a representative, trying both the MLP and GRU variants described in those papers. A comparison on a 10-dimensional LDS (Table 1) shows that PFBR performs better even with far fewer particles.

**Inference time comparison.** Table 1 also reports the inference time for updating the posterior given one new observation. Although PFBR takes more time for the same number of particles (e.g., 256), the other methods need many more particles (e.g., 4096) to approach PFBR's performance.

| Algorithm | #particles | CPU time (s) | GPU time (s) | Cross-entropy |
|---|---|---|---|---|
| PFBR | 256 | 0.23 | 0.26 | 16.56 |
| SMC | 256 | 0.07 | 0.02 | 26.78 |
| AESMC-mlp | 256 | 0.17 | 0.07 | 19.66 |
| AESMC-gru | 256 | 0.18 | 0.07 | 19.38 |
| AESMC-mlp | 4096 | 2.23 | 0.25 | 17.63 |
| AESMC-gru | 4096 | 2.26 | 0.26 | 17.24 |
| SMC | 8192 | 3.87 | 0.12 | 17.60 |

*Table 1: Numbers are averaged over 25 sequences with 25 observations each. For PFBR, GPU is faster when the number of particles is larger.*

**Bayesian logistic regression (BLR).** We consider logistic regression for digit classification on the MNIST8M 8 vs. 6 dataset, which contains about 1.6M training samples and 1932 test samples. We reduce the dimension of the images to 50 by PCA, following the same setting as Dai et al. (2016). Two experiments are conducted on this dataset. For both, we compare our method with SMC, SVI (stochastic variational inference; Hoffman et al., 2012), PMD (particle mirror descent; Dai et al., 2016), SGD Langevin (stochastic gradient Langevin dynamics; Welling & Teh, 2011) and SGD NPV (a stochastic version of nonparametric variational inference; Gershman et al., 2012). Since this is a large dataset, we use the techniques of Section 3.3 to improve training efficiency.

**BLR meta-learning.** In the first experiment, we create a multi-task environment by rotating the first and second PCA components of the features through an angle $\psi$ sampled uniformly from $[-15°, 15°]$; note that the first two components account for the most variability in the data. A different rotation angle $\psi$ changes the classification boundary and thus creates a different classification task. Likewise, a different sequence of image samples induces different posterior distributions and thus corresponds to a different inference task. A sketch of this task-creation step is given below.
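A minimal sketch of the rotation-based task creation, assuming features already projected onto 50 PCA components; the function name, stand-in data, and sequencing are our own illustration of the described procedure.

```python
import numpy as np

def make_rotation_task(features, labels, rng):
    """Create one BLR inference task by rotating the first two PCA
    components through a random angle psi in [-15, 15] degrees."""
    psi = np.deg2rad(rng.uniform(-15.0, 15.0))
    R = np.array([[np.cos(psi), -np.sin(psi)],
                  [np.sin(psi),  np.cos(psi)]])
    rotated = features.copy()
    rotated[:, :2] = features[:, :2] @ R.T      # rotate components 1 and 2
    order = rng.permutation(len(features))      # a new observation sequence
    return rotated[order], labels[order]

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 50))          # stand-in for PCA features
labels = rng.integers(0, 2, size=1000)          # stand-in for 8-vs-6 labels
task_X, task_y = make_rotation_task(features, labels, rng)
```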
We learn the operator $\mathcal{F}$ from a set of training tasks, each corresponding to a different rotation angle $\psi$. After that, we use $\mathcal{F}$ as a Bayes rule on test tasks and compare its performance with other stochastic and sampling methods. Testing is done in an online fashion: all algorithms start with a set of particles sampled from the prior (hence the prediction accuracy at step 0 is around 0.5). Each algorithm makes a prediction on the encountered batch of 32 images and then observes their true labels; it then updates its particles and predicts on the next batch. Ideally we would compare the posterior estimates directly, but since the posterior is intractable, we evaluate the average prediction accuracy at each stage. Results are shown in Fig. 5.

Note that we conducted a sanity check to confirm that the learned operator $\mathcal{F}$ does not ignore the first two (rotated) dimensions and rely only on the remaining 48 components to make predictions: if we zero out the first two components of the data and learn $\mathcal{F}$ on them, the accuracy of the particles drops to around 65%. This further verifies that the learned PFBR can indeed generalize across different tasks.

*Figure 5: Bayesian logistic regression on MNIST. Left: the average online prediction accuracy $\frac{1}{m}\sum_{t=1}^m r_t$, where $r_t$ is the accuracy on the $t$-th batch of images; the shaded area shows the standard deviation over 10 test tasks. Right: examples where the random initialization is farther from the posterior and gives a worse initial prediction; PFBR updates those particles to gradually achieve higher accuracy.*

**BLR variational inference.** In the second experiment on MNIST, we use PFBR as a variational inference method. That is, instead of accurately learning a generalizable Bayes operator, what matters most is the estimate of the posterior $p(x|\mathcal{O}_{\text{train}})$. Thus, here we do not use the loss function in Eq. (17), which sums over all intermediate stages, but emphasize the final error $\mathrm{KL}\big(q(x)\,\|\,p(x|\mathcal{O}_{\text{train}})\big)$. After training is finished, we only use the transported particles to perform classification on the test set and do not further use the operator $\mathcal{F}$. The result is shown in Fig. 6, where the x-axis is the number of samples visited during training. Since we use a batch size of 128 and consider 10 stages, the first gradient step of our method happens after around $10^3$ samples have been visited.

*Figure 6: PFBR as a variational inference method (test error vs. number of samples visited during training). The prediction accuracy of PFBR is comparable with state-of-the-art variational inference and sampling methods.*

## 5. Conclusion and Future Work

In this paper, we have explored the possibility of learning an ODE-based Bayes operator that performs online Bayesian inference in the testing phase, and verified its generalization ability through both synthetic and real-world experiments. Further investigation of the parameterization of the flow velocity $f$ (e.g., using a stable neural architecture (Haber & Ruthotto, 2018) with a HyperNetwork (Ha et al., 2017)) and of generating diverse prior distributions through a Dirichlet process may yield a better solution to this challenging problem.
## Acknowledgements

This project was supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF IIS-1841351 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA, Google and Amazon AWS.

## References

Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981-3989, 2016.

Antoniak, C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152-1174, 1974.

Balakrishnan, S. and Madigan, D. A one-pass sequential Monte Carlo method for Bayesian analysis of massive datasets. Bayesian Analysis, 1(2):345-361, 2006.

Batchelor, G. Kinematics of the Flow Field, pp. 71-130. Cambridge Mathematical Library. Cambridge University Press, 2000.

Canini, K. R., Shi, L., and Griffiths, T. L. Online inference of topics with latent Dirichlet allocation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pp. 6572-6583. Curran Associates, Inc., 2018.

Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. SMC²: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397-426, 2013.

Dai, B., He, N., Dai, H., and Song, L. Provable Bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 985-994, 2016.

Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.

Daum, F. and Huang, J. Curse of dimensionality and particle filters. In 2003 IEEE Aerospace Conference Proceedings, volume 4, 2003.

Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

Dreyfus, S. Some types of optimal control of stochastic systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 2(1):120-134, 1964. doi:10.1137/0302010.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research (JMLR), 2012.

Gershman, S., Hoffman, M., and Blei, D. M. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 663-670, New York, NY, USA, 2012. ACM.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., and Smola, A. A kernel two-sample test. JMLR, 13:723-773, 2012.

Ha, D., Dai, A., and Le, Q. V. HyperNetworks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2018.
Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. In International Conference on Machine Learning, 2012.

Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1-17, 1998.

Kanagawa, M., Nishiyama, Y., Gretton, A., and Fukumizu, K. Filtering with state-observation examples via kernel Monte Carlo filter. arXiv e-prints, December 2013.

Le, T. A., Igl, M., Rainforth, T., Jin, T., and Wood, F. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.

Lei, N., Su, K., Cui, L., Yau, S.-T., and Gu, D. X. A geometric view of optimal transportation and generative model. arXiv preprint arXiv:1710.05488, October 2017.

Maddison, C. J., Lawson, J., Tucker, G., Heess, N., Norouzi, M., Mnih, A., Doucet, A., and Teh, Y. Filtering variational objectives. In Advances in Neural Information Processing Systems, pp. 6573-6583, 2017.

Naesseth, C., Linderman, S., Ranganath, R., and Blei, D. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics, 2018.

Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pp. 13-31. Springer, 2007.

Snyder, C., Bengtsson, T., Bickel, P., and Anderson, J. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629-4640, 2008.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.

Welch, G. and Bishop, G. An introduction to the Kalman filter. Technical Report TR-95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2006.

Welling, M. and Teh, Y.-W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML), pp. 681-688, 2011.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391-3401, 2017.

Zhang, L., E, W., and Wang, L. Monge-Ampère flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.