# Kernel Bayesian Inference with Posterior Regularization

Yang Song, Jun Zhu, Yong Ren

Dept. of Physics, Tsinghua University, Beijing, China
Dept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research; State Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China

yangsong@cs.stanford.edu; {dcszj@, renyong15@mails}.tsinghua.edu.cn

We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster than and has comparable performance to the squared regularization in the kernel Bayes' rule. This regularization coincides with a former thresholding approach used in kernel POMDPs whose consistency was left unestablished. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimizational formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.

## 1 Introduction

Kernel methods have long been effective in generalizing linear statistical approaches to nonlinear cases by embedding a sample into a reproducing kernel Hilbert space (RKHS) [1]. In recent years, the idea has been generalized to embedding probability distributions [2, 3]. Such embeddings of probability measures are usually called kernel embeddings (a.k.a. kernel means). Moreover, [4, 5, 6] show that statistical operations on distributions can be realized in an RKHS by manipulating kernel embeddings via linear operators. This approach has been applied to various statistical inference and learning problems, including training hidden Markov models (HMM) [7], belief propagation (BP) in tree graphical models [8], planning in Markov decision processes (MDP) [9] and partially observed Markov decision processes (POMDP) [10].

One of the key workhorses in the above applications is the kernel Bayes' rule [5], which establishes the relation among the RKHS representations of the priors, likelihood functions and posterior distributions. Despite its empirical success, the characterization of the kernel Bayes' rule remains largely incomplete. For example, it is unclear how the estimators of the posterior distribution embeddings relate to optimizers of some loss functions, though the vanilla Bayes' rule has a nice connection [11]. This makes generalizing the results especially difficult and hinders an intuitive understanding of the kernel Bayes' rule. To alleviate this weakness, we propose a vector-valued regression [12] problem whose optimizer is the posterior distribution embedding. This new formulation is inspired by progress in two fields: 1) the alternative characterization of conditional embeddings as regressors [13], and 2) the introduction of posterior regularized Bayesian inference (RegBayes) [14] based on an optimizational reformulation of Bayes' rule. We demonstrate the novelty of our formulation by providing a new understanding of kernel Bayesian inference, with theoretical, algorithmic and practical implications.
On the theoretical side, we are able to prove the (weak) consistency of the estimator obtained by solving the vector-valued regression problem under reasonable assumptions. As a side product, our proof can be applied to a thresholding technique used in [10], whose consistency was left as an open problem. On the algorithmic side, we propose a new regularization technique, which is shown to run faster than, and to have comparable accuracy to, the squared regularization used in the original kernel Bayes' rule [5]. Similar in spirit to RegBayes, we are also able to derive an extended version of the embeddings by directly imposing regularization on the posterior distributions. We call this new framework kRegBayes. Thanks to RKHS embeddings of distributions, this is, to the best of our knowledge, the first time that posterior regularization can be imposed without invoking linear functionals (such as moments) of the random variables. On the practical side, we demonstrate the efficacy of our methods on both simple and complicated synthetic state-space filtering datasets.

As with other algorithms based on kernel embeddings, our kernel regularized Bayesian inference framework is nonparametric and general. The algorithm is nonparametric because the priors, posterior distributions and likelihood functions are all characterized by weighted sums of data samples. Hence it does not need an explicit mechanism, such as the differential equations of a robot arm in filtering tasks. It is general in the sense of being applicable to a broad variety of domains as long as kernels can be defined, such as strings, orthonormal matrices, permutations and graphs.

## 2 Preliminaries

### 2.1 Kernel embeddings

Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space of random variables, $p_X$ be the associated probability measure and $\mathcal{H}_{\mathcal{X}}$ be an RKHS with kernel $k_{\mathcal{X}}(\cdot,\cdot)$. We define the kernel embedding of $p_X$ to be $\mu_X = \mathbb{E}_{p_X}[\phi(X)] \in \mathcal{H}_{\mathcal{X}}$, where $\phi(X) = k_{\mathcal{X}}(X,\cdot)$ is the feature map. Such a vector-valued expectation always exists if the kernel is bounded, namely $\sup_x k_{\mathcal{X}}(x,x) < \infty$.

The concept of kernel embeddings has several important statistical merits. Owing to the reproducing property, the expectation of $f \in \mathcal{H}_{\mathcal{X}}$ w.r.t. $p_X$ can be easily computed as $\mathbb{E}_{p_X}[f(X)] = \mathbb{E}_{p_X}[\langle f, \phi(X)\rangle] = \langle f, \mu_X\rangle$. There exist universal kernels [15] whose corresponding RKHS $\mathcal{H}$ is dense in $C_{\mathcal{X}}$ with respect to the sup norm. This means $\mathcal{H}$ contains a rich range of functions $f$, and their expectations can be computed by inner products without invoking usually intractable integrals. In addition, the inner product structure of the embedding space $\mathcal{H}$ provides a natural way to measure differences between distributions through norms.

In much the same way we can define kernel embeddings of linear operators. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ and $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ be two measurable spaces, $\phi(x)$ and $\psi(y)$ be the measurable feature maps of the corresponding RKHSs $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ with bounded kernels, and $p$ denote the joint distribution of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$. The covariance operator $C_{XY}$ is defined as $C_{XY} = \mathbb{E}_p[\phi(X) \otimes \psi(Y)]$, where $\otimes$ denotes the tensor product. Note that it is possible to identify $C_{XY}$ with $\mu_{(XY)}$ in $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ with the kernel function $k((x_1,y_1),(x_2,y_2)) = k_{\mathcal{X}}(x_1,x_2)\,k_{\mathcal{Y}}(y_1,y_2)$ [16].

There is an important relation between kernel embeddings of distributions and covariance operators, which is fundamental for the sequel:

Theorem 1 ([4, 5]). Let $\mu_X$, $\mu_Y$ be the kernel embeddings of $p_X$ and $p_Y$ respectively. If $C_{XX}$ is injective, $\mu_X \in \mathcal{R}(C_{XX})$ and $\mathbb{E}[g(Y) \mid X = \cdot\,] \in \mathcal{H}_{\mathcal{X}}$ for all $g \in \mathcal{H}_{\mathcal{Y}}$, then
$$\mu_Y = C_{YX} C_{XX}^{-1} \mu_X. \qquad (1)$$
In addition, $\mu_{Y|X=x} = \mathbb{E}[\psi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \phi(x)$.

On the implementation side, we need to estimate these kernel embeddings from samples. An intuitive estimator of the embedding $\mu_X$ is $\hat{\mu}_X = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$, where $\{x_i\}_{i=1}^{N}$ is a sample from $p_X$. Similarly, the covariance operators can be estimated by $\hat{C}_{XY} = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)\otimes\psi(y_i)$. Both estimators converge in the RKHS norm at a rate of $O_p(N^{-1/2})$ [4, 5].
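These estimators are simply weighted sums of feature maps, so all quantities of interest reduce to Gram-matrix arithmetic. The following minimal NumPy sketch (ours, not the paper's code) illustrates this with a Gaussian RBF kernel; the helper `rbf_gram`, the toy sample and all sizes are our own choices for illustration.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gaussian RBF Gram matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # toy sample {x_i} from p_X
w = np.full(len(X), 1.0 / len(X))          # uniform weights: mu_hat_X = sum_i w_i phi(x_i)

# Expectation of f = sum_j c_j k(z_j, .) under p_X via <f, mu_hat_X> = c^T K(Z, X) w.
Z = rng.normal(size=(3, 2))
c = np.array([0.5, -1.0, 2.0])
print(c @ rbf_gram(Z, X) @ w)

# The empirical covariance operator C_hat_XY = (1/N) sum_i phi(x_i) (x) psi(y_i) never
# needs to be formed explicitly: applied to g in H_Y it yields (1/N) sum_i g(y_i) phi(x_i),
# i.e. another weighted sum of feature maps.
```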
### 2.2 Kernel Bayes' rule

Let $\pi(Y)$ be the prior distribution of a random variable $Y$, $p(X = x \mid Y)$ be the likelihood, $p^{\pi}(Y \mid X = x)$ be the posterior distribution given $\pi(Y)$ and observation $x$, and $p^{\pi}(X, Y)$ be the joint distribution incorporating $\pi(Y)$ and $p(X \mid Y)$. Kernel Bayesian inference aims to obtain the posterior embedding $\mu^{\pi}_Y(X = x)$ given a prior embedding $\pi_Y$ and a covariance operator $C_{XY}$. By Bayes' rule, $p^{\pi}(Y \mid X = x) \propto \pi(Y)\,p(X = x \mid Y)$.

We assume that there exists a joint distribution $p$ on $\mathcal{X} \times \mathcal{Y}$ whose conditional distribution matches $p(X \mid Y)$, and let $C_{XY}$ be its covariance operator. Note that we do not require $p = p^{\pi}$, hence $p$ can be any convenient distribution. According to Thm. 1, $\mu^{\pi}_Y(X = x) = C^{\pi}_{YX}\,(C^{\pi}_{XX})^{-1}\phi(x)$, where $C^{\pi}_{YX}$ corresponds to the joint distribution $p^{\pi}$ and $C^{\pi}_{XX}$ to the marginal distribution of $p^{\pi}$ on $\mathcal{X}$. Recalling that $C^{\pi}_{YX}$ can be identified with $\mu_{(YX)}$ in $\mathcal{H}_{\mathcal{Y}} \otimes \mathcal{H}_{\mathcal{X}}$, we can apply Thm. 1 to obtain $\mu_{(YX)} = C_{(YX)Y} C_{YY}^{-1}\pi_Y$, where $C_{(YX)Y} := \mathbb{E}[\psi(Y)\otimes\phi(X)\otimes\psi(Y)]$. Similarly, $C^{\pi}_{XX}$ can be represented as $\mu_{(XX)} = C_{(XX)Y} C_{YY}^{-1}\pi_Y$. This way of computing posterior embeddings is called the kernel Bayes' rule [5]. Given estimators of the prior embedding $\hat{\pi}_Y = \sum_{i=1}^{m}\alpha_i\psi(\tilde{y}_i)$ and of the covariance operator $\hat{C}_{YX}$, the posterior embedding can be obtained via
$$\hat{\mu}^{\pi}_Y(X = x) = \hat{C}^{\pi}_{YX}\big([\hat{C}^{\pi}_{XX}]^2 + \lambda I\big)^{-1}\hat{C}^{\pi}_{XX}\,\phi(x),$$
where squared regularization is added to the inversion. Note that the regularization for $\hat{\mu}^{\pi}_Y(X = x)$ is not unique. A thresholding alternative was proposed in [10] without establishing its consistency. We discuss this thresholding regularization from a different perspective and give consistency results in the sequel.

### 2.3 Regularized Bayesian inference

Regularized Bayesian inference (RegBayes [14]) is based on a variational formulation of Bayes' rule [11]. The posterior distribution can be viewed as the solution of
$$\min_{p(Y|X=x)}\ \mathrm{KL}\big(p(Y|X=x)\,\|\,\pi(Y)\big) - \int \log p(X=x|Y)\,\mathrm{d}p(Y|X=x),$$
subject to $p(Y|X=x) \in \mathcal{P}_{\mathrm{prob}}$, where $\mathcal{P}_{\mathrm{prob}}$ is the set of valid probability measures. RegBayes combines this formulation with posterior regularization [17] in the following way:
$$\min_{p(Y|X=x),\,\xi}\ \mathrm{KL}\big(p(Y|X=x)\,\|\,\pi(Y)\big) - \int \log p(X=x|Y)\,\mathrm{d}p(Y|X=x) + U(\xi) \quad \text{s.t.}\quad p(Y|X=x) \in \mathcal{P}_{\mathrm{prob}}(\xi),$$
where $\mathcal{P}_{\mathrm{prob}}(\xi)$ is a subset depending on $\xi$ and $U(\xi)$ is a loss function. Such a formulation makes it possible to regularize Bayesian posterior distributions, bridging the gap between Bayesian generative models and discriminative models. Related applications include max-margin topic models [18] and infinite latent SVMs [14].

Despite the flexibility of RegBayes, regularization on the posterior distribution is in practice imposed only indirectly, via expectations of a function. We shall see in the sequel that our new framework of kernel regularized Bayesian inference can control the posterior distribution in a direct way.

### 2.4 Vector-valued regression

The main task of vector-valued regression [12] is to minimize the objective
$$\sum_{j=1}^{n} \|y_j - f(x_j)\|^2_{\mathcal{H}_{\mathcal{Y}}} + \lambda\,\|f\|^2_{\mathcal{H}_K},$$
where $y_j \in \mathcal{H}_{\mathcal{Y}}$ and $f : \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$. Note that $f$ is a function with RKHS values, and we assume that $f$ belongs to a vector-valued RKHS $\mathcal{H}_K$. In a vector-valued RKHS, the kernel function $k$ is generalized to a linear operator $K(x_1, x_2) \in \mathcal{L}(\mathcal{H}_{\mathcal{Y}})$, $K(x_1, x_2): \mathcal{H}_{\mathcal{Y}} \to \mathcal{H}_{\mathcal{Y}}$, such that $K(x_1, x_2)y := (K_{x_2}y)(x_1)$ for every $x_1, x_2 \in \mathcal{X}$ and $y \in \mathcal{H}_{\mathcal{Y}}$, where $K_{x_2}y \in \mathcal{H}_K$. The reproducing property is generalized to $\langle y, f(x)\rangle_{\mathcal{H}_{\mathcal{Y}}} = \langle K_x y, f\rangle_{\mathcal{H}_K}$ for every $y \in \mathcal{H}_{\mathcal{Y}}$, $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. In addition, [12] shows that the representer theorem still holds for vector-valued RKHSs.
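For the special choice $K(x_1, x_2) = k_{\mathcal{X}}(x_1, x_2)\,I$ used later in Prop. 2, the vector-valued problem decouples into one scalar kernel ridge regression per output coordinate. The sketch below (ours, not the paper's code) illustrates this with finite-dimensional targets standing in for elements of $\mathcal{H}_{\mathcal{Y}}$; the helper `vv_krr_fit`, the toy data and all parameter values are hypothetical.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def vv_krr_fit(X, Y, lam=1e-2, sigma=0.3):
    """Vector-valued kernel ridge regression with the operator-valued kernel
    K(x1, x2) = k(x1, x2) * I.  By the representer theorem the minimizer of
    sum_j ||y_j - f(x_j)||^2 + lam ||f||^2_{H_K} is f(x) = Y^T (K_X + lam I)^{-1} k_{:x},
    i.e. one scalar kernel ridge regression per output coordinate."""
    coef = np.linalg.solve(rbf_gram(X, X, sigma) + lam * np.eye(len(X)), Y)
    return lambda Xq: rbf_gram(Xq, X, sigma) @ coef

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.hstack([np.sin(3 * X), np.cos(3 * X)])   # R^2-valued targets standing in for H_Y
f = vv_krr_fit(X, Y)
print(f(np.array([[0.5]])))                      # approximately [sin(1.5), cos(1.5)]
```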
## 3 Kernel Bayesian inference as a regression problem

One of the unique merits of the posterior embedding $\mu^{\pi}_Y(X = x)$ is that expectations w.r.t. the posterior distribution can be computed via inner products, i.e., $\langle h, \mu^{\pi}_Y(X = x)\rangle = \mathbb{E}_{p^{\pi}(Y|X=x)}[h(Y)]$ for all $h \in \mathcal{H}_{\mathcal{Y}}$. Since $\mu^{\pi}_Y(X = x) \in \mathcal{H}_{\mathcal{Y}}$, $\mu^{\pi}_Y$ can be viewed as an element of a vector-valued RKHS $\mathcal{H}_K$ containing functions $f: \mathcal{X} \to \mathcal{H}_{\mathcal{Y}}$. A natural optimization objective [13] thus follows from the above observations:
$$\mathcal{E}[\mu] := \sup_{\|h\|_{\mathcal{H}_{\mathcal{Y}}} \le 1} \mathbb{E}_X\Big[\big(\mathbb{E}_Y[h(Y)\,|\,X] - \langle h, \mu(X)\rangle_{\mathcal{H}_{\mathcal{Y}}}\big)^2\Big], \qquad (2)$$
where $\mathbb{E}_X[\cdot]$ denotes the expectation w.r.t. $p^{\pi}(X)$ and $\mathbb{E}_Y[\cdot\,|\,X]$ denotes the expectation w.r.t. the Bayesian posterior distribution, i.e., $p^{\pi}(Y\,|\,X) \propto \pi(Y)\,p(X\,|\,Y)$. Clearly, $\mu^{\pi}_Y = \arg\inf_{\mu}\mathcal{E}[\mu]$. Following [13], we introduce an upper bound $\mathcal{E}_s$ of $\mathcal{E}$ by applying Jensen's and Cauchy–Schwarz's inequalities consecutively:
$$\mathcal{E}_s[\mu] := \mathbb{E}_{(X,Y)}\big[\|\psi(Y) - \mu(X)\|^2_{\mathcal{H}_{\mathcal{Y}}}\big], \qquad (3)$$
where $(X, Y)$ is the random variable on $\mathcal{X} \times \mathcal{Y}$ with the joint distribution $p^{\pi}(X, Y) = \pi(Y)\,p(X\,|\,Y)$. The first step toward making this optimizational framework practical is to find finite-sample estimators of $\mathcal{E}_s[\mu]$. We show how to do this in the following section.

### 3.1 A consistent estimator of $\mathcal{E}_s[\mu]$

Unlike the conditional embeddings in [13], we do not have i.i.d. samples from the joint distribution $p^{\pi}(X, Y)$, as the priors and likelihood functions are represented by samples from different distributions. We eliminate this problem using a kernel trick, which is one of the main innovations of this paper. The idea is to use the inner-product property of a kernel embedding $\mu_{(X,Y)}$ to represent the expectation $\mathbb{E}_{(X,Y)}[\|\psi(Y) - \mu(X)\|^2_{\mathcal{H}_{\mathcal{Y}}}]$ and then use finite-sample estimators of $\mu_{(X,Y)}$ to estimate $\mathcal{E}_s[\mu]$.

Recall that we can identify $C_{XY} := \mathbb{E}_{XY}[\phi(X)\otimes\psi(Y)]$ with $\mu_{(X,Y)}$ in the product space $\mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$ with the product kernel $k_{\mathcal{X}} k_{\mathcal{Y}}$ on $\mathcal{X} \times \mathcal{Y}$ [16]. Let $f(x, y) = \|\psi(y) - \mu(x)\|^2_{\mathcal{H}_{\mathcal{Y}}}$ and assume that $f \in \mathcal{H}_{\mathcal{X}} \otimes \mathcal{H}_{\mathcal{Y}}$. The optimization objective $\mathcal{E}_s[\mu]$ can then be written as
$$\mathcal{E}_s[\mu] = \mathbb{E}_{(X,Y)}\big[\|\psi(Y) - \mu(X)\|^2_{\mathcal{H}_{\mathcal{Y}}}\big] = \langle f, \mu_{(X,Y)}\rangle_{\mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}}. \qquad (4)$$
From Thm. 1 we have $\mu_{(X,Y)} = C_{(X,Y)Y} C_{YY}^{-1}\pi_Y$, and a natural estimator follows as $\hat{\mu}_{(X,Y)} = \hat{C}_{(X,Y)Y}(\hat{C}_{YY} + \lambda I)^{-1}\hat{\pi}_Y$. As a result, $\hat{\mathcal{E}}_s[\mu] := \langle \hat{\mu}_{(X,Y)}, f\rangle_{\mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}}$, and we introduce the following proposition to write $\hat{\mathcal{E}}_s$ in terms of Gram matrices.

Proposition 1 (Proof in Appendix). Suppose $(X, Y)$ is a random variable on $\mathcal{X} \times \mathcal{Y}$, where the prior of $Y$ is $\pi(Y)$ and the likelihood is $p(X\,|\,Y)$. Let $\mathcal{H}_{\mathcal{X}}$ be an RKHS with kernel $k_{\mathcal{X}}$ and feature map $\phi(x)$, $\mathcal{H}_{\mathcal{Y}}$ be an RKHS with kernel $k_{\mathcal{Y}}$ and feature map $\psi(y)$, and $\phi(x, y)$ be the feature map of $\mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}$. Let $\hat{\pi}_Y = \sum_{i=1}^{l}\alpha_i\psi(\tilde{y}_i)$ be a consistent estimator of $\pi_Y$ and $\{(x_i, y_i)\}_{i=1}^{n}$ be a sample representing $p(X\,|\,Y)$. Under the assumption that $f(x, y) = \|\psi(y) - \mu(x)\|^2_{\mathcal{H}_{\mathcal{Y}}} \in \mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}$, we have
$$\hat{\mathcal{E}}_s[\mu] = \sum_{i=1}^{n}\beta_i\,\|\psi(y_i) - \mu(x_i)\|^2_{\mathcal{H}_{\mathcal{Y}}}, \qquad (5)$$
where $\beta = (\beta_1, \dots, \beta_n)^{\top}$ is given by $\beta = (G_Y + n\lambda I)^{-1}\widetilde{G}_Y\alpha$, with $(G_Y)_{ij} = k_{\mathcal{Y}}(y_i, y_j)$, $(\widetilde{G}_Y)_{ij} = k_{\mathcal{Y}}(y_i, \tilde{y}_j)$ and $\alpha = (\alpha_1, \dots, \alpha_l)^{\top}$.
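Prop. 1 reduces the estimation of $\hat{\mathcal{E}}_s$ to a single linear solve. Below is a hedged NumPy sketch of the weight computation $\beta = (G_Y + n\lambda I)^{-1}\widetilde{G}_Y\alpha$; the kernel choice, the toy samples and the helper names (`rbf_gram`, `prop1_weights`) are ours, not the paper's.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def prop1_weights(Y_lik, Y_prior, alpha, lam=1e-3, sigma=1.0):
    """beta = (G_Y + n*lam*I)^{-1} G~_Y alpha, with (G_Y)_ij = k(y_i, y_j) on the
    likelihood sample and (G~_Y)_ij = k(y_i, y~_j) coupling it to the prior sample."""
    n = len(Y_lik)
    G_Y = rbf_gram(Y_lik, Y_lik, sigma)
    G_tilde = rbf_gram(Y_lik, Y_prior, sigma)
    return np.linalg.solve(G_Y + n * lam * np.eye(n), G_tilde @ alpha)

rng = np.random.default_rng(0)
Y_lik = rng.normal(size=(100, 1))             # {y_i} from the sample representing p(X | Y)
Y_prior = rng.normal(0.5, 1.0, size=(50, 1))  # {y~_j} representing the prior embedding
alpha = np.full(50, 1.0 / 50)                 # uniform prior weights
beta = prop1_weights(Y_lik, Y_prior, alpha)
print("fraction of negative beta_i:", np.mean(beta < 0))
```

Note that some of the resulting $\beta_i$ are typically negative, which is exactly the issue addressed next.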
The consistency of $\hat{\mathcal{E}}_s[\mu]$ is a direct consequence of the following theorem adapted from [5], since the Cauchy–Schwarz inequality ensures $|\langle\mu_{(X,Y)}, f\rangle - \langle\hat{\mu}_{(X,Y)}, f\rangle| \le \|\mu_{(X,Y)} - \hat{\mu}_{(X,Y)}\|\,\|f\|$.

Theorem 2 (Adapted from [5], Theorem 8). Assume that $C_{YY}$ is injective, that $\hat{\pi}_Y$ is a consistent estimator of $\pi_Y$ in the $\mathcal{H}_{\mathcal{Y}}$ norm, and that $\mathbb{E}\big[k\big((X,Y),(\widetilde{X},\widetilde{Y})\big)\,\big|\,Y = y,\ \widetilde{Y} = \tilde{y}\big]$ is included in $\mathcal{H}_{\mathcal{Y}}\otimes\mathcal{H}_{\mathcal{Y}}$ as a function of $(y, \tilde{y})$, where $(\widetilde{X}, \widetilde{Y})$ is an independent copy of $(X, Y)$. Then, if the regularization coefficient $\lambda_n$ decays to 0 sufficiently slowly,
$$\big\|\hat{C}_{(X,Y)Y}(\hat{C}_{YY} + \lambda_n I)^{-1}\hat{\pi}_Y - \mu_{(X,Y)}\big\|_{\mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}} \to 0 \qquad (6)$$
in probability as $n \to \infty$.

Although $\hat{\mathcal{E}}_s[\mu]$ is a consistent estimator of $\mathcal{E}_s[\mu]$, it does not necessarily have minima, since the coefficients $\beta_i$ can be negative. One of our main contributions in this paper is the observation that we can ignore data points $(x_i, y_i)$ with a negative $\beta_i$, i.e., replace $\beta_i$ with $\beta_i^+ := \max(0, \beta_i)$ in $\hat{\mathcal{E}}_s[\mu]$. We give explanations and theoretical justifications in the next section.

### 3.2 The thresholding regularization

We show in the following theorem that $\hat{\mathcal{E}}_s^+[\mu] := \sum_{i=1}^{n}\beta_i^+\,\|\psi(y_i) - \mu(x_i)\|^2$ converges to $\mathcal{E}_s[\mu]$ in probability in discrete situations. The trick of replacing $\beta_i$ with $\beta_i^+$ is named thresholding regularization.

Theorem 3 (Proof in Appendix). Assume that $\mathcal{X}$ is compact and $|\mathcal{Y}| < \infty$, that $k$ is a strictly positive definite continuous kernel with $\sup_{(x,y)} k((x,y),(x,y)) < \kappa$, and that $f(x, y) = \|\psi(y) - \mu(x)\|^2_{\mathcal{H}_{\mathcal{Y}}} \in \mathcal{H}_{\mathcal{X}}\otimes\mathcal{H}_{\mathcal{Y}}$. With the conditions in Thm. 2, we assert that $\hat{\mu}^+_{(X,Y)}$ (the thresholded counterpart of $\hat{\mu}_{(X,Y)}$) is a consistent estimator of $\mu_{(X,Y)}$ and that $\hat{\mathcal{E}}_s^+[\mu] - \mathcal{E}_s[\mu] \to 0$ in probability as $n \to \infty$.

In the context of partially observed Markov decision processes (POMDPs) [10], a similar thresholding approach, combined with normalization, was proposed to make the Bellman operator isotonic and contractive. However, the authors left the consistency of that approach as an open problem. The justification of normalization has been provided in [13], Lemma 2.2, under the finite-space assumption. A slight modification of our proof of Thm. 3 (changing the probability space from $\mathcal{X} \times \mathcal{Y}$ to $\mathcal{X}$) completes the other half as a side product, under the same assumptions.

Compared to the original squared regularization used in [5], thresholding regularization is more computationally efficient because 1) it does not need to multiply the Gram matrix twice, and 2) it does not need to take into consideration the data points with negative $\beta_i$'s. In many cases a large portion of $\{\beta_i\}_{i=1}^{n}$ is negative, but the sum of their absolute values is small. The finite-space assumption in Thm. 3 may also be weakened, but this requires deeper theoretical analysis.

### 3.3 Minimizing $\hat{\mathcal{E}}_s^+[\mu]$

Following the standard steps of solving an RKHS regression problem, we add a Tikhonov regularization term to $\hat{\mathcal{E}}_s^+[\mu]$ to obtain a well-posed problem:
$$\hat{\mathcal{E}}_{\lambda,n}[\mu] = \sum_{i=1}^{n}\beta_i^+\,\|\psi(y_i) - \mu(x_i)\|^2_{\mathcal{H}_{\mathcal{Y}}} + \lambda\,\|\mu\|^2_{\mathcal{H}_K}. \qquad (7)$$
Let $\hat{\mu}_{\lambda,n} = \arg\min_{\mu}\hat{\mathcal{E}}_{\lambda,n}[\mu]$. Note that $\hat{\mathcal{E}}_{\lambda,n}[\mu]$ is a vector-valued regression problem, and the representer theorems for vector-valued RKHSs apply here. We summarize the matrix expression of $\hat{\mu}_{\lambda,n}$ in the following proposition.

Proposition 2 (Proof in Appendix). Without loss of generality, we assume that $\beta_i^+ \neq 0$ for all $1 \le i \le n$. Let $\mu \in \mathcal{H}_K$ and choose the kernel of $\mathcal{H}_K$ to be $K(x_i, x_j) = k_{\mathcal{X}}(x_i, x_j)\,I$, where $I: \mathcal{H}_{\mathcal{Y}} \to \mathcal{H}_{\mathcal{Y}}$ is the identity map. Then
$$\hat{\mu}_{\lambda,n}(x) = \Psi\,(K_X + \lambda_n\Lambda_+)^{-1}K_{:x}, \qquad (8)$$
where $\Psi = (\psi(y_1), \dots, \psi(y_n))$, $(K_X)_{ij} = k_{\mathcal{X}}(x_i, x_j)$, $\Lambda_+ = \mathrm{diag}(1/\beta_1^+, \dots, 1/\beta_n^+)$, $K_{:x} = (k_{\mathcal{X}}(x, x_1), \dots, k_{\mathcal{X}}(x, x_n))^{\top}$ and $\lambda_n$ is a positive regularization constant.
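Putting Prop. 1 and Prop. 2 together, the whole estimator is a thresholding step followed by one regularized linear solve. The sketch below is our own illustration under the kernel choice $K = k_{\mathcal{X}} I$; `posterior_weights`, the toy data and the point-estimate heuristic at the end are assumptions, not the paper's implementation. It returns the weights $w(x)$ such that $\hat{\mu}_{\lambda,n}(x) = \sum_i w_i(x)\psi(y_i)$.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def posterior_weights(X_lik, beta, x_query, lam_n=1e-3, sigma=1.0):
    """Thresholded estimator of Eq. (8): keep only points with beta_i^+ > 0 and return
    w(x) = (K_X + lam_n * Lambda_+)^{-1} K_{:x}, so that mu_hat(x) = sum_i w_i psi(y_i)."""
    beta_plus = np.maximum(beta, 0.0)
    keep = beta_plus > 1e-12
    K_X = rbf_gram(X_lik[keep], X_lik[keep], sigma)
    Lam = np.diag(1.0 / beta_plus[keep])
    K_qx = rbf_gram(X_lik[keep], np.atleast_2d(x_query), sigma)[:, 0]
    return keep, np.linalg.solve(K_X + lam_n * Lam, K_qx)

rng = np.random.default_rng(0)
X_lik = rng.normal(size=(100, 2))
Y_lik = rng.normal(size=(100, 1))
beta = rng.normal(0.01, 0.02, size=100)       # stand-in for the Prop. 1 weights
keep, w = posterior_weights(X_lik, beta, np.zeros(2))
# <h, mu_hat(x)> = sum_i w_i h(y_i) for any h in H_Y; as a crude point estimate of
# E[Y | X = x] one may report the normalized weighted average of the retained y_i:
print((w @ Y_lik[keep]) / w.sum())
```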
### 3.4 Theoretical justifications for $\hat{\mu}_{\lambda,n}$

In this section, we provide theoretical explanations for using $\hat{\mu}_{\lambda,n}$ as an estimator of the posterior embedding under specific assumptions. Let $\mu_\ast = \arg\min_{\mu}\mathcal{E}[\mu]$, $\mu_s = \arg\min_{\mu}\mathcal{E}_s[\mu]$, and recall that $\hat{\mu}_{\lambda,n} = \arg\min_{\mu}\hat{\mathcal{E}}_{\lambda,n}[\mu]$. We first show the relation between $\mu_\ast$ and $\mu_s$ and then discuss the relation between $\hat{\mu}_{\lambda,n}$ and $\mu_s$.

The forms of $\mathcal{E}$ and $\mathcal{E}_s$ are exactly the same for posterior kernel embeddings and conditional kernel embeddings. As a consequence, the following theorem from [13] still holds.

Theorem 4 ([13]). If there exists a $\mu_\ast \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_{\mathcal{Y}}$, $\mathbb{E}[h(Y)\,|\,X] = \langle h, \mu_\ast(X)\rangle_{\mathcal{H}_{\mathcal{Y}}}$ $p_X$-a.s., then $\mu_\ast$ is the $p_X$-a.s. unique minimiser of both objectives:
$$\mu_\ast = \arg\min_{\mu \in \mathcal{H}_K}\mathcal{E}[\mu] = \arg\min_{\mu \in \mathcal{H}_K}\mathcal{E}_s[\mu].$$

This theorem shows that if the vector-valued RKHS $\mathcal{H}_K$ is rich enough to contain $\mu^{\pi}_{Y|X=x}$, both $\mathcal{E}$ and $\mathcal{E}_s$ lead us to the correct embedding. In this case, it is reasonable to use $\mu_s$ instead of $\mu_\ast$. For the situation where $\mu^{\pi}_{Y|X=x} \notin \mathcal{H}_K$, we refer the reader to [13].

Unfortunately, we cannot obtain the relation between $\hat{\mu}_{\lambda,n}$ and $\mu_s$ by referring to [19], as was done in [13]. The main difficulty is that $\{(x_i, y_i)\}_{i=1}^{n}$ is not an i.i.d. sample from $p^{\pi}(X, Y) = \pi(Y)\,p(X\,|\,Y)$, and the estimator $\hat{\mathcal{E}}_s^+[\mu]$ does not use i.i.d. samples to estimate expectations. Therefore the concentration inequality ([19], Prop. 2) used in the proofs of [19] cannot be applied. To solve this problem, we propose Thm. 9 (in the Appendix), which leads to a consistency proof for $\hat{\mu}_{\lambda,n}$. The relation between $\hat{\mu}_{\lambda,n}$ and $\mu_s$ can now be summarized in the following theorem.

Theorem 5 (Proof in Appendix). Assume that Hypothesis 1 and Hypothesis 2 in [20] and our Assumption 1 (in the Appendix) hold. With the conditions in Thm. 3, we assert that if $\lambda_n$ decreases to 0 sufficiently slowly,
$$\mathcal{E}_s[\hat{\mu}_{\lambda_n,n}] - \mathcal{E}_s[\mu_s] \to 0 \qquad (9)$$
in probability as $n \to \infty$.

## 4 Kernel Bayesian inference with posterior regularization

Based on our optimizational formulation of kernel Bayesian inference, we can add additional regularization terms to control the posterior embeddings. This technique makes it possible to incorporate rich side information from domain knowledge and to enforce supervision on Bayesian inference. We call our framework of imposing posterior regularization kRegBayes. As an example of the framework, we study the following optimization problem:
$$\mathcal{L} := \underbrace{\sum_{i=1}^{m}\beta_i^+\,\|\mu(x_i) - \psi(y_i)\|^2_{\mathcal{H}_{\mathcal{Y}}} + \lambda\,\|\mu\|^2_{\mathcal{H}_K}}_{\hat{\mathcal{E}}_{\lambda,n}[\mu]} + \underbrace{\delta\sum_{i=m+1}^{n}\|\mu(x_i) - \psi(t_i)\|^2_{\mathcal{H}_{\mathcal{Y}}}}_{\text{the regularization term}}, \qquad (10)$$
where $\{(x_i, y_i)\}_{i=1}^{m}$ is the sample used to represent the likelihood, $\{(x_i, t_i)\}_{i=m+1}^{n}$ is the sample used for posterior regularization, and $\lambda, \delta$ are the regularization constants. Note that in RKHS embeddings, $\psi(t)$ is identified with a point distribution at $t$ [2]. Hence the regularization term in (10) encourages the posterior distributions $p(Y\,|\,X = x_i)$ to concentrate around $t_i$. More complicated regularization terms are also possible, such as $\|\mu(x_i) - \sum_{i=1}^{l}\alpha_i\psi(t_i)\|_{\mathcal{H}_{\mathcal{Y}}}$.

Compared to vanilla RegBayes, our kernel counterpart has several obvious advantages. First, the difference between two distributions can be naturally measured by RKHS norms. This makes it possible to regularize the posterior distribution as a whole, rather than through expectations of discriminant functions. Second, the framework of kernel Bayesian inference is totally nonparametric, where the priors and likelihood functions are all represented by respective samples. We further demonstrate the properties of kRegBayes through the experiments in the next section.

Let $\hat{\mu}_{\mathrm{reg}} = \arg\min_{\mu}\mathcal{L}$. Solving $\mathcal{L}$ is substantially the same as minimizing $\hat{\mathcal{E}}_{\lambda,n}[\mu]$, and we summarize the result in the following proposition.

Proposition 3. With the conditions in Prop. 2, we have
$$\hat{\mu}_{\mathrm{reg}}(x) = \Psi\,(K_X + \lambda\Lambda_+)^{-1}K_{:x}, \qquad (11)$$
where $\Psi = (\psi(y_1), \dots, \psi(y_n))$ with $y_i := t_i$ for $m < i \le n$, $(K_X)_{ij} = k_{\mathcal{X}}(x_i, x_j)$ for $1 \le i, j \le n$, $\Lambda_+ = \mathrm{diag}(1/\beta_1^+, \dots, 1/\beta_m^+, 1/\delta, \dots, 1/\delta)$, and $K_{:x} = (k_{\mathcal{X}}(x, x_1), \dots, k_{\mathcal{X}}(x, x_n))^{\top}$.
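Computationally, Eq. (11) only changes the diagonal matrix $\Lambda_+$ and the set of target points relative to Eq. (8). A minimal sketch follows (ours; the helper names, toy data and parameter values are assumptions, not the paper's code).

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kregbayes_weights(X_lik, beta_plus, X_reg, x_query, lam=1e-3, delta=1e-3, sigma=1.0):
    """Eq. (11): weights of mu_reg(x) = Psi (K_X + lam * Lambda_+)^{-1} K_{:x}.
    The first m diagonal entries of Lambda_+ are 1/beta_i^+ (likelihood sample), the
    remaining n - m entries are 1/delta (posterior-regularization sample).
    Assumes every beta_i^+ is strictly positive, as in Prop. 2."""
    X_all = np.vstack([X_lik, X_reg])
    diag = np.concatenate([1.0 / beta_plus, np.full(len(X_reg), 1.0 / delta)])
    K_X = rbf_gram(X_all, X_all, sigma)
    K_qx = rbf_gram(X_all, np.atleast_2d(x_query), sigma)[:, 0]
    return np.linalg.solve(K_X + lam * np.diag(diag), K_qx)

rng = np.random.default_rng(0)
w = kregbayes_weights(X_lik=rng.normal(size=(80, 2)),
                      beta_plus=np.abs(rng.normal(0.01, 0.01, size=80)) + 1e-4,
                      X_reg=rng.normal(size=(20, 2)),
                      x_query=np.zeros(2))
# w multiplies the stacked targets (psi(y_1), ..., psi(y_m), psi(t_{m+1}), ..., psi(t_n)),
# so posterior expectations are again weighted sums over those targets.
```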
## 5 Experiments

In this section, we compare the results of kRegBayes and several other baselines on two state-space filtering tasks. The mechanism behind kernel filtering is stated in [5], and we provide a detailed introduction in the Appendix, including all the formulas used in the implementation.

Toy dynamics. This experiment is a twist on the one used in [5]. We report the results of the extended Kalman filter (EKF) [21], the unscented Kalman filter (UKF) [22], the kernel Bayes' rule (KBR) [5], kernel Bayesian learning with thresholding regularization (pKBR) and kRegBayes. The data points $\{(\theta_t, x_t, y_t)\}$ are generated from the dynamics
$$\theta_{t+1} = \theta_t + 0.4 + \xi_t \ (\mathrm{mod}\ 2\pi), \qquad (x_{t+1}, y_{t+1}) = \big(1 + \sin(8\theta_{t+1})\big)(\cos\theta_{t+1}, \sin\theta_{t+1}) + \zeta_t,$$
where $\theta_t$ is the hidden state, $(x_t, y_t)$ is the observation, $\xi_t \sim \mathcal{N}(0, 0.04)$ and $\zeta_t \sim \mathcal{N}(0, 0.04)$. Note that this dynamics is nonlinear in both the transition and the observation functions. The observation model is an oscillation around the unit circle. There are 1000 training data and 200 validation/test data for each algorithm.

Figure 1: Mean running MSEs against time steps for each algorithm. (Best viewed in color.)

We suppose that EKF, UKF and kRegBayes know the true dynamics of the model and the first hidden state $\theta_1$. In this case, we use $\tilde{\theta}_{t+1} = \theta_1 + 0.4t \ (\mathrm{mod}\ 2\pi)$ and $(\tilde{x}_{t+1}, \tilde{y}_{t+1}) = (1 + \sin(8\tilde{\theta}_{t+1}))(\cos\tilde{\theta}_{t+1}, \sin\tilde{\theta}_{t+1})$ as the supervision data point for the $(t+1)$-th step. We follow [5] to set our parameters. The results are summarized in Fig. 1. pKBR has lower errors than KBR, which means the thresholding regularization is in practice no worse than the original squared regularization. The lower MSE of kRegBayes compared with pKBR shows that the posterior regularization successfully incorporates information from the equations of the dynamics. Moreover, pKBR and kRegBayes run faster than KBR. The total running times on 50 random datasets for pKBR, kRegBayes and KBR are respectively 601.3s, 677.5s and 3667.4s.

Camera position recovery. In this experiment, we build a scene containing a table and a chair, derived from classchair.pov (http://www.oyonale.com). With a fixed focal point, the position of the camera uniquely determines the view of the scene. The task is to estimate the position of the camera given the image, a problem with practical applications in remote sensing and robotics. We vary the position of the camera in a plane with a fixed height. The transition equations of the hidden states are
$$\theta_{t+1} = \theta_t + 0.2 + \xi_\theta, \quad r_{t+1} = \max\!\big(R_1, \min(R_2, r_t + \xi_r)\big), \quad x_{t+1} = r_{t+1}\cos\theta_{t+1}, \quad y_{t+1} = r_{t+1}\sin\theta_{t+1},$$
where $\xi_\theta \sim \mathcal{N}(0, 4\mathrm{e}{-4})$, $\xi_r \sim \mathcal{N}(0, 1)$, $0 \le R_1 < R_2$ are two constants and $\{(x_t, y_t)\}_{t=1}^{m}$ are treated as the hidden variables. As the observation at the $t$-th step, we render a 100×100 image with the camera located at $(x_t, y_t)$. For the training data we set $R_1 = 0$ and $R_2 = 10$, while for the validation and test data we set $R_1 = 5$ and $R_2 = 7$. The motivation is to test the efficacy of kRegBayes in enforcing the posterior distribution to concentrate around distance 6. We show a sample set of training and test images in Fig. 2. We compare KBR, pKBR and kRegBayes with the traditional linear Kalman filter (KF) [23]. Following [4], we down-sample the images and train a linear regressor for the observation model. In all experiments, we flatten the images to a column vector and apply Gaussian RBF kernels where needed. The kernel bandwidths are set to the median distances in the training data. Based on experiments on the validation dataset, we set $\lambda_T = 1\mathrm{e}{-6} = 2\delta_T$ and $\mu_T = 1\mathrm{e}{-5}$.
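For concreteness, here is a small sketch of the median-heuristic bandwidth choice on flattened frames (our illustration; the helper `median_heuristic_gram` and the random stand-in images are not from the paper).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def median_heuristic_gram(X):
    """Gaussian RBF Gram matrix with the bandwidth chosen by the median heuristic:
    sigma = median of the pairwise Euclidean distances in the training data."""
    D = squareform(pdist(X))
    sigma = np.median(D[D > 0])
    return np.exp(-D**2 / (2 * sigma**2)), sigma

rng = np.random.default_rng(0)
frames = rng.uniform(size=(50, 100 * 100))   # stand-in for 50 flattened 100x100 images
K, sigma = median_heuristic_gram(frames)
print(K.shape, round(float(sigma), 3))
```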
Figure 2: First several frames of the training data (upper row) and the test data (lower row).

Figure 3: (a) MSEs for different algorithms (best viewed in color). Since KF performs much worse than the kernel filters, we use a different scale and plot it against the right y-axis. (b) Probability histograms of the distance between each state and the scene center. All algorithms use 100 training data.

To provide supervision for kRegBayes, we uniformly generate 2000 data points $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{2000}$ on the circle $r = 6$. Given the previous estimate $(\tilde{x}_t, \tilde{y}_t)$, we first compute $\hat{\theta}_t = \arctan(\tilde{y}_t/\tilde{x}_t)$ (where the value of $\hat{\theta}_t$ is adapted according to the quadrant of $(\tilde{x}_t, \tilde{y}_t)$) and estimate $(\tilde{x}_{t+1}, \tilde{y}_{t+1}) = (\cos(\hat{\theta}_t + 0.4), \sin(\hat{\theta}_t + 0.4))$. Next, we find the point $(\hat{x}_k, \hat{y}_k)$ in the supervision set nearest to $(\tilde{x}_{t+1}, \tilde{y}_{t+1})$ and add the regularization term $\mu_T\,\|\mu(I_{t+1}) - \phi(\hat{x}_k, \hat{y}_k)\|^2$ to the posterior embedding objective, where $I_{t+1}$ denotes the $(t+1)$-th image.

We vary the size of the training dataset from 100 to 300 and report the results of KBR, pKBR, kRegBayes and KF on 200 test images in Fig. 3. KF performs much worse than all three kernel filters due to the extreme nonlinearity. The result of pKBR is a little worse than that of KBR, but the gap decreases as the training dataset becomes larger. kRegBayes always performs the best. Note that the advantage becomes less obvious as more data come. This is because kernel methods can learn the distance relation better with more data, and posterior regularization tends to be more useful when data are not abundant and domain knowledge matters. Furthermore, Fig. 3(b) shows that the posterior regularization helps the distances to concentrate.

## 6 Conclusions

We propose an optimizational framework for kernel Bayesian inference. With thresholding regularization, the minimizer of the framework is shown to be a reasonable estimator of the posterior kernel embedding. In addition, we propose a posterior-regularized kernel Bayesian inference framework called kRegBayes. These frameworks are applied to nonlinear state-space filtering tasks and the results of different algorithms are compared extensively.

## Acknowledgements

We thank all the anonymous reviewers for valuable suggestions. The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China Projects (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, and Tsinghua Initiative Scientific Research Program (No. 20141080934).

## References

[1] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. Citeseer, 1998.

[2] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.

[3] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31. Springer, 2007.

[4] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968. ACM, 2009.

[5] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes' rule. In Advances in Neural Information Processing Systems, pages 1737–1745, 2011.

[6] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
[7] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alex Smola. Hilbert space embeddings of hidden Markov models. 2010.

[8] Le Song, Arthur Gretton, and Carlos Guestrin. Nonparametric tree graphical models. In International Conference on Artificial Intelligence and Statistics, pages 765–772, 2010.

[9] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. arXiv preprint arXiv:1206.4655, 2012.

[10] Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of POMDPs. arXiv preprint arXiv:1210.4887, 2012.

[11] Peter M. Williams. Bayesian conditionalisation and the principle of minimum information. The British Journal for the Philosophy of Science, 31(2), 1980.

[12] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

[13] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1823–1830, 2012.

[14] Jun Zhu, Ning Chen, and Eric P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. The Journal of Machine Learning Research, 15(1):1799–1847, 2014.

[15] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. The Journal of Machine Learning Research, 7:2651–2667, 2006.

[16] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[17] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001–2049, 2010.

[18] Jun Zhu, Amr Ahmed, and Eric Xing. MedLDA: Maximum margin supervised topic models. JMLR, 13:2237–2278, 2012.

[19] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[20] Ernesto De Vito and Andrea Caponnetto. Risk bounds for regularized least-squares algorithm with operator-valued kernels. Technical report, DTIC Document, 2005.

[21] Simon J. Julier and Jeffrey K. Uhlmann. New extension of the Kalman filter to nonlinear systems. In AeroSense '97, pages 182–193. International Society for Optics and Photonics, 1997.

[22] Eric A. Wan and Rudolph van der Merwe. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pages 153–158. IEEE, 2000.

[23] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[24] Alexander J. Smola and Risi Kondor. Kernels and regularization on graphs. In Learning Theory and Kernel Machines, pages 144–158. Springer, 2003.

[25] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.