# Bayesian Deconditional Kernel Mean Embeddings

Kelvin Hsu (University of Sydney; CSIRO, Sydney) and Fabio Ramos (University of Sydney; NVIDIA, Seattle)

## Abstract

Conditional kernel mean embeddings form an attractive nonparametric framework for representing conditional means of functions, describing the observation processes of many complex models. However, recovering the original underlying function of interest, whose conditional mean was observed, is a challenging inference task. We formalize deconditional kernel mean embeddings as a solution to this inverse problem and show that they can be naturally viewed as a nonparametric Bayes rule. Critically, we introduce the notion of task transformed Gaussian processes and establish deconditional kernel means as their posterior predictive means. This connection provides Bayesian interpretations and uncertainty estimates for deconditional kernel mean embeddings, explains their regularization hyperparameters, and reveals a marginal likelihood for kernel hyperparameter learning. These results further enable practical applications such as likelihood-free inference and learning sparse representations for big data.

## 1. Introduction

Observations of complex phenomena often lead to likelihoods that are described by a conditional mean. A widely applicable setting where this occurs is the collection of observations under uncertain inputs. The task is to learn a function $f : \mathcal{X} \to \mathbb{R}$ that models a real-valued response $z$ as a function of inputs $x \in \mathcal{X}$, without being able to query or measure $x$ directly when observing the phenomenon. Instead, another measured input $y \in \mathcal{Y}$ relates to $x$ through $p(x \mid y)$. Consequently, given $y$, the response $Z$ has mean $g(y) := \mathbb{E}[f(X) \mid Y = y]$, where $g$ is called the conditional mean of $f$. Furthermore, $p(x \mid y)$ is often only available as sample pairs $\{x_i, y_i\}_{i=1}^n$ from simulations, algorithms, or separate experiments, making the recovery of the latent function $f$ from its conditional mean $g$ a challenging inference task.

Our first contribution begins with formulating deconditional mean embeddings (DMEs) as solutions to this inference problem by building upon the framework of conditional mean embeddings (CMEs) (Song et al., 2013). We show that the DME can be established as a nonparametric Bayes rule in the reproducing kernel Hilbert space (RKHS) and used for likelihood-free Bayesian inference. In contrast to the kernel Bayes rule (KBR) (Fukumizu et al., 2013), which uses third order tensors that can result in vanishing priors, DMEs use second order tensors and avoid this problem. Together with CMEs and KBR, DMEs form a critical part of the kernel mean embedding (KME) framework (Muandet et al., 2017), in which probabilistic rules are represented nonparametrically as operators that are linear in the RKHS. This greatly simplifies probabilistic inference without requiring parametrized distributions or compromising flexibility. Despite this connection, there are elements unique to the KME framework that cannot be interpreted or solved via the parallel between probability rules and RKHS mean operations.
Similar to the empirical forms of KBR and CMEs, empirical DMEs are obtained by replacing the expectations in their constituent operators with empirical means, and by introducing regularization for operator inverses to relax RKHS assumptions, rather than as the optimal solution to a particular loss. Setting the regularization hyperparameters is difficult in practice without an appropriate loss for the inference task. Furthermore, as with KBR, the nonparametric Bayes rule provided by DMEs is a statement between observed (or simulated) variables, not one on latent functions or quantities. Consequently, uncertainty estimation for inference of the latent function $f$ still requires a separate Bayesian formulation.

Our second contribution establishes a Bayesian view of DMEs as posterior predictive means of the task transformed Gaussian process (TTGP), a novel nonparametric Bayesian model that recovers latent relationships between variables without observing them jointly. TTGPs are so named because we show that they are a type of transformed Gaussian process (Murray-Smith & Pearlmutter, 2005) in which the transformations and noise covariances are learned, by transforming one Gaussian process (GP) task to another, rather than designed from expert knowledge. We use this connection to derive posterior and predictive uncertainty estimates for DMEs and to explain their regularization hyperparameters as functions of noise variance. Finally, we derive marginal likelihoods and their scalable computational forms to learn DME hyperparameters, which can also be applied to learn inducing points for sparse representations as a special case. All proofs are in the supplementary material.

## 2. Kernel Mean Embeddings

We begin with an overview of the KME framework upon which DMEs are built. KMEs are an arsenal of techniques concerned with representations and transformations of function expectations under highly flexible distributions. They consider functions that lie within RKHSs $\mathcal{H}_k$ and $\mathcal{H}_\ell$, formed by positive definite kernels $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. The RKHSs $\mathcal{H}_k$ and $\mathcal{H}_\ell$ are the closures of the spans of the features $\varphi(x) = k(x, \cdot)$ and $\psi(y) = \ell(y, \cdot)$ across $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ respectively, endowed with the inner products $\langle \cdot, \cdot \rangle_k$ on $\mathcal{H}_k$ and $\langle \cdot, \cdot \rangle_\ell$ on $\mathcal{H}_\ell$. The key object is the mean embedding of a distribution, $\mu_X := \mathbb{E}[k(X, \cdot)] \in \mathcal{H}_k$. Mean embeddings encode function expectations in the sense that $\mathbb{E}[f(X)] = \langle \mu_X, f \rangle_k$, due to the reproducing property $\langle k(x, \cdot), f \rangle_k = f(x)$ for all $f \in \mathcal{H}_k$.

Higher order mean embeddings are vital components of the framework. Specifically, second order mean embeddings such as $C_{YY} := \mathbb{E}[\ell(Y, \cdot) \otimes \ell(Y, \cdot)] \in \mathcal{H}_\ell \otimes \mathcal{H}_\ell$ and $C_{XY} := \mathbb{E}[k(X, \cdot) \otimes \ell(Y, \cdot)] \in \mathcal{H}_k \otimes \mathcal{H}_\ell$ can be identified as cross-covariance operators $C_{YY} : \mathcal{H}_\ell \to \mathcal{H}_\ell$ and $C_{XY} : \mathcal{H}_\ell \to \mathcal{H}_k$ that serve as building blocks of CMEs and DMEs.

In practical scenarios, only iid samples $\{x_i, y_i\}_{i=1}^n$, realizations of $(X_i, Y_i) \sim \mathbb{P}_{XY}$ for $i \in \{1, \dots, n\}$, are available. The KME framework then becomes attractive for nonparametric inference because its core objects only require expectations under distributions. Consequently, they can be estimated via empirical means as $\hat{\mu}_X := \frac{1}{n} \sum_{i=1}^n k(x_i, \cdot)$, $\hat{C}_{YY} := \frac{1}{n} \sum_{i=1}^n \ell(y_i, \cdot) \otimes \ell(y_i, \cdot)$, and $\hat{C}_{XY} := \frac{1}{n} \sum_{i=1}^n k(x_i, \cdot) \otimes \ell(y_i, \cdot)$ (Muandet et al., 2017). For feature matrices, we stack features by columns, $\Phi := \begin{bmatrix} \varphi(x_1) & \cdots & \varphi(x_n) \end{bmatrix}$ and $\Psi := \begin{bmatrix} \psi(y_1) & \cdots & \psi(y_n) \end{bmatrix}$. We write Gram matrices as $K := \Phi^\top \Phi$ and $L := \Psi^\top \Psi$, where the $(i, j)$-th element of $A^\top B$ is the inner product of the $i$-th column of $A$ with the $j$-th column of $B$. That is, $K_{ij} = \varphi(x_i)^\top \varphi(x_j)$ and $L_{ij} = \psi(y_i)^\top \psi(y_j)$.
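The empirical quantities above reduce to simple matrix computations. The following is a minimal numerical sketch, assuming Gaussian RBF kernels and a toy joint sample $\{x_i, y_i\}_{i=1}^n$; the kernel choice, data-generating process, and variable names are illustrative assumptions, not specifics from the paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Gaussian RBF kernel matrix between the rows of A (n x d) and B (m x d).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
n = 200
y = rng.uniform(-3.0, 3.0, size=(n, 1))                 # samples y_i
x = np.sin(y) + 0.1 * rng.standard_normal((n, 1))       # samples x_i drawn given y_i

K = rbf_kernel(x, x, lengthscale=0.5)                   # K_ij = k(x_i, x_j)
L = rbf_kernel(y, y, lengthscale=0.5)                   # L_ij = l(y_i, y_j)

# An RKHS function f = sum_i alpha_i k(x_i, .); its evaluations {f(x_j)} are K @ alpha.
alpha = rng.standard_normal(n) / n
f_at_x = K @ alpha

# Empirical mean embedding estimate: <mu_hat_X, f>_k = (1/n) sum_i f(x_i).
print(f_at_x.mean())
```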
When columns are elements of RKHSs, such as $\varphi(x) = k(x, \cdot)$ in $\Phi$ and $\psi(y) = \ell(y, \cdot)$ in $\Psi$, the notation $(\cdot)^\top(\cdot)$ is shorthand for the corresponding RKHS inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ whenever it is clear from context what $\mathcal{H}$ is. For example, $f^\top h$ is shorthand for $\langle f, h \rangle_k$ if $f, h \in \mathcal{H}_k$. Another common usage is $\Phi^\top f = \{\varphi(x_i)^\top f\}_{i=1}^n = \{\langle k(x_i, \cdot), f \rangle_k\}_{i=1}^n = \{f(x_i)\}_{i=1}^n =: \mathbf{f}$. For sums of outer products, we write $\hat{C}_{YY} = \frac{1}{n} \Psi \Psi^\top$ and $\hat{C}_{XY} = \frac{1}{n} \Phi \Psi^\top$. Note that we use non-bold letters for single points $x$ and $y$, even though they are often multivariate in practice.

## 3. Conditional Kernel Mean Embeddings

We now present CMEs in a fashion that focuses on their operator properties. Reviewing CMEs this way makes the parallels and contrasts with the DMEs of section 4 more apparent. Importantly, instead of defining CMEs via an explicit form, we begin by forming problem statements.

**Definition 3.1 (Conditional Mean Problem Statement).** Given a function $f : \mathcal{X} \to \mathbb{R}$, infer the function $g : \mathcal{Y} \to \mathbb{R}$ such that $g(y) = \mathbb{E}[f(X) \mid Y = y] \equiv \mathbb{E}_{X|Y}[f](y)$.

We call $g$ the conditional mean of $f$ with respect to $\mathbb{P}_{X|Y}$ and write the shorthand $g = \mathbb{E}_{X|Y}[f] = \mathbb{E}[f(X) \mid Y = \cdot]$. This naturally leads to the notion of operators that map functions $f$ to their conditional means $g = \mathbb{E}[f(X) \mid Y = \cdot]$.

**Definition 3.2 (Conditional Mean Operators).** The conditional mean operator (CMO) $C_{X|Y} : \mathcal{H}_\ell \to \mathcal{H}_k$ corresponding to $\mathbb{P}_{X|Y}$ is the operator that satisfies
$$(C_{X|Y})^\top f = \mathbb{E}[f(X) \mid Y = \cdot], \quad \forall f \in \mathcal{H}_k, \tag{3.1}$$
where $(C_{X|Y})^\top : \mathcal{H}_k \to \mathcal{H}_\ell$ denotes the adjoint of $C_{X|Y}$.

Depending on the nature of $\ell$, unique solutions exist.

**Theorem 3.1 (Fukumizu et al., 2004).** Assume that $\ell(y, \cdot) \in \mathrm{image}(C_{YY})$ for all $y \in \mathcal{Y}$. The CMO $C_{X|Y}$ is unique and given by
$$C_{X|Y} = C_{XY} C_{YY}^{-1}. \tag{3.2}$$

The assumption that $\ell(y, \cdot) \in \mathrm{image}(C_{YY})$ for all $y \in \mathcal{Y}$ is commonly relaxed by introducing a regularization hyperparameter $\lambda > 0$ to the inverse, so that the CMO is replaced with $C_{XY}(C_{YY} + \lambda I)^{-1}$ (Song et al., 2013). Contrary to definition 3.2, it is more common in the literature to define the CMO as the operator $C_{X|Y}$ that satisfies
$$C_{X|Y}\, \ell(y, \cdot) = \mathbb{E}[k(X, \cdot) \mid Y = y], \quad \forall y \in \mathcal{Y}, \tag{3.3}$$
while (3.1) is taken as an immediate property of CMOs (Fukumizu et al., 2004). However, due to lemma 3.2, we instead take definition 3.2 as the definition of CMOs, emphasizing CMOs as solutions to the conditional mean problem, and treat (3.3) as an immediate property.

**Lemma 3.2.** Statements (3.1) and (3.3) are equivalent.

The CME of $\mathbb{P}_{X|Y=y}$ is $\mu_{X|Y=y} := C_{X|Y}\, \ell(y, \cdot)$, equivalent to querying the CMO at a particular input $y$. Consequently, $\langle \mu_{X|Y=y}, f \rangle_k = \langle C_{X|Y}\, \ell(y, \cdot), f \rangle_k = \langle \ell(y, \cdot), (C_{X|Y})^\top f \rangle_\ell = \langle \ell(y, \cdot), \mathbb{E}_{X|Y}[f] \rangle_\ell = \mathbb{E}_{X|Y}[f](y)$. Motivated by theorem 3.1, empirical CMOs and CMEs are defined by estimating their constituents with empirical means.

**Definition 3.3 (Empirical Conditional Mean Operator).** The empirical CMO is $\hat{C}_{X|Y} := \hat{C}_{XY} (\hat{C}_{YY} + \lambda I)^{-1}$, $\lambda > 0$.

**Theorem 3.3 (Song et al., 2009).** The nonparametric form of $\hat{C}_{X|Y}$ is
$$\hat{C}_{X|Y} = \Phi (L + n\lambda I)^{-1} \Psi^\top. \tag{3.4}$$

The empirical CME is then $\hat{\mu}_{X|Y=y} := \hat{C}_{X|Y}\, \ell(y, \cdot)$. Consequently, with $\boldsymbol{\ell}(y) := \{\ell(y_i, y)\}_{i=1}^n$, an estimate of $\mathbb{E}_{X|Y}[f](y)$ is $\langle f, \hat{\mu}_{X|Y=y} \rangle_k = \langle f, \hat{C}_{X|Y}\, \ell(y, \cdot) \rangle_k = f^\top \Phi (L + n\lambda I)^{-1} \Psi^\top \ell(y, \cdot) = \mathbf{f}^\top (L + n\lambda I)^{-1} \boldsymbol{\ell}(y)$.
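In matrix form, the empirical CME estimate above is a single regularized linear solve. Below is a minimal sketch, assuming Gaussian RBF kernels and a toy conditional $p(x \mid y)$ so that the estimate can be compared against a Monte Carlo reference; all names, lengthscales, and settings are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(0)
n, lam = 300, 1e-3
y = rng.uniform(-3.0, 3.0, size=(n, 1))
x = np.sin(y) + 0.1 * rng.standard_normal((n, 1))        # joint samples (x_i, y_i) with x ~ p(x | y)

f = lambda t: np.cos(2.0 * t)                            # known f whose conditional mean we query
f_at_x = f(x).ravel()                                    # f := {f(x_i)}_{i=1}^n
L = rbf_kernel(y, y, 0.5)

# Empirical CME estimate of E[f(X) | Y = y*]: f^T (L + n*lam*I)^{-1} l(y*).
y_star = np.array([[0.7]])
l_y = rbf_kernel(y, y_star, 0.5).ravel()                 # l(y*) := {l(y_i, y*)}_{i=1}^n
cme_estimate = f_at_x @ np.linalg.solve(L + n * lam * np.eye(n), l_y)

# Monte Carlo reference for E[f(X) | Y = y*] under X = sin(y*) + 0.1 * eta.
mc_reference = f(np.sin(y_star) + 0.1 * rng.standard_normal((100000, 1))).mean()
print(cme_estimate, mc_reference)
```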
Critically, while empirical CMOs (3.4) are estimated from joint samples from the joint distribution $\mathbb{P}_{XY}$, they only encode the conditional distribution $\mathbb{P}_{X|Y}$. This means that the empirical CMO will encode the same conditional distribution even if the joint distribution $\mathbb{P}_{XY}$ changes while the conditional distribution $\mathbb{P}_{X|Y}$ stays the same. That is, the empirical CMO built from joint samples of $p(x, y) = p(x \mid y)\, p(y)$ and the empirical CMO built from joint samples of $q(x, y) := p(x \mid y)\, q(y)$ encode the same conditional distribution $p(x \mid y)$ and converge to the same CMO.

## 4. Deconditional Kernel Mean Embeddings

We now present a novel class of KMEs referred to as deconditional mean embeddings (DMEs). They are natural counterparts to CMEs, and the presentation of definitions and theorems in this section largely parallels section 3. We define the deconditional mean problem as the task of recovering latent functions from their conditional means.

**Definition 4.1 (Deconditional Mean Problem Statement).** Given a function $g : \mathcal{Y} \to \mathbb{R}$, infer a function $f : \mathcal{X} \to \mathbb{R}$ such that $g(y) = \mathbb{E}[f(X) \mid Y = y]$.

We call $f$ a deconditional mean of $g$ with respect to $\mathbb{P}_{X|Y}$ and write the shorthand $f = \mathbb{E}'_{X|Y}[g]$. The deconditional mean of a function $g$ is the function $f$ whose conditional mean with respect to $\mathbb{P}_{X|Y}$ would be $g$. The corresponding operator that encodes this transformation is the deconditional mean operator (DMO).

**Definition 4.2 (Deconditional Mean Operators).** The deconditional mean operator (DMO) $C'_{X|Y} : \mathcal{H}_k \to \mathcal{H}_\ell$ corresponding to $\mathbb{P}_{X|Y}$ is the operator that satisfies
$$(C'_{X|Y})^\top\, \mathbb{E}[f(X) \mid Y = \cdot] = f, \quad \forall f \in \mathcal{H}_k. \tag{4.1}$$

Depending on the nature of $\ell$ and $k$, unique solutions exist.

**Theorem 4.1.** Assume that $\ell(y, \cdot) \in \mathrm{image}(C_{YY})$ for all $y \in \mathcal{Y}$ and $k(x, \cdot) \in \mathrm{image}(C_{X|Y} C_{YY} (C_{X|Y})^\top)$ for all $x \in \mathcal{X}$. The DMO $C'_{X|Y}$ is unique and given by
$$C'_{X|Y} = (C_{X|Y} C_{YY})^\top \big( C_{X|Y} C_{YY} (C_{X|Y})^\top \big)^{-1}. \tag{4.2}$$

Similar to the case with CMOs (Song et al., 2013), the assumption that $k(x, \cdot) \in \mathrm{image}(C_{X|Y} C_{YY} (C_{X|Y})^\top)$ for all $x \in \mathcal{X}$ can be relaxed by introducing a regularization hyperparameter $\epsilon > 0$ to the inverse, so that the DMO is replaced with $(C_{X|Y} C_{YY})^\top (C_{X|Y} C_{YY} (C_{X|Y})^\top + \epsilon I)^{-1}$. Since DMOs invert the results of CMOs, they can also be understood as pseudo-inverses of CMOs.

**Theorem 4.2.** If the assumptions in theorem 4.1 hold and furthermore $((C_{X|Y})^\top C_{X|Y})^{-1}$ exists so that the pseudo-inverse $C^\dagger_{X|Y} := ((C_{X|Y})^\top C_{X|Y})^{-1} (C_{X|Y})^\top$ is defined, then DMOs are pseudo-inverses of CMOs: $C'_{X|Y} = C^\dagger_{X|Y}$.

The DME of $\mathbb{P}_{X=x|Y}$ is $\mu'_{X=x|Y} := C'_{X|Y}\, k(x, \cdot) \in \mathcal{H}_\ell$, equivalent to querying the DMO at a particular input $x$. Consequently, $\langle \mu'_{X=x|Y}, g \rangle_\ell = \langle C'_{X|Y}\, k(x, \cdot), g \rangle_\ell = \langle k(x, \cdot), (C'_{X|Y})^\top g \rangle_k = \langle k(x, \cdot), f \rangle_k = f(x)$.

The form (4.2) makes it evident that a DMO is fully specified once $C_{X|Y}$ and $C_{YY}$, encoding the measures $\mathbb{P}_{X|Y}$ and $\mathbb{P}_Y$ respectively, are known. If densities exist, we write them as $p_{X|Y} \equiv p_{X|Y}(\cdot \mid \cdot)$ and $p_Y \equiv p_Y(\cdot)$, and drop the subscripts in density evaluations, writing $p(x \mid y)$ and $p(y)$ whenever the context is clear. Note that $\mathbb{P}_{X=x|Y}$ corresponds to $p_{X|Y}(x \mid \cdot)$, which is evaluated at $x$ and is now a function of $y$. This is in contrast to $\mathbb{P}_{X|Y=y}$, which corresponds to $p_{X|Y}(\cdot \mid y)$, evaluated at $y$ and now a function of $x$.

Consider the case where $X$ and $Y$ play the roles of observed and unobserved (latent) variables respectively. The DMO considers the conditional $p_{X|Y}$ and the marginal $p_Y$, encoded as $C_{X|Y}$ and $C_{YY}$ (theorem 4.1), and inverts the CMO $C_{X|Y}$ (theorem 4.2) with the help of the encoded marginal $C_{YY}$. This is analogous to Bayes rule, where the posterior
$$p_{Y|X}(\cdot \mid x) = \frac{p_{X|Y}(x \mid \cdot)\, p_Y(\cdot)}{\int_{\mathcal{Y}} p_{X|Y}(x \mid y)\, p_Y(y)\, dy}$$
is fully specified by the likelihood $p_{X|Y}$ and the prior $p_Y$.
We can then interpret DMEs as querying this rule at the observed quantity $x$ while leaving the rule as a function of $y$ for inference. Consequently, we also refer to $C_{X|Y}$ and $C_{YY}$ as the likelihood operator and the prior operator respectively. The difference between the DMO equation (4.2) and the CMO equation (3.2) is akin to writing $p_{Y|X}(\cdot \mid x)$ using Bayes rule versus using the conditional density rule. Compare the DMO decomposition (4.2) with the CMO decomposition $C_{Y|X} = C_{YX} C_{XX}^{-1} = (C_{XY})^\top C_{XX}^{-1}$ in the other direction, obtained by reversing the roles of $X$ and $Y$ in (3.2), which corresponds to the posterior $\mathbb{P}_{Y|X}$. The CMO is composed of a joint operator $C_{XY}$ and an evidence operator $C_{XX}$, corresponding to the joint $\mathbb{P}_{XY}$ and evidence $\mathbb{P}_X$ distributions. Similarly, the DMO is also composed of a joint operator $C_{XY} = C_{X|Y} C_{YY} : \mathcal{H}_\ell \to \mathcal{H}_k$ and an evidence operator $C'_{XX} := C_{X|Y} C_{YY} (C_{X|Y})^\top : \mathcal{H}_k \to \mathcal{H}_k$, but both are specified from the likelihood and prior operators.

Motivated by this, we propose to estimate the likelihood and prior operators using separate and independently drawn samples. The likelihood operator $C_{X|Y}$ is estimated as $\hat{C}_{X|Y}$ (definition 3.3) using iid samples $\{x_i, y_i\}_{i=1}^n$, also denoted $x := \{x_i\}_{i=1}^n$ and $y := \{y_i\}_{i=1}^n$. Note that, since the likelihood operator is a CMO, these joint samples can come from any joint distribution $\mathbb{Q}_{XY} \neq \mathbb{P}_{XY}$ as long as its conditional distribution is also $\mathbb{P}_{X|Y}$. The prior operator $C_{YY}$ is estimated as $\tilde{C}_{YY} := \frac{1}{m} \sum_{j=1}^m \ell(\tilde{y}_j, \cdot) \otimes \ell(\tilde{y}_j, \cdot)$ using another set of iid samples $\tilde{y} := \{\tilde{y}_j\}_{j=1}^m$ from $\mathbb{P}_Y$.

**Definition 4.3 (Empirical Deconditional Mean Operator).** Let $\epsilon > 0$ be a regularization hyperparameter and define $\hat{C}_{X|Y}$ and $\tilde{C}_{YY}$ as above. The empirical DMO is
$$\bar{C}'_{X|Y} := (\hat{C}_{X|Y} \tilde{C}_{YY})^\top \big( \hat{C}_{X|Y} \tilde{C}_{YY} (\hat{C}_{X|Y})^\top + \epsilon I \big)^{-1}. \tag{4.3}$$

The accents denote the set of samples used for estimation. When both sets are used, as in the estimation of the DMO, we denote it with a bar, as in $\bar{C}'_{X|Y}$.

**Theorem 4.3.** The nonparametric form of $\bar{C}'_{X|Y}$ is
$$\bar{C}'_{X|Y} = \tilde{\Psi} \big( A^\top K A + m\epsilon I \big)^{-1} A^\top \Phi^\top, \tag{4.4}$$
where $A := (L + n\lambda I)^{-1} \tilde{L}$, $\tilde{L} := \Psi^\top \tilde{\Psi}$, and $\tilde{\Psi} := \begin{bmatrix} \psi(\tilde{y}_1) & \cdots & \psi(\tilde{y}_m) \end{bmatrix}$.

The empirical DME is then $\bar{\mu}'_{X=x|Y} := \bar{C}'_{X|Y}\, k(x, \cdot)$. Consequently, with $\mathbf{k}(x) := \{k(x_i, x)\}_{i=1}^n$ and $\tilde{\mathbf{g}} := \{g(\tilde{y}_j)\}_{j=1}^m$, an estimate of $\mathbb{E}'_{X|Y}[g](x)$ is $\langle g, \bar{\mu}'_{X=x|Y} \rangle_\ell = \tilde{\mathbf{g}}^\top \big( A^\top K A + m\epsilon I \big)^{-1} A^\top \mathbf{k}(x)$. This motivates the following definitions, where the notation $\tilde{\mathbf{g}}$ is replaced with $\tilde{\mathbf{z}}$, to be interpreted as target observations of $g$ at $\tilde{y}$.

**Definition 4.4 (Nonparametric DME Estimator).** The nonparametric DME estimator, also called the kernel DME estimator or the DME estimator in function space view, is $\bar{f}(x) = \bar{\alpha}^\top \mathbf{k}(x) = \sum_{i=1}^n \bar{\alpha}_i k(x_i, x)$, where $\bar{\alpha} := A \big( A^\top K A + m\epsilon I \big)^{-1} \tilde{\mathbf{z}}$ and $A := (L + n\lambda I)^{-1} \tilde{L}$. Equivalently, $\bar{f}(x) = \tilde{\mathbf{z}}^\top \big( A^\top K A + m\epsilon I \big)^{-1} A^\top \mathbf{k}(x)$. An alternative form is $\bar{f}(x) = \tilde{\mathbf{z}}^\top A^\top \big( K A A^\top + m\epsilon I \big)^{-1} \mathbf{k}(x)$.

When the features $\varphi(x) \in \mathbb{R}^p$ and $\psi(y) \in \mathbb{R}^q$ are finite dimensional, we define the parametric DME estimator as follows by rewriting definition 4.4 using the Woodbury identity.

**Definition 4.5 (Parametric DME Estimator).** The parametric DME estimator, also called the feature DME estimator or the DME estimator in weight space view, is $\bar{f}(x) = \bar{w}^\top \varphi(x)$, where $\bar{w} = \big[ \Phi A A^\top \Phi^\top + m\epsilon I \big]^{-1} \Phi A \tilde{\mathbf{z}}$ and $A := \Psi^\top (\Psi \Psi^\top + n\lambda I)^{-1} \tilde{\Psi}$. Equivalently, $\bar{f}(x) = \tilde{\mathbf{z}}^\top A^\top \Phi^\top \big[ \Phi A A^\top \Phi^\top + m\epsilon I \big]^{-1} \varphi(x)$.

In definition 4.4 (resp. 4.5), the computational complexity is dominated by the inversions of $L + n\lambda I$ and $A^\top K A + m\epsilon I$ (resp. $\Psi \Psi^\top + n\lambda I$ and $\Phi A A^\top \Phi^\top + m\epsilon I$) at $O(n^3)$ and $O(m^3)$ (resp. $O(q^3)$ and $O(p^3)$). For the alternative form in definition 4.4, both inversions are $O(n^3)$, allowing for larger $m$ at $O(m)$ cost without compromising tractability.
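Definition 4.4 can be implemented directly from the two Gram matrices and the cross Gram matrix. The sketch below assumes Gaussian RBF kernels and a toy generative process in which $r$ is the identity map; kernel lengthscales, noise levels, and variable names are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(1)
n, m, lam, eps = 200, 400, 1e-2, 1e-2
f_true = lambda t: np.sin(3.0 * t)

# Transformation dataset {x_i, y_i}: x relates to y, but z is never observed here.
y = rng.uniform(-2.0, 2.0, size=(n, 1))
x = y + 0.2 * rng.standard_normal((n, 1))

# Task dataset {y~_j, z~_j}: noisy observations of the conditional mean g(y) = E[f(X) | Y = y].
y_tilde = rng.uniform(-2.0, 2.0, size=(m, 1))
z_tilde = f_true(y_tilde + 0.2 * rng.standard_normal((m, 1))).ravel() + 0.05 * rng.standard_normal(m)

L = rbf_kernel(y, y, 0.3)               # L = Psi^T Psi
L_tilde = rbf_kernel(y, y_tilde, 0.3)   # L~ = Psi^T Psi~
K = rbf_kernel(x, x, 0.3)               # K = Phi^T Phi

A = np.linalg.solve(L + n * lam * np.eye(n), L_tilde)                    # A = (L + n*lam*I)^{-1} L~
alpha = A @ np.linalg.solve(A.T @ K @ A + m * eps * np.eye(m), z_tilde)  # weights of definition 4.4

x_query = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
f_bar = rbf_kernel(x_query, x, 0.3) @ alpha     # f_bar(x*) = alpha^T k(x*)
print(np.c_[f_bar, f_true(x_query).ravel()])    # rough recovery of f on this toy problem
```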
*Figure 1. Three dimensions of model extensions (T: Transformed, B: Bayesian, K/L: Kernel/Linear, (R)R: (Ridge) Regression), spanning the axes "kernelize features (inputs)", "Bayesian inference (latents)", and "transform observations (outputs)". Kernel extensions (blue) specify feature spaces implicitly through a kernel. Bayesian extensions (green) introduce notions of uncertainty on latent quantities (weights or functions). Finally, transformed extensions (orange) capture indirect function observations.*

## 5. Task Transformed Gaussian Processes

DMEs are constructed as solutions to the task of inferring deconditional means, which are often real-valued functions. Regression problems also address the inference of real-valued functions from data. This raises the question of whether DMEs can be formulated as solutions to a regression-like problem, and what insights such a connection would provide. In this section, we formulate the task transformed regression problem to provide regression views of DMEs. To do this, we first briefly review transformed regression in section 5.1 before presenting our contributions in section 5.2.

### 5.1. Transformed Regression

Standard regression models often assume a Gaussian full data likelihood $p(\mathbf{z} \mid \mathbf{f}) = \mathcal{N}(\mathbf{z}; \mathbf{f}, \sigma^2 I)$ with targets $\mathbf{z} := \{z_i\}_{i=1}^n \in \mathbb{R}^n$. In the generalized setting where observations of $f$ at $\mathbf{x}$ are not available but observations of linear combinations thereof are, we can use $p(\tilde{\mathbf{z}} \mid \mathbf{f}) = \mathcal{N}(\tilde{\mathbf{z}}; M^\top \mathbf{f}, \Sigma)$ for some transformation $M \in \mathbb{R}^{n \times m}$ and noise covariance $\Sigma$, where $\tilde{\mathbf{z}} := \{\tilde{z}_j\}_{j=1}^m \in \mathbb{R}^m$ are the available observations. We refer to this setting as transformed regression. It can be seen as another dimension of modeling with linear ridge regression (LRR) as the base model (fig. 1). Kernel ridge regression (KRR) is obtained from LRR via the kernel trick and the Woodbury identity, and the two are the maximum a posteriori (MAP) solutions, or predictive means, of Gaussian process regression (GPR) and Bayesian linear regression (BLR) respectively (Rasmussen & Williams, 2006). Consequently, we also refer to GPR as Bayesian kernel regression (BKR). Analogous relationships hold between the transformed models.

### 5.2. Task Transformed Regression

We define task transformed regression (TTR) as the problem of learning to predict a target variable $Z$ from features $X$ when no direct sample pairs of $X$ and $Z$ are available, but instead indirect samples $\{x_i, y_i\}_{i=1}^n$ and $\{\tilde{y}_j, \tilde{z}_j\}_{j=1}^m$ with a mediating variable $Y$ are available.
The name illustrates the idea of transforming the task of regressing $Z$ on $Y$ to learn $g : \mathcal{Y} \to \mathbb{R}$, using the task or original dataset $\{\tilde{y}_j, \tilde{z}_j\}_{j=1}^m$, into the task of regressing $Z$ on $X$ to learn $f : \mathcal{X} \to \mathbb{R}$, by mediating the task dataset through the transformation dataset $\{x_i, y_i\}_{i=1}^n$. As the mediating variable $Y$ links the two sets together, $y$ and $\tilde{y}$ control the task transformation.

**DME as solution to a chained loss.** We formulate losses for TTR and establish DMEs as their solutions. We begin with the parametric case, with $f(x) = w^\top \varphi(x)$ and $g(y) = v^\top \psi(y)$.

**Theorem 5.1 (Task Transformed LRR (TTLRR)).** The weights of the parametric DME estimator $\bar{f}(x) = \bar{w}^\top \varphi(x)$ (definition 4.5) solve the chained regularized least squares losses
$$\hat{v}[w] := \operatorname*{argmin}_{v \in \mathbb{R}^q} \; \frac{1}{n} \sum_{i=1}^n \big( w^\top \varphi(x_i) - v^\top \psi(y_i) \big)^2 + \lambda \lVert v \rVert^2,$$
$$\bar{w} := \operatorname*{argmin}_{w \in \mathbb{R}^p} \; \frac{1}{m} \sum_{j=1}^m \big( \tilde{z}_j - \hat{v}[w]^\top \psi(\tilde{y}_j) \big)^2 + \epsilon \lVert w \rVert^2.$$

The notation $\hat{v}[w]$ explicitly denotes that $\hat{v}$ depends on $w$. Conceptually, in the function space view, the first optimization finds $g$ so that $f$ at $x$ best matches $g$ at $y$, leading to a solution $\hat{g}[f]$ that depends on $f$. The second finds $f$ so that $\hat{g}[f]$ at $\tilde{y}$ best matches the targets $\tilde{z}$. Using the kernel trick $k(x, x') = \varphi(x)^\top \varphi(x')$, we obtain the nonparametric case.

**Lemma 5.2 (Task Transformed KRR (TTKRR)).** The weights of the nonparametric DME estimator $\bar{f}(x) = \bar{\alpha}^\top \mathbf{k}(x)$ (definition 4.4) satisfy $\bar{w} = \Phi \bar{\alpha}$ (the kernel trick).

**DME as posterior predictive mean of a TTGP.** We extend TTLRR and TTKRR to the Bayesian case. This connection reveals that TTR models are transformed regression models whose transformations and noise covariances are learned. In the parametric case, we have task transformed BLR (TTBLR). We place separate, independent Gaussian priors $p(v) = \mathcal{N}(v; 0, \beta^2 I)$ and $p(w) = \mathcal{N}(w; 0, \gamma^2 I)$ for $g$ and $f$ respectively. As $z$ is not observed directly from $f$ but only for $g$, we include noise only for observations of $g$, arriving at the likelihoods $p(z \mid v) = \mathcal{N}(z; v^\top \psi(y), \sigma^2)$ and $p(z \mid w) = \mathcal{N}(z; w^\top \varphi(x), 0)$ for $g$ and $f$ respectively. In the nonparametric case, we have task transformed BKR (TTBKR). We place GP priors $g \sim \mathcal{GP}(0, \ell)$ and $f \sim \mathcal{GP}(0, k)$ on the functions directly. Consequently, TTBKR is also referred to as task transformed Gaussian process regression (TTGPR). Similar to TTBLR, the likelihoods are $p(z \mid g) = \mathcal{N}(z; g(y), \sigma^2)$ and $p(z \mid f) = \mathcal{N}(z; f(x), 0)$.

The graphical model for TTGPR is shown in fig. 2. The two GPs for $g$ and $f$ are linked by constraining their targets to be the same at $\tilde{y}$ and $x$ respectively. The GP for $g$ is used to infer the predictive distribution $p(\tilde{\mathbf{z}} \mid \mathbf{z})$, which in turn specifies the overall likelihood $p(\tilde{\mathbf{z}} \mid \mathbf{f})$ used to infer $f$. Detailed derivations are provided in the proof of theorem 5.3.

*Figure 2. Graphical model (chain graph) for task transformed Gaussian process regression (TTGPR). Circles are random variables; shaded circles are observed random variables. Undirected edges indicate the GP field, where all random variables on the field are fully connected to each other (Rasmussen & Williams, 2006). The goal is to infer $f$ to predict $z^\star$ at $x^\star$, using only a task or original dataset $\{\tilde{y}_j, \tilde{z}_j\}_{j=1}^m$ and a transformation dataset $\{x_i, y_i\}_{i=1}^n$. To connect the two GPs, we posit that the unobserved targets $\mathbf{z}$ at $\mathbf{x}$ and at $\mathbf{y}$ would have been the same had they been observed. Note that, as with regular GPs, the TTGP inputs $x$ and $y$ are not modeled as random variables but treated as index variables.*

**Theorem 5.3 (Task Transformed BLR (TTBLR) and Task Transformed BKR (TTBKR)).** (1) The TTBLR is a transformed BLR (TBLR) with transformation $M = \Psi^\top (\Psi \Psi^\top + \frac{\sigma^2}{\beta^2} I)^{-1} \tilde{\Psi}$ and noise covariance $\Sigma = \sigma^2 \tilde{\Psi}^\top (\Psi \Psi^\top + \frac{\sigma^2}{\beta^2} I)^{-1} \tilde{\Psi} + \sigma^2 I$. (2) The TTBKR is a transformed BKR (TBKR) with transformation $M = (L + \sigma^2 I)^{-1} \tilde{L}$ and noise covariance $\Sigma = \tilde{\Psi}^\top \tilde{\Psi} + \sigma^2 I - \tilde{L}^\top (L + \sigma^2 I)^{-1} \tilde{L}$. (3) The TTBLR and TTBKR marginal likelihoods are $p(\tilde{\mathbf{z}}) = \mathcal{N}\big(\tilde{\mathbf{z}}; 0, [\Sigma^{-1} - \Sigma^{-1} A^\top \Phi^\top C \Phi A \Sigma^{-1}]^{-1}\big)$, where $C = [\Phi A \Sigma^{-1} A^\top \Phi^\top + \frac{1}{\gamma^2} I]^{-1}$, and $p(\tilde{\mathbf{z}}) = \mathcal{N}(\tilde{\mathbf{z}}; 0, A^\top K A + \Sigma)$ respectively. (4) For both models, when the posterior for $g$ is approximated via MAP, the covariance becomes $\Sigma = \sigma^2 I$. In this case, the parametric (resp. nonparametric) DME estimator of definition 4.5 (resp. 4.4) is the predictive mean of a TTBLR (resp. TTBKR) with $\lambda = \frac{\sigma^2}{n\beta^2}$ and $\epsilon = \frac{\sigma^2}{m\gamma^2}$ (resp. $\lambda = \frac{\sigma^2}{n}$ and $\epsilon = \frac{\sigma^2}{m}$). An alternative TTBKR marginal likelihood is then $p(\tilde{\mathbf{z}}) = \mathcal{N}\big(\tilde{\mathbf{z}}; 0, \sigma^2 [I - A^\top (K A A^\top + \sigma^2 I)^{-1} K A]^{-1}\big)$. (5) When both posteriors for $g$ and $f$ are approximated via MAP, TTBLR and TTBKR become TTLRR and TTKRR respectively, with $\lambda$ and $\epsilon$ as in (4).
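Under the MAP approximation of $g$ in (4), the TTBKR marginal likelihood and predictive mean take simple matrix forms. The sketch below evaluates the negative log marginal likelihood $-\log \mathcal{N}(\tilde{\mathbf{z}}; 0, A^\top K A + \sigma^2 I)$ over a crude hyperparameter grid (a stand-in for the gradient-based optimization used in the paper) and then computes the posterior predictive mean, which coincides with the nonparametric DME estimator with $\lambda = \sigma^2/n$ and $\epsilon = \sigma^2/m$; the kernels, toy data, and grid values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(2)
n, m = 150, 250
y = rng.uniform(-2.0, 2.0, (n, 1)); x = y + 0.2 * rng.standard_normal((n, 1))
y_tilde = rng.uniform(-2.0, 2.0, (m, 1))
z_tilde = np.sin(3.0 * (y_tilde + 0.2 * rng.standard_normal((m, 1)))).ravel() + 0.1 * rng.standard_normal(m)

def neg_log_marginal_likelihood(params):
    ls_x, ls_y, sigma = params
    K = rbf_kernel(x, x, ls_x)
    L = rbf_kernel(y, y, ls_y)
    A = np.linalg.solve(L + sigma**2 * np.eye(n), rbf_kernel(y, y_tilde, ls_y))  # M = (L + s^2 I)^{-1} L~
    cov = A.T @ K @ A + sigma**2 * np.eye(m)         # covariance of z~ in theorem 5.3 (3) with Sigma = s^2 I
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (z_tilde @ np.linalg.solve(cov, z_tilde) + logdet + m * np.log(2 * np.pi))

# Crude grid search over (lengthscale_x, lengthscale_y, noise std).
grid = [(lx, ly, s) for lx in (0.2, 0.4) for ly in (0.2, 0.4) for s in (0.05, 0.1, 0.2)]
ls_x, ls_y, sigma = min(grid, key=neg_log_marginal_likelihood)

# Posterior predictive mean of the TTGP = nonparametric DME estimator with lam = s^2/n, eps = s^2/m.
K = rbf_kernel(x, x, ls_x)
L = rbf_kernel(y, y, ls_y)
A = np.linalg.solve(L + sigma**2 * np.eye(n), rbf_kernel(y, y_tilde, ls_y))
x_star = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
mean = rbf_kernel(x_star, x, ls_x) @ A @ np.linalg.solve(A.T @ K @ A + sigma**2 * np.eye(m), z_tilde)
print((ls_x, ls_y, sigma), mean)
```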
Importantly, our end goal is to infer $f$. While this involves inferring $g$, $g$ is not of direct interest. A simpler alternative is to perform Bayesian inference only on $f$ and approximate $g$ with its MAP solution, simplifying the noise covariance via (4) of theorem 5.3. This establishes a Bayesian interpretation of DMEs as MAP estimates of TTGPs. Critically, by maximizing the TTGP marginal likelihood, we can learn the DME hyperparameters of the kernels $k$ and $\ell$, as well as $\lambda$ and $\epsilon$. Furthermore, the computational complexity of the alternative marginal likelihood is dominated by inversions that are $O(n^3)$ only, again allowing for larger $m$ at $O(m)$ cost.

In summary, we first establish the DME as a nonparametric solution to the TTR problem under chained regularized least squares losses that make learning $f$ dependent on learning $g$. While $\lambda$ and $\epsilon$ were previously seen as numerical adjustments to relax RKHS assumptions and stabilize matrix inversions in the KME framework, they can now be seen as controlling the amount of function regularization under this loss. Secondly, we present TTGPs as nonparametric Bayesian solutions to this regression problem and show that DMEs are their posterior predictive means. Again, inference of $f$ is dependent on the inference of $g$, allowing GP uncertainties to propagate through. This connection provides Bayesian interpretations of DMEs and enables uncertainty estimation when inferring deconditional means. Critically, we use it to derive marginal likelihoods for hyperparameter learning.

## 6. Nonparametric Bayes Rule

While DMOs were constructed as solutions to the deconditional mean problem, they also resemble Bayes rule when we focus solely on the encoded relationship between $X$ and $Y$. This was motivated by theorem 4.1, which revealed that the DMO can be fully specified by the CMO $C_{X|Y}$ and the second order mean embedding $C_{YY}$, which encode the likelihood $\mathbb{P}_{X|Y}$ and the prior $\mathbb{P}_Y$ respectively. To establish this view, we investigate the conditions under which the DMO $C'_{X|Y}$ coincides with the CMO $C_{Y|X}$ that encodes the posterior $\mathbb{P}_{Y|X}$, leading to a nonparametric Bayes rule.

While the first class citizens of probability rules are density evaluations, the first class citizens of the KME framework are expectations. Consequently, instead of relating density evaluations, rules under the KME framework relate mean embeddings of distributions at various orders. Importantly, while a distribution $Y \sim \mathbb{P}_Y$ has a single density evaluation $p_Y(y)$, it can have different RKHS representations at different orders, such as $\mu_Y$ and $C_{YY}$ or higher.
A nonparametric Bayes rule is a rule that translates Bayes rule into the RKHS, where distributions are represented as RKHS operators, alleviating the limitations of parametric assumptions such as Gaussian posteriors. It computes a posterior operator $C_{Y|X}$ when given only likelihood operators (e.g. $C_{X|Y}$) and prior operators (e.g. $C_{YY}$). The DMO is appealing because all operators involved are of second order, and the same second order likelihood and prior operators are used for both the joint and the evidence operator. However, because $C'_{XX}$ is not necessarily the same as $C_{XX}$, the DMO $C'_{X|Y} = (C_{XY})^\top (C'_{XX})^{-1}$ is not necessarily the posterior operator $C_{Y|X} = (C_{XY})^\top (C_{XX})^{-1}$. Nevertheless, under certain conditions they coincide.

**Theorem 6.1.** If $C_{X|Y} C_{Y|X} C_{XX} = C_{XX}$, then $C'_{XX} = C_{XX}$ and $C'_{X|Y} = C_{Y|X}$.

A special instance where the assumptions are met is when $X = r(Y)$, where $r$ is not necessarily invertible. Importantly, for empirical DMOs, having $x_i = x_j$ whenever $y_i = y_j$ suffices, which can be achieved if all $y_i$ are unique. Furthermore, empirical DMOs $\bar{C}'_{X|Y}$ can be seen as generalizations of empirical CMOs $\hat{C}_{Y|X}$ in the other direction.

**Theorem 6.2.** If $m = n$ and $\tilde{y}_i = y_i$ for all $i \in \{1, \dots, n\}$, then the empirical DMO corresponding to $\mathbb{P}_{X|Y}$ becomes the empirical CMO corresponding to $\mathbb{P}_{Y|X}$ as $\lambda \to 0^+$:
$$\lim_{\lambda \to 0^+} \bar{C}'_{X|Y} = \Psi \big( K + n\epsilon I \big)^{-1} \Phi^\top = \hat{C}_{Y|X}. \tag{6.1}$$

Intuitively, suppose $\{x_i, y_i\}_{i=1}^n$ are drawn from $p(x, y) := p(x \mid y)\, p(y)$ and $\{\tilde{y}_j\}_{j=1}^m = \{y_i\}_{i=1}^n$ is drawn from $p(y)$; then the DMO of $p(x \mid y)$ is equivalent to the CMO of $p(y \mid x) = p(x \mid y)\, p(y) / \int_{\mathcal{Y}} p(x \mid y)\, p(y)\, dy$ in the other direction, as per theorem 6.2. In general, however, if $\{x_i, y_i\}_{i=1}^n$ are drawn from $q(x, y) := p(x \mid y)\, q(y)$ and $\{\tilde{y}_j\}_{j=1}^m$ is drawn from $p(y)$, then using only the joint samples from $q(x, y)$ to build the CMO yields the CMO of $q(y \mid x) = p(x \mid y)\, q(y) / \int_{\mathcal{Y}} p(x \mid y)\, q(y)\, dy$, while using both the joint samples from $q(x, y)$ and the marginal samples from $p(y)$ to build the DMO yields the CMO corresponding to $p(y \mid x) = p(x \mid y)\, p(y) / \int_{\mathcal{Y}} p(x \mid y)\, p(y)\, dy$. Appendix E details further parallels with probabilistic rules.
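Theorem 6.2 can be checked numerically: when $\tilde{y} = y$ and $\lambda$ is taken very small, the empirical DMO applied to a query $k(x^\star, \cdot)$ should produce the same weights as the empirical CMO in the reverse direction. The sketch below assumes a Gaussian RBF kernel and a toy deterministic-plus-noise relationship; the agreement is exact only in the limit and up to the numerical conditioning of $L$.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

rng = np.random.default_rng(3)
n, eps, lam = 100, 1e-2, 1e-9                      # lambda taken close to 0+, as in the limit of theorem 6.2
y = np.linspace(-2.0, 2.0, n).reshape(-1, 1)       # y_tilde = y and m = n
x = np.tanh(y) + 0.1 * rng.standard_normal((n, 1))

K = rbf_kernel(x, x, 0.5)
L = rbf_kernel(y, y, 0.5)
L_tilde = L.copy()                                 # since y_tilde = y

# Empirical DMO applied to k(x*, .), expressed as weights on the features psi(y_i).
A = np.linalg.solve(L + n * lam * np.eye(n), L_tilde)
x_star = np.array([[0.3]])
k_star = rbf_kernel(x, x_star, 0.5).ravel()
dmo_weights = np.linalg.solve(A.T @ K @ A + n * eps * np.eye(n), A.T @ k_star)

# Empirical CMO in the other direction, Psi (K + n*eps*I)^{-1} Phi^T, applied to the same k(x*, .).
cmo_weights = np.linalg.solve(K + n * eps * np.eye(n), k_star)

# The gap shrinks as lambda -> 0+, up to the numerical conditioning of L.
print(np.max(np.abs(dmo_weights - cmo_weights)))
```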
## 7. Related Work

DMOs have strong connections to KBR; both provide a nonparametric Bayes rule under the KME framework. In contrast to DMOs, where both the likelihood and prior operators are of second order, both KBR(a) and KBR(b) (Song et al., 2013; Fukumizu et al., 2013) use a third order likelihood operator $C_{XX|Y}$ and a first order prior embedding $\mu_Y$ for the evidence operator $C_{XX} = C_{XX|Y}\, \mu_Y$. KBR(b) further uses a different third order likelihood operator $C_{XY|Y}$ for the joint operator $C_{XY} = C_{XY|Y}\, \mu_Y$. Consequently, KBR becomes sensitive to inverse regularizations, and the effects of the prior samples $\tilde{y}$ can vanish. For instance, when $\epsilon \to 0^+$, KBR(b) degenerates to $C_{Y|X} = \Psi K^{-1} \Phi^\top$, a CMO that no longer depends on $\tilde{y}$. Instead, DMOs degenerate to $\bar{C}'_{X|Y} = \tilde{\Psi} A^\top (A A^\top)^{-1} K^{-1} \Phi^\top$, retaining their original structure. Detailed comparisons are provided in appendix F.

Viewing KMEs as regressors provides valuable insights and interpretations for the framework. CMOs can be established as regressors whose vector-valued targets are themselves kernel induced features (Grünewälder et al., 2012). In contrast, we establish DMOs as solutions to task transformed regression, which recovers latent functions that, together with a likelihood, govern the interactions between three variables. The TTR problem describes the setting of learning from conditional distributions in the extreme case where only one sample $x_i$ is available for each $y_i$ to describe $p(x \mid y)$. Dual KMEs (Dai et al., 2017) formulate this setting as a saddle point problem and employ stochastic approximations to efficiently optimize over the function space. However, without connections to Bayesian models such as TTGPs that admit a marginal likelihood, hyperparameter selection often requires inefficient grid search. Hyperparameter learning of marginal embeddings has been investigated by placing GP priors on the embedding itself to yield a marginal likelihood objective (Flaxman et al., 2016); however, it is unclear how this can be extended to CMEs. Our marginal likelihoods (theorem 5.3 and theorem G.1) provide such an objective for DMEs and, due to theorem 6.2, can also be applied to CMEs as a special case.

## 8. Applications and Experiments

While DMEs are developed to complement the theoretical framework of KMEs, in this section we describe and demonstrate some of their practical applications with experiments.

### 8.1. Hyperparameter Learning for TTR

We first illustrate in fig. 3 the TTR problem, the primary application of TTGPs and DMEs. While $X$ and $Y$ are multivariate in general, we use 1D examples to enable visualization. Although this is a 1D problem, the simulation process $p(x \mid y)$ and the observation process $p(z \mid y)$ are governed by non-trivial relationships, and successful recovery of $f$ requires dealing with difficult multi-modalities in $p(y \mid x)$.

*Figure 3. Illustration of latent function recovery with a task transformed Gaussian process (TTGP) on non-trivial simulation and observation processes $p(x \mid y)$ and $p(z \mid y)$. (Left) The simulation process $p(x \mid y)$. (Center) The observation process $p(z \mid y)$. (Right) The true latent function $f$, the naive solutions using cascaded regressors and imputed data, and the mean and uncertainty bounds of the TTGP (also the Bayesian DME) with initial and learned hyperparameters. All bounds are 2 standard deviations from the mean.*

To generate the data, we choose non-trivial functions $r$ and $f$ and generate $X_i = r(Y_i) + \eta_i$ and $\tilde{Z}_j = f(r(\tilde{Y}_j) + \tilde{\eta}_j) + \tilde{\xi}_j$, where $Y_i, \tilde{Y}_j \sim \mathcal{U}(-6, 6)$ and $\eta_i, \tilde{\eta}_j, \tilde{\xi}_j \sim \mathcal{N}(0, 0.25^2)$ for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$. In this way, $p(x \mid y) = \mathcal{N}(x; r(y), 0.25^2)$, $p(z \mid x) = \mathcal{N}(z; f(x), 0.25^2)$, and $\mathbb{E}[\tilde{Z} \mid \tilde{Y} = y] = \mathbb{E}[f(r(y) + \eta) + \xi] = \mathbb{E}[f(X) \mid Y = y]$. By optimizing the marginal likelihood in theorem 5.3 (3), the DME is able to adapt from its initial hyperparameters and learn the latent function accurately.

We compare this to two naive solutions that one might propose when faced with a TTR problem. The cascade method trains separate regressors from $X$ to $Y$ with the transformation set and from $Y$ to $Z$ with the task set; it uses the former to predict $y^\star$ from $x^\star$ and the latter to predict $z^\star$ from $y^\star$. The impute method trains a regressor from $Y$ to $Z$ with the task set, predicts $z_{\mathrm{fake}}$ at the locations $y$, and then trains a regressor on the dataset $(x, z_{\mathrm{fake}})$ to predict $z^\star$ from a new $x^\star$. We use GPR means (KRR) for all such regressors. Both methods suffer because uncertainty propagation is lost by training the regressors separately.
The cascade method suffers further because $p(y \mid x)$ is usually highly multi-modal, as in this example, so unimodal regressors such as GPR from $X$ to $Y$ are unsuitable. This also highlights that while DMEs provide unimodal Gaussian uncertainty on function evaluations, and thus on $Z$, they capture multi-modality as a nonparametric Bayes rule between $X$ and $Y$.

### 8.2. Sparse Representation Learning with TTGP

A special case of the TTR problem is learning sparse representations for big data with trainable inducing points. Continuing with the notation used so far, we are given a large original dataset $(\tilde{y}, \tilde{z})$ of size $m$, with inputs $\tilde{Y}$ and targets $\tilde{Z}$. We let the transformation dataset be a set of $n$ inducing points $x = y = u$ for $Y$, where $n \ll m$. That is, we degenerate to $X = Y$ and $\mathcal{X} = \mathcal{Y}$. We maximize the alternative marginal likelihood in theorem 5.3 (4) with respect to the inducing points and learn the TTGP hyperparameters jointly. For the predictive mean we use the alternative computational form in definition 4.4; a similar form exists for the covariance. These alternative forms are suitable for this application because $n$ is small for the $O(n^3)$ inversions and the dependence on $m$ is only $O(m)$. We illustrate this process in fig. 4.

*Figure 4. Sparse representation learning on a dataset of 100 points generated by using the toy process from Rasmussen & Williams (2006) as ground truth. (Left) The true function, represented exactly by a GP mean using 5 points. (Left Center) DME using 5 random points. (Right Center) DME with the 5 points and all hyperparameters learned jointly via its marginal likelihood. (Right) DME with the 5 points learned via its marginal likelihood under true hyperparameters. The vertical position of the sparse representation has no meaning.*

### 8.3. Likelihood-Free Inference with DME

As a nonparametric Bayes rule, DMEs can be used for likelihood-free inference (LFI) (Marin et al., 2012), where likelihood evaluations are intractable but sampling from a simulator $x \sim p(x \mid \theta)$ is possible. The simulator takes parameters $\theta$ and stochastically generates simulated data, which are often summarized into statistics $x$. Observed data are also summarized into statistics $y$, and discrepancies with $x$ are often measured by an $\epsilon$-kernel $\kappa_\epsilon(y, x) = p_\epsilon(y \mid x)$ such that $p_\epsilon(y \mid \theta) = \int p_\epsilon(y \mid x)\, p(x \mid \theta)\, dx$. This $\epsilon$ is not to be confused with the regularization used for $C'_{XX}$, which we denote by $\delta$ in this section only. After selecting a prior $p(\theta)$, the goal is to approximate the posterior $p_\epsilon(\theta \mid y)$.

Translating notation into the LFI setting, we have $x_i \mapsto x_i$, $y_i \mapsto \theta_i$, $\tilde{y}_j \mapsto \tilde{\theta}_j$, and $x \mapsto y$. We first simulate $x_i \sim p(x \mid \theta_i)$ on parameters $\{\theta_i\}_{i=1}^n \sim \pi(\theta)$, not necessarily drawn from the prior, to obtain $\{\theta_i, x_i\}_{i=1}^n$ for the likelihood, and we sample $\{\tilde{\theta}_j\}_{j=1}^m \sim p(\theta)$ for the prior. We then build the DME $\mu'_{\Theta|X=y}$ and sample it with kernel herding (Chen et al., 2010) to obtain posterior super-samples. This is described in algorithm 8.1. We also provide the approximate marginal likelihood objective $\tilde{q}$ to maximize for hyperparameter learning of the DME. Derivations are detailed in appendix G.

**Algorithm 8.1: Deconditional Mean Embeddings for LFI**

1. **Input:** data $y$, simulations $\{\theta_i, x_i\}_{i=1}^n \sim p(x \mid \theta)\, \pi(\theta)$, prior samples $\{\tilde{\theta}_j\}_{j=1}^m \sim p(\theta)$, query points $\{\theta^\star_r\}_{r=1}^R$, kernels $k$, $\kappa_\epsilon$, $\ell$, and $\ell'$, regularizations $\lambda$ and $\delta$.
2. $L \leftarrow \{\ell(\theta_i, \theta_j)\}_{i,j=1}^{n,n}$, $\quad \tilde{L} \leftarrow \{\ell(\theta_i, \tilde{\theta}_j)\}_{i,j=1}^{n,m}$.
3. $A \leftarrow (L + n\lambda I)^{-1} \tilde{L}$, $\quad L^\star \leftarrow \{\ell(\tilde{\theta}_j, \theta^\star_r)\}_{j,r=1}^{m,R}$.
4. $K \leftarrow \{k(x_i, x_j)\}_{i,j=1}^{n,n}$, $\quad \mathbf{k}(y) \leftarrow \{k(x_i, y)\}_{i=1}^n$.
5. DME: $\mu \leftarrow (L^\star)^\top A^\top \big( K A A^\top + m\delta I \big)^{-1} \mathbf{k}(y) \in \mathbb{R}^R$.
6. For $s \in \{1, \dots, S\}$, with $a \leftarrow 0 \in \mathbb{R}^R$ initialized:
7. &nbsp;&nbsp;&nbsp;&nbsp;$\hat{\theta}_s \leftarrow \theta^\star_{r^\star}$ where $r^\star \leftarrow \operatorname{argmax}_r\, \mu_r - (a_r / s)$.
8. &nbsp;&nbsp;&nbsp;&nbsp;$a \leftarrow a + \{\ell'(\theta^\star_r, \hat{\theta}_s)\}_{r=1}^R$.
9. End for.
10. **Output:** posterior super-samples $\{\hat{\theta}_s\}_{s=1}^S$.
11. **Learning:** $\tilde{q} \leftarrow \mathrm{mean}(A^\top \boldsymbol{\kappa}_\epsilon)$, where $\boldsymbol{\kappa}_\epsilon \leftarrow \{\kappa_\epsilon(y, x_i)\}_{i=1}^n$.
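A minimal end-to-end sketch of algorithm 8.1 follows, using a toy Gaussian simulator whose summary statistic is a noisy mean; the simulator, prior, proposal, kernel choices, and the $\epsilon$-kernel are illustrative assumptions, and a uniform grid over the parameter stands in for the query points $\{\theta^\star_r\}_{r=1}^R$.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / ls**2)

rng = np.random.default_rng(4)
n, m, R, S = 200, 500, 400, 50
lam, delta = 1e-3, 1e-3

# Toy simulator: theta is the mean of a Gaussian; the summary statistic is a noisy sample mean.
simulate = lambda theta: theta + 0.5 * rng.standard_normal(theta.shape) / np.sqrt(20)
y_obs = np.array([[0.8]])                                  # observed summary statistic

theta = rng.uniform(-3, 3, (n, 1))                         # proposal pi(theta), not necessarily the prior
x_sim = simulate(theta)                                    # simulations x_i ~ p(x | theta_i)
theta_tilde = rng.normal(0.0, 1.0, (m, 1))                 # prior samples from p(theta) = N(0, 1)
theta_query = np.linspace(-3, 3, R).reshape(-1, 1)         # query points theta*_r

# Lines 2-5 of algorithm 8.1: build the DME of the posterior, evaluated at the query points.
L = rbf(theta, theta, 0.5)
L_tilde = rbf(theta, theta_tilde, 0.5)
A = np.linalg.solve(L + n * lam * np.eye(n), L_tilde)
K = rbf(x_sim, x_sim, 0.2)
k_y = rbf(x_sim, y_obs, 0.2).ravel()
L_star = rbf(theta_tilde, theta_query, 0.5)
mu = L_star.T @ A.T @ np.linalg.solve(K @ A @ A.T + m * delta * np.eye(n), k_y)

# Lines 6-9: kernel herding over the query points to draw posterior super-samples.
a = np.zeros(R)
samples = []
for s in range(1, S + 1):
    r_star = np.argmax(mu - a / s)
    theta_hat = theta_query[r_star]
    samples.append(theta_hat)
    a += rbf(theta_query, theta_hat.reshape(1, -1), 0.5).ravel()

print(np.mean(samples), np.std(samples))                   # approximate posterior mean and spread

# Line 11: approximate marginal likelihood objective for hyperparameter learning.
kappa_eps = rbf(x_sim, y_obs, 0.2).ravel()                 # epsilon-kernel choice is an assumption
q_tilde = np.mean(A.T @ kappa_eps)
```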
*Figure 5. Application to LFI. (Left) Approximate posteriors under kernel based LFI methods for the toy exponential-gamma problem using 100 simulations; methods shown include the true posterior, DME, KELFI, K-ABC, K2-ABC, KBR, and GPS-ABC. Global and Local refer to the optimality of model hyperparameters with respect to their respective approximate marginal likelihoods. (Right) Approximate posteriors of the first (log) parameter for the Lotka-Volterra problem, shown for KBR, KELFI, and DME against the true parameter. Error bars represent the middle 95% credible interval.*

Figure 5 demonstrates algorithm 8.1 on two standard benchmarks. For the toy exponential-gamma problem we compare directly with other kernel approaches. As simulations are usually very expensive, we show the case with very limited simulations ($n = 100$), leading most methods to produce posteriors wider than the ground truth. Nevertheless, by optimizing $\tilde{q}$ in line 11, DMEs can adapt their kernel length scales accordingly. For Lotka-Volterra, the ABC methods used more than 100000 simulations, while MDN used 10000 simulations. To achieve competitive accuracy, kernel approaches such as DME, KELFI (Hsu & Ramos, 2019), and KBR used 2000, 2500, and 2500 simulations respectively. Acronyms and experimental details are described in appendix G.

## 9. Conclusion

The connections of DMEs with CMEs and GPs produce useful insights into the KME framework and are important steps towards establishing Bayesian views of KMEs. DMEs provide novel solutions to a class of nonparametric Bayesian regression problems and enable applications such as sparse representation learning and LFI. For future work, relaxing the assumptions required for DMOs to act as a nonparametric Bayes rule can have fruitful theoretical and practical implications.

## References

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI 2010), pp. 109-116. AUAI Press, 2010.

Dai, B., He, N., Pan, Y., Boots, B., and Song, L. Learning from conditional distributions via dual embeddings. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1458-1467. PMLR, 2017.

Flaxman, S., Sejdinovic, D., Cunningham, J. P., and Filippi, S. Bayesian learning of kernel embeddings. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 182-191. AUAI Press, 2016.

Fukumizu, K., Bach, F. R., and Jordan, M. I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73-99, 2004.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(1):3753-3783, 2013.

Grünewälder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), volume 2, pp. 1823-1830, 2012.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. 1970.

Higham, N. J. Accuracy and Stability of Numerical Algorithms. SIAM, 2002.
Hsu, K. and Ramos, F. Bayesian learning of conditional kernel mean embeddings for automatic likelihood-free inference. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2631-2640. PMLR, 2019.

Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167-1180, 2012.

Meeds, E. and Welling, M. GPS-ABC: Gaussian process surrogate approximate Bayesian computation. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 593-602. AUAI Press, 2014.

Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.

Murray-Smith, R. and Pearlmutter, B. A. Transformations of Gaussian process priors. In Deterministic and Statistical Methods in Machine Learning, pp. 110-123. Springer, 2005.

Nakagome, S., Fukumizu, K., and Mano, S. Kernel approximate Bayesian computation in population genetic inferences. Statistical Applications in Genetics and Molecular Biology, 12(6):667-678, 2013.

Papamakarios, G. and Murray, I. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, pp. 1028-1036, 2016.

Park, M., Jitkrittum, W., and Sejdinovic, D. K2-ABC: Approximate Bayesian computation with kernel embeddings. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pp. 398-407. PMLR, 2016.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961-968. ACM, 2009.

Song, L., Fukumizu, K., and Gretton, A. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98-111, 2013.

Tran, D., Ranganath, R., and Blei, D. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, pp. 5523-5533, 2017.