Importance Weighting Approach in Kernel Bayes Rule

Liyuan Xu 1, Yutian Chen 2, Arnaud Doucet 2, Arthur Gretton 1
1 Gatsby Unit  2 DeepMind. Correspondence to: Liyuan Xu.

We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected kernel posterior features, based on regression from learned neural net or kernel features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show convergence of the estimator using a novel consistency analysis of the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high-dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other methods.

1. Introduction

Many machine learning applications reduce to the problem of inferring latent variables in probabilistic models. This is achieved using Bayes rule, where a prior distribution over the latent variables is updated to obtain the posterior distribution, based on a likelihood function. Probabilities involved in the Bayes update have generally been expressed as probability density functions. For many interesting problems, the exact posterior density is intractable; in the event that the likelihood is known, we may use approximate inference techniques, including Markov chain Monte Carlo (MCMC) and variational inference (VI). When the
likelihood function is unknown, however, and only samples drawn from it can be obtained, these methods do not apply.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

We propose to use a kernel mean embedding representation of our probability distributions as the key quantity in our Bayesian updates (Smola et al., 2007), which will enable nonparametric inference without likelihood functions. Kernel mean embeddings characterize probability distributions as expectations of features in a reproducing kernel Hilbert space (RKHS). When the space is characteristic, these embeddings are injective (Fukumizu et al., 2009; Sriperumbudur et al., 2010). Expectations of RKHS functions may be expressed as dot products with the corresponding mean embeddings. Kernel mean embeddings have been employed extensively in nonparametric hypothesis testing (Gretton et al., 2012; Chwialkowski et al., 2016); they have also been used in supervised learning for distribution-valued inputs (Muandet et al., 2012; Szabó et al., 2016), and in various other inference settings (Song et al., 2009; Grünewalder et al., 2012; Boots et al., 2013; Jitkrittum et al., 2015; Singh et al., 2019; Muandet et al., 2021). We will focus here on the Kernel Bayes Rule (KBR), where a prior mean embedding is taken as input and is updated to return the mean embedding of the posterior. The goal is to express all quantities involved in the KBR updates as RKHS functions learned from observed data: parametric models should not be required at any stage of the computation.
We will address in particular the application of kernel Bayes rule to filtering, where a latent state evolves stochastically over time according to Markovian dynamics, with noisy, high-dimensional observations providing information on the current state: the task is to construct a posterior distribution over the current state from the sequence of noisy observations up to the present. In the event that a parametric model is known for the latent dynamics, filtering can be achieved either by sampling from the model (Kanagawa et al., 2016) or through closed-form updates (Nishiyama et al., 2020); here, however, we propose to make no modeling assumptions concerning the latent dynamics, beyond observing them in a training stage. An alternative to kernel Bayes updates in the filtering setting was proposed by Jasra et al. (2012) but, for high-dimensional observations, it requires introducing summary statistics, which have a high impact on performance and are difficult to select.

The first formulation of Kernel Bayes Rule, due to Fukumizu et al. (2013), yields the desired model-free, sample-derived posterior embedding estimates, and a filtering procedure that is likewise nonparametric and model-free. At a high level, the resulting posterior is obtained via a two-stage least squares procedure, where the output of the first regression stage is used as input to the second stage. Unfortunately, the original KBR formulation requires an unconventional form of regularization in the second stage, which adversely impacts the attainable convergence rates (see Section 3.2) and decreases performance in practice. In the present work, we introduce importance-weighted KBR (IW-KBR), a novel design for KBR which does not require the problematic second-stage regularization. The core idea of our approach is to use importance weighting, rather than two-stage regression, to achieve the required Bayesian update.
We provide a convergence analysis, requiring as an intermediate step a novel analysis of our importance weighting estimator in the infinity norm, which may be of independent interest. Our IW-KBR improves on the convergence rate of the original KBR under certain conditions. As an algorithmic contribution, we introduce adaptive neural network features into IW-KBR, which are essential when the observations are images or other complex objects. In experiments, IW-KBR outperforms the original KBR across all benchmarks considered, including filtering problems with state-space models involving high-dimensional image observations.

The paper is structured as follows. In Section 2, we introduce the basic concepts of kernel methods and review the original KBR approach. We also introduce density ratio estimators, which are used in estimating the importance weights. In Section 3, we introduce the proposed KBR approach using kernel and neural net features, and provide convergence guarantees when kernel features are employed. We describe the Kernel Bayes Filter (KBF) in Section 4, which applies KBR to the filtering problem in a state-space model; Section 5 reviews related work on neural filtering. We demonstrate the empirical performance in Section 6, covering three settings: non-parametric Bayesian inference, low-dimensional synthetic filtering settings introduced in Fukumizu et al. (2013), and challenging KBF settings where the observations consist of high-dimensional images.

2. Background and preliminaries

In this section, we introduce kernel mean embeddings, which represent probability distributions by expected RKHS features. We then review the Kernel Bayes Rule (KBR) (Fukumizu et al., 2013), which aims to learn the posterior mean embedding according to Bayes rule.

Kernel mean embeddings: Let $(X,Z)$ be random variables on $\mathcal X \times \mathcal Z$ with distribution $P$ and density function $p(x,z)$, and let $k_{\mathcal X} : \mathcal X \times \mathcal X \to \mathbb R$ and $k_{\mathcal Z} : \mathcal Z \times \mathcal Z \to \mathbb R$ be measurable positive definite kernels, corresponding to the scalar-valued RKHSs $\mathcal H_{\mathcal X}$ and $\mathcal H_{\mathcal Z}$, respectively.
We denote the feature maps as $\psi(x) = k_{\mathcal X}(x,\cdot)$ and $\phi(z) = k_{\mathcal Z}(z,\cdot)$, and the RKHS norm as $\|\cdot\|$. The kernel mean embedding $m_P(X)$ of a marginal distribution $P(X)$ is defined as $m_P(X) = \mathbb E_P[\psi(X)] \in \mathcal H_{\mathcal X}$, and always exists for bounded kernels. From the reproducing property, we have $\langle f, m_P(X)\rangle = \mathbb E_P[f(X)]$ for all $f \in \mathcal H_{\mathcal X}$, which is useful for estimating expectations of functions. Furthermore, it is known that the embedding uniquely defines the probability distribution (i.e. $m_P(X) = m_Q(X)$ implies $P = Q$) if the kernel $k_{\mathcal X}$ is characteristic: for instance, the Gaussian kernel has this property (Fukumizu et al., 2009; Sriperumbudur et al., 2010).

In addition to kernel means, we will require kernel covariance operators,
$$C^{(P)}_{XX} = \mathbb E_P[\psi(X)\otimes\psi(X)], \quad C^{(P)}_{XZ} = \mathbb E_P[\psi(X)\otimes\phi(Z)], \quad C^{(P)}_{ZX} = (C^{(P)}_{XZ})^*, \quad C^{(P)}_{ZZ} = \mathbb E_P[\phi(Z)\otimes\phi(Z)],$$
where $\otimes$ is the tensor product such that $[a\otimes b]c = a\langle b,c\rangle$, and $^*$ denotes the adjoint of the operator. Covariance operators generalize finite-dimensional covariance matrices to the case of infinite kernel feature spaces, and always exist for bounded kernels.

The kernel conditional mean embedding (Song et al., 2009) is the extension of the kernel mean embedding to conditional probability distributions, and is defined as $m_P(Z|X)(x) = \mathbb E_P[\phi(Z)|X=x] \in \mathcal H_{\mathcal Z}$. Under the regularity condition $\mathbb E_P[g(Z)|X=\cdot] \in \mathcal H_{\mathcal X}$ for all $g \in \mathcal H_{\mathcal Z}$, there exists an operator $E_P \in \mathcal H_{\mathcal X} \otimes \mathcal H_{\mathcal Z}$ such that $m_P(Z|X)(x) = (E_P)^\top\psi(x)$ (Song et al., 2009; Grünewalder et al., 2012; Singh et al., 2019). The conditional operator $E_P$ can be expressed as the minimizer of the surrogate regression loss $\ell(P)$, defined as
$$\ell(P)(E) = \mathbb E_P\left[\|\phi(Z) - E^\top\psi(X)\|^2\right].$$
This minimization can be solved analytically, and the closed-form solution is given as $E_P = (C^{(P)}_{XX})^{-1}C^{(P)}_{XZ}$. An empirical estimate of the conditional mean embedding is straightforward.
Given i.i.d. samples $\{(x_i,z_i)\}_{i=1}^n \sim P$, we minimize $\hat\ell(P,\lambda)$, defined as
$$\hat\ell(P,\lambda)(E) = \frac{1}{n}\sum_{i=1}^n \|\phi(z_i) - E^\top\psi(x_i)\|^2 + \lambda\|E\|^2,$$
which is the sample estimate of $\ell(P)$ with added Tikhonov regularization; the norm $\|E\|$ is the Hilbert-Schmidt norm. The solution of this minimization problem is $\hat E_{P,\lambda} = (\hat C^{(P)}_{XX} + \lambda I)^{-1}\hat C^{(P)}_{XZ}$. Here, we denote by $\hat C^{(P)}_{ZX}, \hat C^{(P)}_{XX}, \hat C^{(P)}_{XZ}, \hat C^{(P)}_{ZZ}$ the empirical estimates of the covariance operators $C^{(P)}_{ZX}, C^{(P)}_{XX}, C^{(P)}_{XZ}, C^{(P)}_{ZZ}$, respectively; e.g. $\hat C^{(P)}_{ZX} = \frac{1}{n}\sum_{i=1}^n \phi(z_i)\otimes\psi(x_i)$.

Kernel Bayes Rule: Based on the conditional mean embedding, Fukumizu et al. (2013) proposed a method to realize Bayes rule in the kernel mean embedding representation. In Bayes rule, we aim to update the prior distribution $\Pi$ on latent variables $Z$, with density function $\pi(z)$, to the posterior $Q(Z|X = \bar x)$, where the joint density $q(z,x)$ of the distribution $Q$ is given by $q(z,x) = \pi(z)g(x|z)$. Here, $g(x|z)$ is the likelihood function and $\bar x \in \mathcal X$ is the conditioning point. The density of $Q(Z|X = \bar x)$ is often intractable, since it involves computing the integral $\int q(z,\bar x)\,dz$. Instead, KBR aims to update the embedding of $\Pi$, denoted $m_\Pi$, to the embedding of the posterior, $m_Q(Z|X)(\bar x)$. Fukumizu et al. (2013) show that such an update does not require a closed-form expression for the likelihood function $g(x|z)$. Rather, KBR learns the relations between latent and observable variables from the data $\{(x_i,z_i)\}_{i=1}^n \sim P$, where the density of the data distribution $P$ shares the same likelihood: $p(z,x) = p(z)g(x|z)$. We require $P$ to share the likelihood of $Q$; otherwise the relation between observations and latents cannot be learned. We remark that KBR also applies to Approximate Bayesian Computation (ABC) (Tavaré et al., 1997; Marjoram et al., 2003), in which the likelihood function is intractable, but we can simulate from it.
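As a concrete illustration of the conditional mean embedding above: in Gram-matrix form, with a linear kernel on $Z$, the estimated conditional mean $\mathbb E[Z|X=\bar x]$ is a weighted sum of the $z_i$ with weights $(G_X + n\lambda I)^{-1}g_{\bar x}$. A minimal NumPy sketch follows; the Gaussian kernel, bandwidth, sample size, and toy data are illustrative choices, not the paper's.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bandwidth**2))

def cme_weights(X, x_new, lam=0.01):
    """Weights alpha such that the estimated conditional mean embedding is
    m(Z | X = x_new) ~= sum_i alpha_i * phi(z_i)   (kernel ridge regression)."""
    n = X.shape[0]
    G_X = gaussian_gram(X, X)
    g_x = gaussian_gram(X, x_new[None, :])[:, 0]
    return np.linalg.solve(G_X + n * lam * np.eye(n), g_x)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
Z = X + 0.1 * rng.normal(size=(100, 1))   # Z depends strongly on X
alpha = cme_weights(X, np.array([0.5]))
# with a linear kernel on Z, E[Z | X = 0.5] is estimated by sum_i alpha_i * z_i
posterior_mean = alpha @ Z[:, 0]
```

Since $Z = X + \text{noise}$ here, the estimate should be close to the true conditional mean $\mathbb E[Z|X=0.5] = 0.5$.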
If $\pi(z) = p(z)$, we could simply estimate $m_Q(Z|X)(\bar x)$ by the conditional mean embedding learned from $\{(x_i,z_i)\}_{i=1}^n$, but we will generally have $\pi(z) \neq p(z)$, since the prior can be the result of another inference step (notably for filtering: see Section 4). We present the solution of Fukumizu et al. (2013) for this case. Let $C^{(Q)}_{ZX}, C^{(Q)}_{XX}, C^{(Q)}_{XZ}, C^{(Q)}_{ZZ}$ be the covariance operators on the distribution $Q$, defined similarly to the covariance operators on $P$. Similarly, the conditional operator $E_Q$ satisfying $m_Q(Z|X)(x) = (E_Q)^\top\psi(x)$ is a minimizer of the loss
$$\ell(Q)(E) = \mathbb E_Q\left[\|\phi(Z) - E^\top\psi(X)\|^2\right].$$
Unlike in the conditional mean embedding case, however, we cannot directly minimize this loss, since the data $\{(x_i,z_i)\}$ are not sampled from $Q$. Instead, Fukumizu et al. use the analytical form $E_Q = (C^{(Q)}_{XX})^{-1}C^{(Q)}_{XZ}$, and replace each operator with vector-valued kernel ridge regression estimates,
$$\hat C^{(Q,\mathrm{org})}_{XZ} = \sum_{i=1}^n \gamma_i\,\psi(x_i)\otimes\phi(z_i), \qquad \hat C^{(Q,\mathrm{org})}_{XX} = \sum_{i=1}^n \gamma_i\,\psi(x_i)\otimes\psi(x_i),$$
where each weight $\gamma_i$ is given as
$$\gamma_i = \left\langle \phi(z_i), \left(\hat C^{(P)}_{ZZ} + \eta I\right)^{-1}\hat m_\Pi \right\rangle.  (1)$$
Here, $\eta$ is another Tikhonov regularization parameter. Although these estimators are consistent, $\hat C^{(Q,\mathrm{org})}_{XX}$ is not necessarily positive semi-definite, since the weights $\gamma_i$ can be negative. This causes instabilities when inverting the operator. Fukumizu et al. mitigate this by applying another type of Tikhonov regularization, yielding an alternative estimate of $E_Q$,
$$\hat E^{(\mathrm{org})}_{Q,\lambda} = \left(\left(\hat C^{(Q,\mathrm{org})}_{XX}\right)^2 + \lambda I\right)^{-1}\hat C^{(Q,\mathrm{org})}_{XX}\,\hat C^{(Q,\mathrm{org})}_{XZ}.  (2)$$

Density Ratio Estimation: The core idea of our proposed approach is to use importance sampling to estimate $\ell(Q)$. To obtain the weights, we need to estimate the density ratio $r_0(z) = \pi(z)/p(z)$. We may use any density ratio estimator, as long as it is computable from the data $\{z_i\}_{i=1}^n$ and the prior embedding estimator $\hat m_\Pi$.
We focus here on the Kernel-based unconstrained Least-Squares Importance Fitting (KuLSIF) estimator (Kanamori et al., 2012), which is obtained by minimizing
$$\tilde r_\eta = \operatorname*{argmin}_{r \in \mathcal H_{\mathcal Z}}\; \frac{1}{2}\left\langle r, \hat C^{(P)}_{ZZ} r\right\rangle - \left\langle r, \hat m_\Pi\right\rangle + \frac{\eta}{2}\|r\|^2.$$
The estimator $\hat r(z)$ is obtained by truncating the solution $\tilde r_\eta$ at zero: $\hat r(z) = \max(0, \tilde r_\eta(z))$. Interestingly, the KuLSIF estimate at a data point $z_i$ can be written $\hat r(z_i) = \max(0,\gamma_i)$, where $\gamma_i$ is the weight used in (1). Kanamori et al. (2012) developed a convergence analysis of this KuLSIF estimator in $\ell_2$-norm, based on the bracketing entropy (Cucker & Smale, 2001). This analysis, however, is insufficient for the KBR case, and we will establish stronger convergence results under different assumptions in Section 3.2.

3. Importance-Weighted KBR

In this section, we introduce our importance-weighted KBR (IW-KBR) approach and provide a convergence analysis. We also propose a method to learn adaptive features using neural networks, so that the model can learn complex posterior distributions.

3.1. Importance Weighted KBR

Our proposed method minimizes the loss $\ell(Q)$, estimated by importance sampling. Using the density ratio $r_0(z) = \pi(z)/p(z)$, the loss $\ell(Q)$ can be rewritten as
$$\ell(Q)(E) = \mathbb E_P\left[r_0(Z)\,\|\phi(Z) - E^\top\psi(X)\|^2\right].$$
Hence, we can construct the empirical loss with added Tikhonov regularization,
$$\hat\ell(Q,\lambda)(E) = \frac{1}{n}\sum_{i=1}^n \hat r_i\,\|\phi(z_i) - E^\top\psi(x_i)\|^2 + \lambda\|E\|^2,  (3)$$
where $\hat r_i = \hat r(z_i)$ is a non-negative estimate of the density ratio $r_0(z_i)$. Again, the minimizer of $\hat\ell(Q,\lambda)$ can be obtained analytically as
$$\hat E_{Q,\lambda} = \left(\hat C^{(Q)}_{XX} + \lambda I\right)^{-1}\hat C^{(Q)}_{XZ},  (4)$$
$$\hat C^{(Q)}_{XZ} = \sum_{i=1}^n \frac{\hat r_i}{n}\,\psi(x_i)\otimes\phi(z_i), \qquad \hat C^{(Q)}_{XX} = \sum_{i=1}^n \frac{\hat r_i}{n}\,\psi(x_i)\otimes\psi(x_i).$$
Note that $\hat C^{(Q)}_{XX}$ is always positive semi-definite, since $\hat r_i$ is non-negative by definition. Using the KuLSIF estimator, this is the truncated weight $\hat r_i = \max(0,\gamma_i)$ described previously. Given the estimated conditional operator $\hat E_{Q,\lambda}$ in (4), we can estimate the conditional embedding $m_Q(Z|X)(\bar x)$ as $\hat m_Q(Z|X)(\bar x) = (\hat E_{Q,\lambda})^\top\psi(\bar x)$, as shown in Algorithm 1.
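Algorithm 1 can be sketched end to end in NumPy: the second step computes the truncated KuLSIF weights from the finite-sample solution $\hat r = \max(0,\, n(G_Z + n\eta I)^{-1}g_\Pi)$, and the third step computes the posterior-embedding weights, whose Gram-matrix form $w = D(G_X D + n\lambda I)^{-1}g_{\bar x}$ follows from (4). This is a minimal illustration, not the authors' implementation; the Gaussian kernels, unit bandwidths, toy data, and the empirical estimate of the prior embedding from prior samples are all assumptions.

```python
import numpy as np

def gaussian_gram(A, B, bw=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

def iw_kbr(X, Z, Z_prior, x_bar, eta=0.1, lam=0.01):
    """Sketch of Algorithm 1: returns weights w with
    m_hat(Z | X = x_bar) = sum_i w_i * phi(z_i)."""
    n = X.shape[0]
    G_X, G_Z = gaussian_gram(X, X), gaussian_gram(Z, Z)
    # Step 2: truncated KuLSIF importance weights, with the prior embedding
    # estimated empirically from samples Z_prior of the prior
    g_pi = gaussian_gram(Z, Z_prior).mean(axis=1)
    r_hat = np.maximum(0.0, n * np.linalg.solve(G_Z + n * eta * np.eye(n), g_pi))
    # Step 3: importance-weighted posterior-embedding weights
    D = np.diag(r_hat)
    g_x = gaussian_gram(X, x_bar[None, :])[:, 0]
    return D @ np.linalg.solve(G_X @ D + n * lam * np.eye(n), g_x)

rng = np.random.default_rng(0)
Z = rng.normal(size=(150, 1))
X = Z + 0.1 * rng.normal(size=(150, 1))
Z_prior = rng.normal(size=(150, 1))   # prior equal to p(z) in this toy check
w = iw_kbr(X, Z, Z_prior, np.array([0.3]))
# with a linear kernel on Z, the posterior mean estimate is w @ Z
```

When the prior coincides with $p(z)$, as in this toy check, the posterior mean estimate $w^\top Z$ should roughly recover $\mathbb E[Z|X=0.3] \approx 0.3$.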
As illustrated, the posterior mean embedding $\hat m_Q(Z|X)$ is represented by a weighted sum over the same RKHS features $\phi(z) = k_{\mathcal Z}(z,\cdot)$ used in the density ratio estimator $\hat r$. We remark that this need not be the case, however: the weights $w$ could be used to obtain a posterior mean embedding over a different feature space to that used in computing $\hat r$, simply by substituting the desired feature map $\tilde\phi(z_i)$ for $\phi(z_i)$ in the sum in the third step. For example, we could use a Gaussian kernel to compute the density ratio $\hat r$, and then a linear kernel $\tilde\phi(z) = z$ to estimate the posterior mean, $\hat{\mathbb E}_Q[Z|X=\bar x] = \sum_{i=1}^n w_i z_i$.

Algorithm 1: Importance Weighted Kernel Bayes Rule
Input: Samples $\{(x_i,z_i)\}_{i=1}^n \sim P$, estimated prior embedding $\hat m_\Pi$, regularization parameters $(\eta,\lambda)$, conditioning point $\bar x$.
1: Compute Gram matrices $G_X, G_Z \in \mathbb R^{n\times n}$: $(G_X)_{ij} = k_{\mathcal X}(x_i,x_j)$, $(G_Z)_{ij} = k_{\mathcal Z}(z_i,z_j)$.
2: Compute the KuLSIF weights $\hat r = (\hat r_1,\dots,\hat r_n)^\top \in \mathbb R^n$ as $\hat r = \max\left(0,\; n(G_Z + n\eta I)^{-1}g_\Pi\right)$, where $\max$ operates element-wise and $(g_\Pi)_i = \langle\hat m_\Pi, \phi(z_i)\rangle$.
3: Estimate $\hat m_Q(Z|X)(\bar x) = \sum_{i=1}^n w_i\phi(z_i)$, with weights $w = (w_1,\dots,w_n)^\top$ given as $w = D\left(G_X D + n\lambda I\right)^{-1}g_{\bar x}$, where $D = \mathrm{diag}(\hat r) \in \mathbb R^{n\times n}$ and $(g_{\bar x})_i = k_{\mathcal X}(x_i,\bar x)$.

The computational complexity of Algorithm 1 is $O(n^3)$, which is the same complexity as ordinary kernel ridge regression. This can be accelerated by using random Fourier features (Rahimi & Recht, 2008) or the Nyström method (Williams & Seeger, 2001).

3.2. Convergence Analysis

In this section, we analyze $\hat E_{Q,\lambda}$ in (4). First, we state a few regularity assumptions.

Assumption 3.1. The kernels $k_{\mathcal X}$ and $k_{\mathcal Z}$ are continuous and bounded: $\|\phi(z)\| \le \kappa_1 < \infty$ and $\|\psi(x)\| \le \kappa_2 < \infty$ hold almost surely.

Assumption 3.2. We have $r_0 \in \mathcal H_{\mathcal Z}$, and there exists $g_1 \in \mathcal H_{\mathcal Z}$ such that $r_0 = (C^{(P)}_{ZZ})^{\beta_1}g_1$ and $\|g_1\| \le \zeta_1$, for given $\beta_1 \in (0,1/2]$ and $\zeta_1 < \infty$.

Assumption 3.3. There exists $\Gamma \in \mathcal H_{\mathcal X} \otimes \mathcal H_{\mathcal Z}$ such that $E_Q = (C^{(Q)}_{XX})^{\beta_2}\Gamma$ and $\|\Gamma\| \le \zeta_2$, for given $\beta_2 \in (0,1/2]$ and $\zeta_2 < \infty$.

Assumption 3.1 is standard for mean embeddings, and many widely used kernel functions satisfy it, including the Gaussian kernel and the Matérn kernel.
Assumptions 3.2 and 3.3 assure the smoothness of the density ratio $r_0$ and the conditional operator $E_Q$, respectively. See Smale & Zhou (2007) and Caponnetto & Vito (2007) for a discussion of the first assumption, and Singh et al. (2019, Hypothesis 5) for the second. We will further discuss the implications of Assumption 3.2 using the ratio of two Gaussian distributions in Appendix A. Note that Assumption 3.2, also used in Fukumizu et al. (2013), should be treated with care. For example, $r_0 \in \mathcal H_{\mathcal Z}$ imposes $\sup_{z\in\mathcal Z} r_0(z) \le \|r_0\|\sup_{z\in\mathcal Z}\|\phi(z)\| < \infty$, which is violated when we consider the ratio of two Gaussian distributions with different means and the same variance.

We now show that the KuLSIF estimator converges in the infinity norm $\|\cdot\|_\infty$.

Theorem 3.4. Suppose Assumptions 3.1 and 3.2 hold. Given data $\{(x_i,z_i)\} \sim P$ and an estimated prior embedding $\hat m_\Pi$ such that $\|\hat m_\Pi - \mathbb E_\Pi[\phi(Z)]\| \le O_p(n^{-\alpha_1})$ for $\alpha_1 \in (0,1/2]$, by setting $\eta = O(n^{-\frac{\alpha_1}{\beta_1+1}})$, we have
$$\|r_0 - \hat r\|_\infty \le O_p\left(n^{-\frac{\alpha_1\beta_1}{\beta_1+1}}\right).$$
The proof is given in Appendix B.1. This result differs from the analysis in Kanamori et al. (2012), which establishes convergence of the KuLSIF estimator in $\ell_2$-norm, based on the bracketing entropy (Cucker & Smale, 2001). Our result cannot be directly compared with theirs, due to the differences in the assumptions made and in the norms used to measure convergence (in particular, we require convergence in the infinity norm). Establishing a relation between the two results represents an interesting research direction, though it is out of the scope of this paper. Theorem 3.4 can then be used to obtain the convergence rate of the covariance operators.

Corollary 3.5. Under the same conditions as Theorem 3.4, by setting $\eta = O(n^{-\frac{\alpha_1}{\beta_1+1}})$, we have
$$\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\| \le O_p\left(n^{-\frac{\alpha_1\beta_1}{\beta_1+1}}\right).$$
The cross-covariance operator $\hat C^{(Q)}_{XZ}$ also converges at the same rate. This rate is slower than that of the original KBR estimator, however, which satisfies $\|\hat C^{(Q,\mathrm{org})}_{XX} - C^{(Q)}_{XX}\| \le O_p(n^{-\frac{\alpha_1(2\beta_1+1)}{2\beta_1+2}})$ (Fukumizu et al., 2013).
This is inevitable, since the original KBR estimator uses the optimal weights $\gamma_i$ to estimate $C^{(Q)}_{XX}$, while our estimator uses the truncated weights $\hat r_i = \max(0,\gamma_i)$, which introduces a bias into the estimation. Given our consistency result for $\hat C^{(Q)}_{XX}$ and $\hat C^{(Q)}_{XZ}$, we can show that the estimated conditional operator is also consistent.

Theorem 3.6. Suppose Assumptions 3.1 and 3.3 hold. Given data $\{(x_i,z_i)\} \sim P$ and estimated covariance operators such that $\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\| \le O_p(n^{-\alpha_2})$ and $\|\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\| \le O_p(n^{-\alpha_2})$, by setting $\lambda = O(n^{-\frac{\alpha_2}{\beta_2+1}})$, we have
$$\|\hat E_{Q,\lambda} - E_Q\| \le O_p\left(n^{-\frac{\alpha_2\beta_2}{\beta_2+1}}\right).$$
The proof is in Appendix B.1. This rate is faster than that of the original KBR estimator (Fukumizu et al., 2013), which satisfies $\|\hat E^{(\mathrm{org})}_{Q,\lambda} - E_Q\| \le O_p(n^{-\frac{2\alpha_2\beta_2}{2\beta_2+5}})$. This is the benefit of avoiding the regularization of the form (2) and using instead (4). Given Corollary 3.5 and Theorem 3.6, we can thus show that
$$\|\hat E_{Q,\lambda} - E_Q\| \le O_p\left(n^{-\frac{\alpha_1\beta_1\beta_2}{(\beta_1+1)(\beta_2+1)}}\right),$$
while Fukumizu et al. (2013) obtained
$$\left\|\hat E^{(\mathrm{org})}_{Q,\lambda} - E_Q\right\| \le O_p\left(n^{-\frac{2\alpha_1(2\beta_1+1)\beta_2}{(2\beta_1+2)(2\beta_2+5)}}\right).$$
The approach that yields the better overall rate depends on the smoothness parameters $(\beta_1,\beta_2)$. Our approach converges faster than the original KBR when the density ratio $r_0$ is smooth (i.e. $\beta_1$ close to $1/2$) and the conditional operator is less smooth (i.e. $\beta_2$ close to $0$). Note further that Sugiyama et al. (2008) show, even when $r_0 \notin \mathcal H_{\mathcal Z}$, that the KuLSIF estimator $\hat r$ converges to the element of the RKHS $\mathcal H_{\mathcal Z}$ with the least $\ell_2$ error. We conjecture that our method might thus be robust to misspecification of the density ratio model, although this remains a topic for future research (in particular, our proof requires consistency of the form in Theorem 3.4).

3.3. Learning Adaptive Features in KBR

Although kernel methods benefit from strong theoretical guarantees, a restriction to RKHS features limits our scope and flexibility, since this requires pre-specified feature maps. Empirically, poor performance can result in cases where the observable variables are high-dimensional (e.g.
images), or have highly nonlinear relationships with the latents. Learned, adaptive neural network features have previously been used in place of kernel features when performing inference on mean embeddings in causal modeling (Xu et al., 2021a;b). Inspired by this work, we propose to employ adaptive NN observation features in KBR.

Recall that we learn the conditional operator $\hat E_{Q,\lambda}$ by minimizing the loss $\hat\ell(Q,\lambda)$ defined in (3). We propose to jointly learn the feature map $\psi_\theta$ with the conditional operator $\hat E_{Q,\lambda}$ by minimizing the same loss,
$$\hat\ell(Q,\lambda)(E,\theta) = \frac{1}{n}\sum_{i=1}^n \hat r_i\,\|\phi(z_i) - E^\top\psi_\theta(x_i)\|^2 + \lambda\|E\|^2,$$
where $\psi_\theta : \mathcal X \to \mathbb R^d$ is the $d$-dimensional adaptive feature map represented by a neural network parameterized by $\theta$. As in the kernel feature case, the optimal operator can be obtained from (4) for a given value of $\theta$. From this, we can write the loss for $\theta$ as
$$\ell(\theta) = \min_E \hat\ell(Q,\lambda)(E,\theta) = \mathrm{tr}\left[G_Z\left(D - D\Psi_\theta\left(\Psi_\theta^\top D\Psi_\theta + \lambda I\right)^{-1}\Psi_\theta^\top D\right)\right],$$
where $G_Z$ is the Gram matrix $(G_Z)_{ij} = k_{\mathcal Z}(z_i,z_j)$, $D$ is the diagonal matrix of estimated importance weights $D = \mathrm{diag}(\hat r_1,\dots,\hat r_n)$, and $\Psi_\theta \in \mathbb R^{n\times d}$ is the feature matrix whose rows are $\psi_\theta(x_1)^\top,\dots,\psi_\theta(x_n)^\top$ (constant factors that do not affect the minimizer are absorbed into $\lambda$). See Appendix B for the derivation. The loss $\ell(\theta)$ can be minimized using gradient-based optimization methods. Given the learned parameters $\hat\theta = \operatorname{argmin}_\theta \ell(\theta)$ and the conditioning point $\bar x$, we can estimate the posterior embedding $\hat m_Q(Z|X)(\bar x) = \sum_{i=1}^n w_i\phi(z_i)$ with weights $w = (w_1,\dots,w_n)^\top$ given by
$$w = D\Psi_{\hat\theta}\left(\Psi_{\hat\theta}^\top D\Psi_{\hat\theta} + \lambda I\right)^{-1}\psi_{\hat\theta}(\bar x).$$
Note that this corresponds to using a linear kernel on the learned features, $k_{\mathcal X}(x,x') = \psi_{\hat\theta}(x)^\top\psi_{\hat\theta}(x')$, in Algorithm 1. In experiments, we further employ a finite-dimensional random Fourier feature approximation of $\phi(z)$ to speed up computation (Rahimi & Recht, 2008).

4. Kernel Bayes Filter

We next describe an important use-case for KBR, namely the Kernel Bayes Filter (KBF) (Fukumizu et al., 2013).
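The closed-form profile loss for $\theta$ can be checked numerically against direct minimization over $E$. The sketch below uses a linear kernel on $Z$ and a fixed random matrix in place of a trained feature network, and keeps the explicit $n\lambda$ scaling from the loss (3); all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 60, 5, 3
Psi = rng.normal(size=(n, d))       # stand-in for adaptive features psi_theta(x_i)
Zf = rng.normal(size=(n, p))        # linear features phi(z_i) = z_i
r_hat = rng.uniform(0.2, 2.0, n)    # stand-in for estimated importance weights
D = np.diag(r_hat)
G_Z = Zf @ Zf.T                     # Gram matrix of the (linear) Z-kernel
lam = 0.1

# Closed-form value of min_E (1/n) sum_i r_i ||phi(z_i) - E^T psi(x_i)||^2 + lam ||E||^2
inner = np.linalg.solve(Psi.T @ D @ Psi + n * lam * np.eye(d), Psi.T @ D)
loss_theta = np.trace(G_Z @ (D - D @ Psi @ inner)) / n

# Direct check: solve the weighted ridge regression for E explicitly
E_opt = (np.linalg.solve(Psi.T @ D @ Psi + n * lam * np.eye(d), Psi.T @ D @ Zf)).T
resid = Zf - Psi @ E_opt.T
loss_direct = (r_hat * (resid**2).sum(axis=1)).sum() / n + lam * (E_opt**2).sum()
```

The two values agree to numerical precision, confirming that the trace expression is the profiled objective one would backpropagate through when training $\theta$.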
Consider the following time-invariant state-space model with observable variables $X_t$ and hidden state variables $Z_t$:
$$p(x_{1:T}, z_{1:T}) = p(x_1|z_1)\,p(z_1)\prod_{t=2}^T p(x_t|z_t)\,p(z_t|z_{t-1}).$$
Here, $X_{s:t}$ denotes $\{X_s,\dots,X_t\}$. Given this state-space model, the filtering problem aims to infer sequentially the distributions $p(z_t|x_1,\dots,x_t)$. Classically, filtering is solved by the Kalman filter, one of its nonlinear extensions (the extended Kalman filter (EKF) or unscented Kalman filter (UKF)), or a particle filter (Särkkä, 2013). These methods require knowledge of $p(x_t|z_t)$ and $p(z_t|z_{t-1})$, however. KBF does not require knowing these distributions, and learns them from samples $(X_{1:T}, Z_{1:T})$ of both observable and hidden variables in a training phase.

Given a test sequence $\{\bar x_1,\dots,\bar x_{\bar T}\}$, KBF sequentially applies KBR to obtain the kernel embedding $m_{Z_t|\bar x_{1:t}}$, where $m_{Z_t|\bar x_{1:s}}$ denotes the embedding of the posterior distribution $P(z_t|X_{1:s} = \bar x_{1:s})$. This can be obtained by iterating the following two steps. Assume that we have the embedding $m_{Z_t|\bar x_{1:t}}$. Then, we can compute the embedding of the forward prediction $m_{Z_{t+1}|\bar x_{1:t}}$ by
$$m_{Z_{t+1}|\bar x_{1:t}} = (E_{Z_{t+1}|Z_t})^\top m_{Z_t|\bar x_{1:t}},$$
where $E_{Z_{t+1}|Z_t}$ is the conditional operator for $P(z_{t+1}|z_t)$. Empirically, this is estimated from the data $\{(z_t, z_{t+1})\}$,
$$\hat E_{Z_{t+1}|Z_t} = (\hat C_{\mathrm{prev},\mathrm{prev}} + \lambda' I)^{-1}\hat C_{\mathrm{prev},\mathrm{post}},$$
where $\lambda'$ is a regularizing coefficient and
$$\hat C_{\mathrm{prev},\mathrm{prev}} = \frac{1}{T-1}\sum_{t=1}^{T-1}\phi(z_t)\otimes\phi(z_t), \qquad \hat C_{\mathrm{prev},\mathrm{post}} = \frac{1}{T-1}\sum_{t=1}^{T-1}\phi(z_t)\otimes\phi(z_{t+1}).$$
Given the probability of the forward prediction $P(z_{t+1}|x_{1:t})$, we can obtain $P(z_{t+1}|x_{1:t+1})$ by applying Bayes rule,
$$p(z_{t+1}|x_{1:t+1}) \propto p(x_{t+1}|z_{t+1})\,p(z_{t+1}|x_{1:t}).$$
Hence, we can obtain the filtering embedding $m_{Z_{t+1}|\bar x_{1:t+1}}$ at the next timestep by applying KBR with prior embedding $m_\Pi = m_{Z_{t+1}|\bar x_{1:t}}$ and samples $(X_{1:T}, Z_{1:T})$, at the conditioning point $\bar x = \bar x_{t+1}$. By repeating this process, we can conduct filtering for the entire sequence $\bar x_{1:\bar T}$.
Empirically, the estimated embedding $\hat m_{Z_t|\bar x_{1:s}}$ is represented by a linear combination of features $\phi(z_t)$, as $\hat m_{Z_t|\bar x_{1:s}} = \sum_{i=1}^T w^{(s,t)}_i\phi(z_i)$. The update equations for the weights $w^{(s,t)}$ are summarized in Algorithm 2.

Algorithm 2: Kernel Bayes Filter
Input: Training sequence $\{(x_t,z_t)\}_{t=1}^T$, regularization parameters $(\eta,\lambda,\lambda')$, test sequence $\{\bar x_t\}_{t=1}^{\bar T}$.
1: Initialize $w^{(0,1)}$.
2: for $t = 1$ to $\bar T$ do
3: Compute $w^{(t,t)}$ by Algorithm 1 with data $\{(x_i,z_i)\}_{i=1}^T$, prior embedding $\hat m_\Pi = \sum_{i=1}^T w^{(t-1,t)}_i\phi(z_i)$, and conditioning point $\bar x = \bar x_t$.
4: Compute $w^{(t,t+1)}$ with $w^{(t,t+1)}_1 = 0$ and
$$w^{(t,t+1)}_{2:T} = \left(G_{Z_{-1}} + (T-1)\lambda' I\right)^{-1}\tilde G_{Z_{-1}}\, w^{(t,t)},$$
where $(G_{Z_{-1}})_{ij} = k_{\mathcal Z}(z_i,z_j) \in \mathbb R^{(T-1)\times(T-1)}$ for $i,j = 1,\dots,T-1$, and $(\tilde G_{Z_{-1}})_{ij} = k_{\mathcal Z}(z_i,z_j) \in \mathbb R^{(T-1)\times T}$ for $i = 1,\dots,T-1$ and $j = 1,\dots,T$.
5: end for

If we have any prior knowledge of $P(z_1)$, we can initialize $w^{(0,1)}$ accordingly. Otherwise, we can set $w^{(0,1)} = (1/T,\dots,1/T)^\top$; this initialization is used in the experiments.

5. Related Work on Neural Filtering

Several recent methods have been proposed that combine state-space models with neural networks (Krishnan et al., 2015; Klushyn et al., 2021; Rangapuram et al., 2018), aiming to learn the latent dynamics and the observation model from observed sequences $X_{1:T}$ alone. These approaches assume a parametric form for the latent dynamics: for example, the Deep Kalman filter (Krishnan et al., 2015) assumes that the distribution of the latents is Gaussian, with mean and covariance that are nonlinear functions of the previous latent state. Deep SSM (Rangapuram et al., 2018) assumes linear latent dynamics with Gaussian noise, and EKVAE (Klushyn et al., 2021) uses a locally linear Gaussian transition model. These models use variational inference techniques to learn the parameters, which makes it challenging to prove convergence to the true models. In contrast, KBF learns the latent dynamics and observation model nonparametrically from samples, and the accuracy of the filtering is guaranteed by the convergence of KBR.
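The forward-prediction update (step 4 of Algorithm 2) can be sketched as follows. The Gaussian kernel, unit bandwidth, and the uniform stand-in for the filtering weights $w^{(t,t)}$ are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(A, B, bw=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

def kbf_predict(Z, w_tt, lam_p=0.1):
    """One forward-prediction step: map the filtering weights w^(t,t)
    to the one-step-ahead prior weights w^(t,t+1)."""
    T = Z.shape[0]
    G_prev = gaussian_gram(Z[:-1], Z[:-1])   # (T-1) x (T-1), over z_1..z_{T-1}
    G_tilde = gaussian_gram(Z[:-1], Z)       # (T-1) x T, against all z_1..z_T
    w_next = np.zeros(T)                     # first entry stays 0
    w_next[1:] = np.linalg.solve(G_prev + (T - 1) * lam_p * np.eye(T - 1),
                                 G_tilde @ w_tt)
    return w_next

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))     # toy training latents z_1..z_T
w_tt = np.full(50, 1.0 / 50)     # uniform filtering weights as a stand-in
w_next = kbf_predict(Z, w_tt)
```

The returned weights place the predicted embedding on the features $\phi(z_2),\dots,\phi(z_T)$, since only those latents appear as successors in the training transitions.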
6. Experiments

In this section, we empirically investigate the performance of our KBR estimator in a variety of settings, including the problem of learning the posterior mean proposed in Fukumizu et al. (2013), as well as challenging filtering problems where the observations are high-dimensional images.

6.1. Learning Posterior Mean From Samples

We revisit here the problem introduced by Fukumizu et al. (2013) of learning the posterior mean from samples. Let $X \in \mathbb R^d$ and $Z \in \mathbb R^d$ be generated from $P(X,Z) = \mathcal N\left((\mathbf 1_d^\top, \mathbf 0_d^\top)^\top, V\right)$, where $\mathbf 1_d = (1,\dots,1)^\top \in \mathbb R^d$, $\mathbf 0_d = (0,\dots,0)^\top \in \mathbb R^d$, and $V = \frac{1}{2d}A^\top A + 2I$, with each component of $A \in \mathbb R^{2d\times 2d}$ sampled from the standard Gaussian distribution on each run. The prior is set to $\Pi(Z) = \mathcal N(\mathbf 0_d, V_{ZZ}/2)$, where $V_{ZZ}$ is the $Z$-component of $V$. We construct the prior embedding using 200 samples from $\Pi(Z)$, and we aim to predict the posterior mean $\mathbb E_Q[Z|X=\bar x]$ using $n = 200$ data points from $P(X,Z)$, as the dimension $d$ varies.

Figure 1 summarizes the MSEs over 30 runs, where the conditioning points $\bar x$ are sampled from $\mathcal N(\mathbf 0_d, V_{XX})$. Here, "Original" denotes the performance of the original KBR estimator using the operator $\hat E^{(\mathrm{org})}_{Q,\lambda}$ in (2), while "IW" is computed with the IW-KBR approach, which uses the operator $\hat E_{Q,\lambda}$ in (4). In this experiment, we set $\eta = \lambda = 0.2$ and used Gaussian kernels for both KBR methods, with the bandwidth given by the median trick. Unsurprisingly, the error increases as the dimension increases. However, the IW-KBR estimator performs significantly better than the original KBR estimator. This illustrates the robustness of the IW-KBR approach even when the model is misspecified, since the correct $\mathbb E_Q[Z|X]$, which is a linear function of $X$, does not belong to the Gaussian RKHS.

To show how the quality of the density ratio estimate relates to overall performance, we include results for IW-KBR with the true density ratio $r_0$, denoted "IW-True Ratio" in Figure 1. Surprisingly, IW-True Ratio performs worse than IW-KBR with the KuLSIF-estimated density ratio.
This suggests that the true density ratio yields a suboptimal bias/variance tradeoff, with greater variance in the finite-sample estimate of the KBR posterior, compared with the smoothed estimate obtained from KuLSIF.

6.2. Low-Dimensional KBF

We consider a filtering problem introduced by Fukumizu et al. (2013), with latent $Z_t \in \mathbb R^2$ and observation $X_t \in \mathbb R^2$. The dynamics of the latent $Z_t = (u_t, v_t)$ are given as follows. Let $\theta_t \in [0,2\pi]$ be the angle satisfying
$$\cos\theta_t = \frac{u_t}{\sqrt{u_t^2 + v_t^2}}, \qquad \sin\theta_t = \frac{v_t}{\sqrt{u_t^2 + v_t^2}}.$$
The next latent $Z_{t+1} = (u_{t+1}, v_{t+1})$ is then
$$\begin{pmatrix}u_{t+1}\\ v_{t+1}\end{pmatrix} = (1 + \beta\sin(M\theta_t))\begin{pmatrix}\cos(\theta_t+\omega)\\ \sin(\theta_t+\omega)\end{pmatrix} + \epsilon_Z,  (5)$$
for given parameters $(\beta, M, \omega)$. The observation $X_t$ is given by $X_t = Z_t + \epsilon_X$. Here, $\epsilon_X$ and $\epsilon_Z$ are noise variables sampled from $\epsilon_X \sim \mathcal N(0,\sigma_X^2 I)$ and $\epsilon_Z \sim \mathcal N(0,\sigma_Z^2 I)$. Fukumizu et al. test performance using the Rotation dynamics ($\omega = 0.3$, $\beta = 0$) and the Oscillatory dynamics ($\omega = 0.4$, $\beta = 0.4$, $M = 8$), where the noise level is set to $\sigma_X = \sigma_Z = 0.2$ in both scenarios.

Using the same dynamics, we evaluate the performance of our proposed estimator by the MSE in predicting $Z_t$ from $\bar X_{1:t}$, where the length of the test sequence is set to 200. We repeated the experiments 30 times for each setting. Results are summarized in Figure 2: "Original" denotes the results for the original KBR approach in Algorithm 2, while "IW" uses our IW-KBR approach. For both approaches, we used Gaussian kernels $k_{\mathcal X}, k_{\mathcal Z}$ whose bandwidths are set to $\beta_b D_X$ and $\beta_b D_Z$, respectively. Here, $D_X$ and $D_Z$ are the medians of the pairwise distances among the training samples. We used the KuLSIF leave-one-out cross-validation procedure (Kanamori et al., 2012) to tune the regularization parameter $\eta$, and set $\lambda = \eta$. This leaves two parameters to be tuned: the bandwidth scaling parameter $\beta_b$ and the regularization parameter $\lambda'$. These are selected using the last 200 steps of the training sequence as a validation set. We also include results for the extended Kalman filter (EKF) and the particle filter (Särkkä, 2013).
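The dynamics above can be simulated in a few lines. This is a sketch under the stated parameters; the initial state and random seed are arbitrary choices.

```python
import numpy as np

def simulate(T, omega=0.4, beta=0.4, M=8, sigma_X=0.2, sigma_Z=0.2, seed=0):
    """Simulate the rotation/oscillatory dynamics of Fukumizu et al. (2013):
    Z_{t+1} = (1 + beta*sin(M*theta_t)) * (cos(theta_t + omega),
                                           sin(theta_t + omega)) + eps_Z,
    X_t = Z_t + eps_X."""
    rng = np.random.default_rng(seed)
    Z = np.zeros((T, 2))
    Z[0] = [1.0, 0.0]                       # arbitrary initial state on the unit circle
    for t in range(T - 1):
        theta = np.arctan2(Z[t, 1], Z[t, 0])        # angle of the current latent
        radius = 1.0 + beta * np.sin(M * theta)
        Z[t + 1] = radius * np.array([np.cos(theta + omega),
                                      np.sin(theta + omega)])
        Z[t + 1] += sigma_Z * rng.normal(size=2)
    X = Z + sigma_X * rng.normal(size=(T, 2))
    return X, Z

X, Z = simulate(200)   # Oscillatory dynamics; beta=0 gives the Rotation dynamics
```

Setting `beta=0` recovers the Rotation dynamics, where the latent simply circles the origin at angular velocity `omega`.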
Figure 2 shows that the EKF and the particle filter perform slightly better than the KBF methods under the Rotation dynamics, which replicates the results in Fukumizu et al. This is not surprising, since these methods have access to the true dynamics, which makes tracking easier. Under the Oscillatory dynamics, however, which have stronger nonlinearity, KBF displays comparable or better performance than the EKF, which suffers from a large error caused by its linear approximation. In both scenarios, IW-KBR slightly outperforms the original KBR.

6.3. High-Dimensional KBF

Finally, we apply KBF to scenarios where the observations are given by images. We set up two experiments: one uses high-dimensional complex images while the latent follows simple dynamics, while the other considers complex dynamics with observations given by relatively simple images. In both cases, the particle filter performs significantly worse than the methods based on neural networks, since likelihood evaluation is unstable in the high-dimensional observation space, despite the true dynamics being available.

DeepMind Lab Video: The first high-dimensional KBF experiment uses DeepMind Lab (Beattie et al., 2016), which is a 3D game-like platform for reinforcement learning. This platform provides a first-person viewpoint through the eyes of the agent; an example image can be found in Figure 3. Based on this platform, we introduce a filtering problem to estimate the agent's orientation at a specific point in the maze.

Figure 1. MSE as a function of dimension for posterior mean prediction.
Figure 2. Results for low-dimensional KBF experiments. The asterisks indicate the level of statistical significance (*: p < 0.05, **: p < 0.01, ***: p < 0.001, ****: p < 0.0001), tested by the Wilcoxon signed-rank test.
Figure 3. DeepMind Lab.  Figure 4. dSprites.

The dynamics are as follows.
Let $\theta_t \in [0, 2\pi)$ be the direction that the agent is facing at time $t$. The next direction is
$$\theta_{t+1} = \theta_t + \omega + \epsilon_\theta \pmod{2\pi},$$
where the noise $\epsilon_\theta \sim N(0, \sigma_\theta^2)$. Let $Z_t = (\cos\theta_t, \sin\theta_t)$, and let $X_t$ be the image that corresponds to the direction $\theta_t + \epsilon_X$, where $\epsilon_X \sim N(0, \sigma_X^2)$. We used the RGB images downscaled to $36 \times 48$ pixels, which makes the observation $36 \times 48 \times 3 = 5184$-dimensional. We added Gaussian noise $N(0, 0.01)$ to each dimension of the observations $X_t$. We ran the experiments with $\omega = 0.4, \sigma_\theta = 0.2, \sigma_X = 0.2$; MSEs over 30 runs are summarized in Figure 5.

Figure 5. Results for the DeepMind Lab setting (Original, IW, IW(NN), LSTM; MSE vs. length of training sequence T).

In addition to Original and IW, whose hyper-parameters are selected by the same procedure as in the low-dimensional experiments, we introduce two neural-network-based methods: IW(NN) and LSTM. IW(NN) uses the adaptive features discussed in Section 3.3. Here, instead of learning different features for each timestep, which would be time-consuming and redundant, we learn an adaptive feature $\psi_\theta$ by minimizing
$$\hat{\ell}^{(P,\lambda)}(E, \theta) = \frac{1}{n}\sum_{i=1}^{n} \left\| \phi(z_i) - E\psi_\theta(x_i) \right\|^2 + \lambda \|E\|^2,$$
with the learned parameter $\hat\theta$ being used for all timesteps. LSTM denotes an LSTM baseline (Hochreiter & Schmidhuber, 1997), which predicts $Z_t$ from features extracted from $X_{t-10:t}$. To make the comparison fair, the LSTM uses the same network architecture for the feature extractor as IW(NN). As in the low-dimensional cases, IW consistently performs better than Original for all sequence lengths (Figure 5). However, the baseline LSTM performs similarly to IW, even though it does not explicitly model the dynamics. This is because functions in the RKHS are not expressive enough to model the relationship between the direction and the images. This limitation is mitigated by using adaptive features in IW(NN), which outperforms the other methods by taking advantage of the strong expressive power of neural networks.
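The adaptive-feature objective above can be sketched concretely. In the paper, $\psi_\theta$ is a neural network trained jointly with $E$ by gradient descent; the sketch below instead fixes the features as plain matrices so that everything is self-contained, and the `ridge_E` helper (the closed-form minimizer over $E$ for fixed features) is our own addition for illustration.

```python
import numpy as np

def adaptive_feature_loss(E, psi, phi_z, lam):
    """The regularized regression loss from the text:
    (1/n) sum_i ||phi(z_i) - E psi_theta(x_i)||^2 + lam ||E||_F^2.

    E     : (d_phi, d_psi) linear map from observation features to latent features
    psi   : (n, d_psi) features psi_theta(x_i) of the observations
    phi_z : (n, d_phi) features phi(z_i) of the latents
    """
    resid = phi_z - psi @ E.T                  # (n, d_phi) residuals
    return np.mean(np.sum(resid**2, axis=1)) + lam * np.sum(E**2)

def ridge_E(psi, phi_z, lam):
    """For a fixed feature map, the minimizer over E is a ridge regression."""
    n, d = psi.shape
    A = psi.T @ psi / n + lam * np.eye(d)
    return np.linalg.solve(A, psi.T @ phi_z / n).T
```

Training IW(NN) then amounts to alternating (or jointly optimizing) this loss over $\theta$ as well, with the ridge solution giving the optimal $E$ at each step.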
dSprites Setting: The second high-dimensional KBF experiment uses dSprites (Matthey et al., 2017), a dataset of 2D shape images, as shown in Figure 4. Here, the latent $Z_t$ follows the same dynamics as (5), and the model observes the image $X_t$ in which the position of the shape corresponds to the noisy observation $Z_t + \epsilon_X$ of the latent, with $\epsilon_X \sim N(0, \sigma_X^2 I)$. We used the Oscillatory dynamics (i.e. $\omega = 0.4, \beta = 0.4, M = 8$) and set the noise levels to $\sigma_X = 0.05, \sigma_Z = 0.2$. Again, we added Gaussian noise $N(0, 0.01)$ to the images. MSEs across 30 runs are summarized in Figure 6.

Figure 6. Results for the dSprites setting (Original, IW, IW(NN), LSTM; MSE vs. length of training sequence T).

Unlike in the DeepMind Lab setting, LSTM performs the worst here. This suggests the advantage of the KBF methods, which explicitly model the dynamics and exploit them in filtering. Among the KBF methods, IW and IW(NN) perform significantly better than Original, demonstrating the superiority of the IW approach.

7. Conclusion

In this paper, we proposed a novel approach to the kernel Bayes rule, IW-KBR, which minimizes a loss estimated by importance weighting. We established consistency of IW-KBR based on a novel analysis of an RKHS density ratio estimate, which is of independent interest. Our empirical evaluation shows that IW-KBR significantly outperforms the existing estimator in both Bayesian inference and filtering for state-space models. Furthermore, by learning adaptive neural net features, IW-KBR outperforms a neural sequence model in filtering problems with high-dimensional image observations. In future work, we suggest exploring different density ratio estimation techniques for our setting. It is well known in the density ratio estimation literature that the KuLSIF estimator may suffer from high variance. To mitigate this, Yamada et al. (2011) proposed relative density-ratio estimation.
Deriving a consistency result for such an estimator and applying it in the kernel Bayes rule would be a promising direction. It would further be of interest to apply IW-KBR in additional settings, such as approximate Bayesian computation (Tavaré et al., 1997; Marjoram et al., 2003), as also discussed by Fukumizu et al. (2013).

Acknowledgements

This work was supported by the Gatsby Charitable Foundation.

References

Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Hassabis, D., Legg, S., and Petersen, S. DeepMind Lab, 2016.

Boots, B., Gordon, G., and Gretton, A. Hilbert space embeddings of predictive state representations. In Uncertainty in Artificial Intelligence, 2013.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning, 2016.

Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.

Fukumizu, K., Bach, F. R., and Jordan, M. I. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871–1905, 2009.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(82):3753–3783, 2013.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.

Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. Conditional mean embeddings as regressors. In International Conference on Machine Learning, 2012.

Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.

Jasra, A., Singh, S. S., Martin, J. S., and McCoy, E. Filtering via approximate Bayesian computation. Statistics and Computing, 22(6):1223–1237, 2012.

Jitkrittum, W., Gretton, A., Heess, N., Eslami, S., Lakshminarayanan, B., Sejdinovic, D., and Szabó, Z. Kernel-based just-in-time learning for passing expectation propagation messages. In Uncertainty in Artificial Intelligence, 2015.

Kanagawa, M., Nishiyama, Y., Gretton, A., and Fukumizu, K. Filtering with state-observation examples via kernel Monte Carlo filter. Neural Computation, 28(2):382–444, 2016.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(48):1391–1445, 2009.

Kanamori, T., Suzuki, T., and Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.

Klushyn, A., Kurle, R., Soelch, M., Cseke, B., and van der Smagt, P. Latent matters: Learning deep state-space models. In Advances in Neural Information Processing Systems, volume 34, pp. 10234–10245, 2021.

Krishnan, R. G., Shalit, U., and Sontag, D. Deep Kalman filters, 2015.

Marjoram, P., Molitor, J., Plagnol, V., and Tavaré, S. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, 2012.

Muandet, K., Kanagawa, M., Saengkyongam, S., and Marukatat, S. Counterfactual mean embeddings. Journal of Machine Learning Research, 22(162):1–71, 2021.

Nishiyama, Y., Kanagawa, M., Gretton, A., and Fukumizu, K. Model-based kernel sum rule. Machine Learning, 109:939–972, 2020.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20, 2008.

Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, volume 31, 2018.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Särkkä, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

Singh, R., Sahani, M., and Gretton, A. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, volume 32, 2019.

Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, volume 4754, pp. 13–31. Springer, 2007.

Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning, 2009.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.

Szabó, Z., Sriperumbudur, B., Póczos, B., and Gretton, A. Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1–40, 2016.

Tavaré, S., Balding, D. J., Griffiths, R. C., and Donnelly, P. Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518, 1997.

Williams, C. and Seeger, M.
Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, volume 13, 2001.

Xu, L., Chen, Y., Srinivasan, S., de Freitas, N., Doucet, A., and Gretton, A. Learning deep features in instrumental variable regression. In International Conference on Learning Representations, 2021a.

Xu, L., Kanagawa, H., and Gretton, A. Deep proxy causal learning and its application to confounded bandit policy evaluation. In Advances in Neural Information Processing Systems, 2021b.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems, 2011.

A. Implication of Assumption 3.2

In this appendix, we discuss the implication of Assumption 3.2 when we consider the ratio of two Gaussian densities. Let $\Pi = N(0,1)$ and $P = N(0,2)$. Then the density ratio is
$$r_0(z) = \frac{\pi(z)}{p(z)} = \sqrt{2}\exp(-z^2/4),$$
which lies in the RKHS $\mathcal{H}_{\mathcal{Z}}$ induced by the Gaussian kernel $k(z,z') = \exp(-(z-z')^2/4)$. Indeed, $r_0 = \sqrt{2}\,k(\cdot, 0)$, and hence, from the reproducing property, we have $\langle r_0, f \rangle = \sqrt{2}\,f(0)$ for all $f \in \mathcal{H}_{\mathcal{Z}}$. Given this, we can analytically compute the eigendecomposition of the covariance operator as
$$C^{(P)}_{ZZ} = \sum_i \lambda_i \left(\sqrt{\lambda_i}\, e_i\right) \otimes \left(\sqrt{\lambda_i}\, e_i\right),$$
where $\lambda_i = O(B^i)$ for a constant $B < 1$, $e_i$ is the $i$-th order Hermite polynomial, and the $\sqrt{\lambda_i}\, e_i$ are orthonormal in $\mathcal{H}_{\mathcal{Z}}$ (Rasmussen & Williams, 2006, Section 4.3). Assumption 3.2 requires $\|(C^{(P)}_{ZZ})^{-\beta_1} r_0\|^2$ to be finite, meaning that $\beta_1$ is the maximum value satisfying
$$\left\|\left(C^{(P)}_{ZZ}\right)^{-\beta_1} r_0\right\|^2 = \sum_i \lambda_i^{-2\beta_1} \left\langle \sqrt{\lambda_i}\, e_i, r_0 \right\rangle^2 = 2\sum_i \lambda_i^{1-2\beta_1} e_i^2(0) < \infty.$$

B. Proof of Theorems

In this appendix, we provide the proofs of our theorems.

B.1. Proof of the Convergence Analysis

We will rely on the following concentration inequality.

Proposition B.1 (Lemma 2 of Smale & Zhou (2007)). Let $\xi$ be a random variable taking values in a real separable RKHS $\mathcal{H}$ with $\|\xi\| \le M$ almost surely, and let $\xi_1, \ldots, \xi_n$ be i.i.d. random variables distributed as $\xi$. Then, for all $n \in \mathbb{N}$ and $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$\left\|\frac{1}{n}\sum_{i=1}^n \xi_i - E[\xi]\right\| \le \frac{2M\log(2/\delta)}{n} + \sqrt{\frac{2E[\|\xi\|^2]\log(2/\delta)}{n}}.$$

B.1.1.
PROOF OF THEOREM 3.4 AND COROLLARY 3.5

We first review some properties of the density ratio $r_0$ used in KuLSIF.

Lemma B.2 (Kanamori et al. (2009)). If the density ratio satisfies $r_0 \in \mathcal{H}_{\mathcal{Z}}$, then we have $C^{(P)}_{ZZ} r_0 = m_\Pi$, where $m_\Pi = E_\Pi[\phi(Z)]$.

Proof. From the reproducing property, we have $C^{(P)}_{ZZ} r_0 = E_P[r_0(Z)\phi(Z)] = E_\Pi[\phi(Z)]$. The last equality holds from the definition of the density ratio $r_0(z) = \pi(z)/p(z)$.

Given Lemma B.2, we can bound the error of the untruncated KuLSIF estimator in RKHS norm. Let $\tilde r_\eta$ be
$$\tilde r_\eta = \left(\hat C^{(P)}_{ZZ} + \eta I\right)^{-1} \hat m_\Pi.$$
Note that the weight $\gamma_i$ appearing in the original KBR (1) can be understood as the value of $\tilde r_\eta$ at specific data points: $\gamma_i = \tilde r_\eta(z_i)$. Hence, the weight we use can be written as $\hat r_i = \max(0, \tilde r_\eta(z_i))$. Although we use truncated weights in KBR, we first derive an upper bound on the error $\|r_0 - \tilde r_\eta\|$. To this end, we define the function $\bar r_\eta$, the population version of $\tilde r_\eta$:
$$\bar r_\eta = \left(C^{(P)}_{ZZ} + \eta I\right)^{-1} m_\Pi.$$
Using $\bar r_\eta$, we can decompose the error as
$$\|\tilde r_\eta - r_0\| \le \|\tilde r_\eta - \bar r_\eta\| + \|\bar r_\eta - r_0\|.$$
The second term can be bounded using an approach similar to Theorem 6 of Singh et al. (2019).

Lemma B.3. Under Assumption 3.2, we have $\|\bar r_\eta - r_0\| \le \eta^{\beta_1}\zeta_1$.

Proof. Let $(\nu_k, e_k)$ be the eigenvalues and eigenfunctions of $C^{(P)}_{ZZ}$. Then
$$\|\bar r_\eta - r_0\|^2 = \left\|\left(\left(C^{(P)}_{ZZ} + \eta I\right)^{-1} C^{(P)}_{ZZ} - I\right) r_0\right\|^2 = \sum_k \left(\frac{\nu_k}{\nu_k + \eta} - 1\right)^2 \langle r_0, e_k\rangle^2 = \sum_k \left(\frac{\eta}{\nu_k + \eta}\right)^2 \langle r_0, e_k\rangle^2$$
$$= \eta^{2\beta_1}\sum_k \nu_k^{-2\beta_1}\langle r_0, e_k\rangle^2 \left(\frac{\eta}{\nu_k + \eta}\right)^{2-2\beta_1}\left(\frac{\nu_k}{\nu_k + \eta}\right)^{2\beta_1} \le \eta^{2\beta_1}\sum_k \nu_k^{-2\beta_1}\langle r_0, e_k\rangle^2 = \eta^{2\beta_1}\zeta_1^2,$$
where the first equality uses Lemma B.2, and the last equality holds from Assumption 3.2 and the fact that
$$\zeta_1^2 = \left\|\left(C^{(P)}_{ZZ}\right)^{-\beta_1} r_0\right\|^2 = \sum_k \nu_k^{-2\beta_1}\langle r_0, e_k\rangle^2.$$

We now establish a useful lemma on the norm of $\bar r_\eta$.

Lemma B.4. $\|\bar r_\eta\| \le \|r_0\|$.

Proof. Let $(\nu_k, e_k)$ be the eigenvalues and eigenfunctions of $C^{(P)}_{ZZ}$. From Lemma B.2, we have
$$\|\bar r_\eta\|^2 = \left\|\left(C^{(P)}_{ZZ} + \eta I\right)^{-1} C^{(P)}_{ZZ}\, r_0\right\|^2 = \sum_k \left(\frac{\nu_k}{\nu_k + \eta}\right)^2 \langle r_0, e_k\rangle^2 \le \sum_k \langle r_0, e_k\rangle^2 = \|r_0\|^2.$$

We are now ready to prove our convergence result on $\tilde r_\eta$.

Theorem B.5. Suppose Assumptions 3.1 and 3.2 hold.
Given data $\{x_i, z_i\} \sim P$ and the estimated prior embedding $\hat m_\Pi$, we have
$$\|\tilde r_\eta - r_0\| \le \frac{4\kappa_1^2\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\zeta_1$$
with probability at least $1 - \delta$ for $\delta \in (0, 2/e)$.

Proof. We decompose the error as $\|\tilde r_\eta - r_0\| \le \|\tilde r_\eta - \bar r_\eta\| + \|\bar r_\eta - r_0\|$. From Lemma B.3, we have $\|\bar r_\eta - r_0\| \le \eta^{\beta_1}\zeta_1$. For the first term, we have
$$\|\bar r_\eta - \tilde r_\eta\| = \left\|\left(\hat C^{(P)}_{ZZ} + \eta I\right)^{-1}\left(\hat C^{(P)}_{ZZ}\bar r_\eta + \eta\bar r_\eta - \hat m_\Pi\right)\right\| = \left\|\left(\hat C^{(P)}_{ZZ} + \eta I\right)^{-1}\left(\hat C^{(P)}_{ZZ}\bar r_\eta + \left(m_\Pi - C^{(P)}_{ZZ}\bar r_\eta\right) - \hat m_\Pi\right)\right\|$$
$$\le \frac{1}{\eta}\left(\left\|\left(\hat C^{(P)}_{ZZ} - C^{(P)}_{ZZ}\right)\bar r_\eta\right\| + \left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\|\right),$$
where we used $\eta\bar r_\eta = m_\Pi - C^{(P)}_{ZZ}\bar r_\eta$. By applying Proposition B.1 with $\xi = (\phi(Z)\otimes\phi(Z))\bar r_\eta$, we have
$$\left\|\left(\hat C^{(P)}_{ZZ} - C^{(P)}_{ZZ}\right)\bar r_\eta\right\| \le \frac{2\kappa_1^2\|r_0\|\log(2/\delta)}{n} + \sqrt{\frac{2\kappa_1^4\|r_0\|^2\log(2/\delta)}{n}}$$
with probability $1-\delta$, since
$$\|\xi\| \le \|\phi(Z)\otimes\phi(Z)\|\,\|\bar r_\eta\| \le \kappa_1^2\|\bar r_\eta\| \le \kappa_1^2\|r_0\|, \qquad E[\|\xi\|^2] \le \kappa_1^4\|\bar r_\eta\|^2 \le \kappa_1^4\|r_0\|^2,$$
from Lemma B.4. Since $\sqrt{2\log(2/\delta)} \le 2\log(2/\delta)$ for $\delta \in (0, 2/e)$, we finally obtain
$$\|\tilde r_\eta - r_0\| \le \frac{4\kappa_1^2\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\zeta_1.$$

Given Theorem B.5, Theorem 3.4 is now easy to prove.

Proof of Theorem 3.4. Note that
$$\|\hat r - r_0\|_\infty = \max_{z\in\mathcal{Z}}|\max(\tilde r_\eta(z), 0) - r_0(z)| \le \max_{z\in\mathcal{Z}}|\tilde r_\eta(z) - r_0(z)| \le \|\tilde r_\eta - r_0\|\max_{z\in\mathcal{Z}}\|\phi(z)\| \le \kappa_1\|\tilde r_\eta - r_0\|,$$
where the first inequality uses the fact that the density ratio $r_0$ is non-negative. From the upper bound in Theorem B.5, we obtain
$$\|\hat r - r_0\|_\infty \le \frac{4\kappa_1^3\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{\kappa_1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\kappa_1\zeta_1$$
with probability $1-\delta$ for $\delta \in (0, 2/e)$. Hence, if $\|\hat m_\Pi - E_\Pi[\phi(Z)]\| \in O_p(n^{-\alpha_1})$, by setting $\eta = O(n^{-\frac{\alpha_1}{\beta_1+1}})$, we have $\|\hat r - r_0\|_\infty \in O_p\!\left(n^{-\frac{\alpha_1\beta_1}{\beta_1+1}}\right)$.

Furthermore, we can show the following convergence result on the covariance operator.

Proof of Corollary 3.5. From Theorem 3.4, with probability $1-\delta$ for $\delta \in (0, 2/e)$, we have
$$\max_i |\hat r_i - r_0(z_i)| \le \|\hat r - r_0\|_\infty \le \frac{4\kappa_1^3\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{\kappa_1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\kappa_1\zeta_1.$$
Let $\tilde C^{(Q)}_{XX}$ be the empirical covariance operator with the true density ratio $r_0$:
$$\tilde C^{(Q)}_{XX} = \frac{1}{n}\sum_{i=1}^n r_0(z_i)\,\psi(x_i)\otimes\psi(x_i).$$
Then, we have
$$\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\| \le \left\|\hat C^{(Q)}_{XX} - \tilde C^{(Q)}_{XX}\right\| + \left\|\tilde C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\|.$$
For the first term, we have
$$\left\|\hat C^{(Q)}_{XX} - \tilde C^{(Q)}_{XX}\right\| \le \frac{1}{n}\sum_i |\hat r_i - r_0(z_i)|\,\|\psi(x_i)\otimes\psi(x_i)\| \le \kappa_2^2\left(\frac{4\kappa_1^3\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{\kappa_1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\kappa_1\zeta_1\right).$$
For the second term, by applying Proposition B.1 with $\xi = r_0(Z)\,(\psi(X)\otimes\psi(X))$, we have
$$\left\|\tilde C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\| \le \frac{2\kappa_1\kappa_2^2\|r_0\|\log(2/\delta)}{n} + \sqrt{\frac{2\left(\kappa_1\kappa_2^2\|r_0\|\right)^2\log(2/\delta)}{n}} \le \frac{4\kappa_1\kappa_2^2\|r_0\|\log(2/\delta)}{\sqrt{n}}$$
with probability $1-\delta$ for $\delta \in (0, 2/e)$, since
$$\|\xi\| = |r_0(Z)|\,\|\psi(X)\otimes\psi(X)\| \le \kappa_1\kappa_2^2\|r_0\|, \qquad E[\|\xi\|^2] \le \left(\kappa_1\kappa_2^2\|r_0\|\right)^2.$$
Hence, we have
$$\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\| \le \kappa_2^2\left(\frac{4\kappa_1^3\|r_0\|\log(2/\delta)}{\eta\sqrt{n}} + \frac{\kappa_1}{\eta}\left\|\hat m_\Pi - E_\Pi[\phi(Z)]\right\| + \eta^{\beta_1}\kappa_1\zeta_1\right) + \frac{4\kappa_1\kappa_2^2\|r_0\|\log(2/\delta)}{\sqrt{n}}$$
with probability $1 - 2\delta$ for $\delta \in (0, 2/e)$. Thus, if $\|\hat m_\Pi - E_\Pi[\phi(Z)]\| \in O_p(n^{-\alpha_1})$, by setting $\eta = O(n^{-\frac{\alpha_1}{\beta_1+1}})$, we have $\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\| \in O_p\!\left(n^{-\frac{\alpha_1\beta_1}{\beta_1+1}}\right)$.

B.1.2. PROOF OF THEOREM 3.6

Next, we derive Theorem 3.6. Again, we define the operator $E_{Q,\lambda}$, which replaces the empirical estimates in $\hat E_{Q,\lambda}$ by their population versions:
$$E_{Q,\lambda} = \left(C^{(Q)}_{XX} + \lambda I\right)^{-1} C^{(Q)}_{XZ}.$$
By following an approach similar to Lemma B.3, we obtain the following result.

Lemma B.6 (Theorem 6 in Singh et al. (2019)). Under Assumption 3.3, we have $\|E_{Q,\lambda} - E_Q\| \le \lambda^{\beta_2}\zeta_2$.

Using a proof similar to that of Lemma B.4, we also have

Lemma B.7. $\|E_{Q,\lambda}\| \le \|E_Q\|$.

Given these lemmas, we can prove Theorem 3.6 as follows.

Proof of Theorem 3.6. We decompose the error as
$$\|E_Q - \hat E_{Q,\lambda}\| \le \|E_Q - E_{Q,\lambda}\| + \|E_{Q,\lambda} - \hat E_{Q,\lambda}\|.$$
Lemma B.6 bounds the first term as $\|E_{Q,\lambda} - E_Q\| \le \lambda^{\beta_2}\zeta_2$. The second term can be bounded as follows:
$$\|E_{Q,\lambda} - \hat E_{Q,\lambda}\| = \left\|\left(\hat C^{(Q)}_{XX} + \lambda I\right)^{-1}\left(\hat C^{(Q)}_{XX}E_{Q,\lambda} + \lambda E_{Q,\lambda} - \hat C^{(Q)}_{XZ}\right)\right\| = \left\|\left(\hat C^{(Q)}_{XX} + \lambda I\right)^{-1}\left(\left(\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right)E_{Q,\lambda} - \left(\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\right)\right)\right\|$$
$$\le \frac{1}{\lambda}\left(\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\|\|E_{Q,\lambda}\| + \left\|\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\right\|\right) \le \frac{1}{\lambda}\left(\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\|\|E_Q\| + \left\|\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\right\|\right),$$
where the second equality uses $\lambda E_{Q,\lambda} = C^{(Q)}_{XZ} - C^{(Q)}_{XX}E_{Q,\lambda}$ and the last inequality uses Lemma B.7. Therefore, we have
$$\|E_Q - \hat E_{Q,\lambda}\| \le \frac{1}{\lambda}\left(\left\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\right\|\|E_Q\| + \left\|\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\right\|\right) + \lambda^{\beta_2}\zeta_2.$$
Hence, if $\|\hat C^{(Q)}_{XX} - C^{(Q)}_{XX}\| \in O_p(n^{-\alpha_2})$ and $\|\hat C^{(Q)}_{XZ} - C^{(Q)}_{XZ}\| \in O_p(n^{-\alpha_2})$, we see that $\|E_Q - \hat E_{Q,\lambda}\| \in O_p\!\left(n^{-\frac{\alpha_2\beta_2}{\beta_2+1}}\right)$ by setting $\lambda = O(n^{-\frac{\alpha_2}{\beta_2+1}})$.

B.2. Derivation for Adaptive Features

Here, we derive the loss function $\ell(\theta)$ for adaptive features. Let $\Phi = [\phi(z_1), \ldots, \phi(z_n)]$ be the feature map for $Z$.
Then the loss $\hat\ell^{(Q,\lambda)}$ can be written as
$$\hat\ell^{(Q,\lambda)}(E,\theta) = \mathrm{tr}\left((\Phi - E\Psi_\theta)\, D\, (\Phi - E\Psi_\theta)^\top\right) + \lambda\|E\|^2 = \mathrm{tr}\left(\Phi D\Phi^\top\right) - 2\,\mathrm{tr}\left(E\Psi_\theta D\Phi^\top\right) + \mathrm{tr}\left(E\Psi_\theta D\Psi_\theta^\top E^\top\right) + \lambda\|E\|^2.$$
Since, for fixed $\theta$, the minimizer of $\hat\ell^{(Q,\lambda)}$ with respect to $E$ can be written as
$$\hat E_{Q,\lambda} = \Phi D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1},$$
we have
$$\hat\ell^{(Q,\lambda)}(\hat E_{Q,\lambda},\theta) = \mathrm{tr}\left(\Phi D\Phi^\top\right) - 2\,\mathrm{tr}\left(\Phi D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\Phi^\top\right)$$
$$+\, \mathrm{tr}\left(\Phi D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\Phi^\top\right) + \lambda\,\mathrm{tr}\left(\Phi D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\Phi^\top\right)$$
$$= \mathrm{tr}\left(\Phi D\Phi^\top\right) - \mathrm{tr}\left(\Phi D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\Phi^\top\right),$$
where the last two terms combine since $\Psi_\theta D\Psi_\theta^\top + \lambda I$ cancels one of the inverses. Using $G_Z = \Phi^\top\Phi$ and the cyclic property of the trace, we have
$$\ell(\theta) = \hat\ell^{(Q,\lambda)}(\hat E_{Q,\lambda},\theta) = \mathrm{tr}\left(G_Z D\right) - \mathrm{tr}\left(G_Z D\Psi_\theta^\top\left(\Psi_\theta D\Psi_\theta^\top + \lambda I\right)^{-1}\Psi_\theta D\right).$$
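The derivation above can be checked numerically in finite dimensions: the trace expression that involves only the Gram matrix $G_Z$ must equal the original loss evaluated at the minimizing $E$. The following is an illustrative finite-dimensional check with random features, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_phi, d_psi, lam = 20, 3, 4, 0.5
Phi = rng.normal(size=(d_phi, n))       # latent features phi(z_i) as columns
Psi = rng.normal(size=(d_psi, n))       # observation features psi_theta(x_i)
D = np.diag(rng.uniform(0.1, 1.0, n))   # diagonal importance weights

# Minimizer over E of tr((Phi - E Psi) D (Phi - E Psi)^T) + lam ||E||_F^2
M = Psi @ D @ Psi.T + lam * np.eye(d_psi)
E_hat = Phi @ D @ Psi.T @ np.linalg.inv(M)

direct = (np.trace((Phi - E_hat @ Psi) @ D @ (Phi - E_hat @ Psi).T)
          + lam * np.sum(E_hat**2))

# Closed form using only the Gram matrix G_Z = Phi^T Phi
G = Phi.T @ Phi
closed = np.trace(G @ D) - np.trace(G @ D @ Psi.T @ np.linalg.inv(M) @ Psi @ D)

assert np.isclose(direct, closed)
```

Agreement of the two values confirms that the adaptive-feature objective can be evaluated from Gram matrices alone, which is what makes gradient-based training of $\psi_\theta$ practical in the kernelized setting.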