# recursive_estimation_of_conditional_kernel_mean_embeddings__ec065d98.pdf

Journal of Machine Learning Research 25 (2024) 1-35 Submitted 2/23; Revised 7/24; Published 8/24

Recursive Estimation of Conditional Kernel Mean Embeddings

Ambrus Tam as1,2 ambrus.tamas@sztaki.hu

Bal azs Csan ad Cs aji1,2 balazs.csaji@sztaki.hu

1Institute for Computer Science and Control (SZTAKI) Hungarian Research Network (HUN-REN), Kende utca 13-17, Budapest, H-1111, Hungary

2Department of Probability Theory and Statistics Institute of Mathematics, E otv os Lor and University (ELTE) P azm any P eter S et any 1 / C, Budapest, H-1117, Hungary

Editor: Kenji Fukumizu

Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued L2 space, that is in a Bochner space. We prove the weak and strong L2 consistency of our recursive estimator under mild conditions. The idea is to generalize Stone s theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.

Keywords: conditional kernel mean embeddings, recursive estimation, reproducing kernel Hilbert spaces, strong consistency, nonparametric inference, Riemannian manifolds

1. Introduction

In statistical learning we need to work with probability distributions deﬁned on a wide variety of objects, which may not even come from a linear space. Kernel mean embeddings (KMEs) oﬀer a neat representation for dealing with distributions by mapping them into a reproducing kernel Hilbert space (RKHS). These embeddings were deﬁned in (Berlinet and Thomas-Agnan, 2004) and (Smola et al., 2007). In machine learning often the conditional distribution of an output w.r.t. some input plays a key role. In these problems the concept of conditional kernel mean embeddings (CKME) is more adequate. The notion of CKME was ﬁrst recognized in (Song et al., 2009), then a reﬁned deﬁnition was given in (Song et al., 2013). Its rigorous measure theoretic background was presented in (Klebanov et al.,

c 2024 Ambrus Tam as & Bal azs Csan ad Cs aji.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0168.html.

Tam as and Cs aji

2020) and in (Park and Muandet, 2020) as an input dependent random element in an RKHS. These concepts strongly rely on the notions of Bochner integral and conditional expectation in Banach spaces of which theory is presented in (Pisier, 2016; Hyt onen et al., 2016) and Appendix A. Under some stringent assumptions Song et al. (2009) estimate the conditional kernel mean map with an empirical variant of the regularized inverse crosscovariance operator. It is shown in (Gr unew alder et al., 2012) that this estimate can also be deduced from a regularized risk minimization scheme in a vector-valued RKHS based on (Micchelli and Pontil, 2005). In (Park and Muandet, 2020) a consistency theorem is proved when the conditional kernel mean map is included in a vector-valued RKHS.

Related works In this paper we follow the general distribution-free framework presented in (Gy orﬁet al., 2002). It is recognized that estimating the conditional kernel mean map is a particular vector-valued regression problem. Vector-valued regression was already studied in (Ramsay and Silverman, 2005; Bosq and Delecroix, 1985; Dabo-Niang and Rhomari, 2009; Forzani et al., 2012; Brogat-Motte et al., 2022). Nevertheless, it is still challenging to provide suﬃcient conditions for strong L2 (mean square) consistency that can be applied in practice. We also build on the theory of stochastic approximation (Kushner and Yin, 2003; Ljung et al., 2012). Our core algorithm uses a stochastic update in each iteration. The presented method was motivated by the celebrated stochastic gradient descent algorithm, see for example (Bertsekas and Tsitsiklis, 1996, Proposition 4.2). The update term of our recursive algorithm is a modiﬁed gradient of an empirical surrogate loss function.

Contributions Our main contribution is a weakly and strongly consistent recursive algorithm for estimating the conditional kernel mean map under general structural assumptions. We present and prove a generalized version of Stone s theorem (Stone, 1977), motivated by the recursive form in (Gy orﬁet al., 2002, Theorem 25.1). Our generalization is twofold. We deal with general locally compact Polish input spaces and allow the output space to be Hilbertian. Our main assumption is that the measure of metric balls goes to zero slowly w.r.t. the radius. In (Forzani et al., 2012) a similar consistency result is proved, however the authors of that paper do not deal with functional outputs and also require the so-called Besicovitch condition to hold on the regression function. In (Dabo-Niang and Rhomari, 2009) suﬃcient conditions for L2 consistency are given for vector-valued regression, however, in this paper a diﬀerentiation condition is required for the regression function and also the rate of the small ball probabilities is bounded from below when the radius goes to zero. First, in Section 2 we present a recursive scheme for (unconditional) kernel mean embeddings. A stochastic approximation based theorem (Ljung et al., 2012) is used to prove its strong consistency. Then, in Section 3 we introduce a recursive estimator for the conditional kernel mean map. Our algorithm is formulated in a ﬂexible oﬄine manner in such a way that the online algorithm is also covered. We prove the weak L2 consistency for the general local averaging scheme in locally compact Polish spaces under very mild assumptions. We apply this consistency result to prove the strong consistency of local averaging kernel estimates with a (measurable) nonnegative, bounded and symmetric smoother sequence. This scheme requires a smoother sequence with appropriate shrinking properties w.r.t. the unknown marginal measure. We satisfy these requirements by a general mother smoother function induced sequence parameterized by two shrinkage rates. In the ﬁnal theorem only a structural condition on the probability of small balls remains. Finally, in Section 4 we

Recursive Estimation of Conditional Kernel Mean Embeddings

apply our results for three arch-typical cases. We deduce universal consistency for Euclidean spaces, Riemannian manifolds and locally compact subsets of Hilbert spaces.

Applications Here, we highlight some potential application domains of the ideas. Based on the theory of KMEs and CKMEs, a rich variety of applications were introduced. Kernel mean embeddings can be eﬃciently used, for example, in graphical models (Song et al., 2013), Bayesian inference (Fukumizu et al., 2013), dynamical systems (Song et al., 2009), reinforcement learning (Gr unew alder et al., 2012), causal inference (Sch olkopf et al., 2015) and hypothesis testing for independence (Gretton et al., 2008; Gretton and Gy orﬁ, 2008; Gretton et al., 2012), conditional independence (Fukumizu et al., 2007; Zhang et al., 2011), and binary classiﬁcation (Cs aji and Tam as, 2019) and (Tam as and Cs aji, 2022).

Recursive algorithms are crucial to handle data streams or ﬁxed, but very large datasets, cf. divide-and-conquer. Iterative methods play a key role in, e.g., reinforcement learning, system identiﬁcation, signal processing and adaptive control. Therefore, developing recursive estimators for conditional kernel mean embeddings is beneﬁcial for many domains.

2. Kernel Mean Embeddings of Distributions

Kernel methods oﬀer an elegant approach to deal with abstract sample spaces. One only needs to specify a similarity measure on the space via the kernel function to provide a geometrical structure for the learning problem. In fact, every symmetric, positive deﬁnite function, a.k.a. kernel, deﬁnes a reproducing kernel Hilbert space. The user-chosen kernel function also determines a natural projection from the sample space into the RKHS. The projected values of sample points are called feature vectors, or features. This projection map can be easily extended to (probability) measures deﬁning kernel mean embeddings. In this section we present the notion of KMEs in a statistical learning framework.

2.1 Statistical Framework

Let (Ω, A, P) be a probability space, where Ωis the sample space, A is the σ-algebra of events, and P is the probability measure. We consider a random pair (X, Y ) from a measurable product space X Y with some product σ-algebra X Y. Let us consider a metric on X and let us denote its induced Borel σ-algebra as X. We denote the joint distribution of (X, Y ) by Q (or QX,Y ) and the corresponding marginal distributions by QX and QY . In practice, distribution Q is usually unknown, however we assume that:

A1 An independent identically distributed (i.i.d.) sample {(Xi, Yi)}n i=1 is given.

In the next section we present the deﬁnition of kernel mean embeddings of probability distributions into an RKHS associated with kernel ℓ: Y Y R. With a somewhat imprecise terminology, we also talk about the kernel mean embedding of a random variable into an RKHS, by which we mean taking the KME of its distribution. Explanatory variable X will only be used for the conditional variant of KME to generate the conditioning σalgebra, consequently, for the unconditional KME, we will only use one variable, Y . In the subsequent section we focus on conditional kernel mean embeddings. The conditional kernel mean embedding of Y with respect to X into a given RKHS is the conditional expectation of the random feature deﬁned by the kernel and variable Y with respect to the

Tam as and Cs aji

σ-algebra generated by X, i.e., a conditional kernel mean embedding is a random feature vector in an RKHS measurable to some explanatory variable. The main object for CKME is the conditional kernel mean map which maps the explanatory points into RKHS G. This function maps (almost) every input point to the kernel mean embedding of the conditional distribution given that input. For a linear kernel one can recover the regression function as a particular conditional kernel mean map. When X and Y are independent, the conditional kernel map is a constant function mapping every input into the unconditional KME of Y .

2.2 Reproducing Kernel Hilbert Spaces

Let ℓbe a (real-valued, symmetric, positive deﬁnite) kernel on Y Y, where Y is a nonempty set, i.e., for all n N .= {1, 2, . . . }, y1, . . . , yn Y and a1, . . . , an R, we have

j=1 aiaj ℓ(yi, yj) 0, (1)

or equivalently kernel matrix L Rn n, where Li,j .= ℓ(yi, yj) for i, j [n] .= {1, . . . , n}, is required to be positive semideﬁnite. A Hilbert space of functions, G = {g : Y R}, is called a reproducing kernel Hilbert space if every point evaluating (linear) functional, δy : G R deﬁned by δy(g) = g(y) for y Y, is bounded (or equivalently continuous). In (Aronszajn, 1950) it is proved that for every (symmetric and positive deﬁnite) kernel ℓthere uniquely (up to isometry) exists an RKHS G, such that ℓ( , y) G and

g, ℓ( , y) G = g(y) (2)

holds for all g G and y Y. In particular, ℓ( , y1), ℓ( , y2) G = ℓ(y1, y2) is true. In the reverse direction, by the Riesz representation theorem (Lax, 2002, Theorem 4), for every RKHS G there uniquely exists a (symmetric and positive deﬁnite) kernel ℓ: Y Y R such that (2) holds for all g G and y Y. Function ℓis called the reproducing kernel of space G. For further details, see (Sch olkopf and Smola, 2002; Berlinet and Thomas-Agnan, 2004; Shawe-Taylor and Cristianini, 2004; Steinwart and Christmann, 2008).

2.3 Kernel Mean Embeddings of Marginal Distributions

In this paper we will assume that space G is separable. One can observe that if Y is a separable topological space and kernel ℓis continuous, then the separability of G follows immediately, see (Steinwart and Christmann, 2008, Lemma 4.33). Speciﬁcally, we assume

B1 Kernel ℓis measurable, E p

ℓ(Y, Y ) < , and RKHS G is separable.

In this case, by the theorem of Pettis (Hyt onen et al., 2016, Theorem 1.1.20), ℓ( , Y ) : Ω G is also measurable. Then, the kernel mean embedding of µY exists if and only if B1 holds (Hyt onen et al., 2016, Proposition 1.2.2). Thus, one can embed QY into RKHS G by

µY .= E ℓ( , Y ) , (3)

where the expectation is a Bochner integral. It is known (Smola et al., 2007) that if B1 holds, then by the Riesz representation theorem µY G and for all g G, we have

µY , g G = E g(Y ) , (4)

Recursive Estimation of Conditional Kernel Mean Embeddings

i.e., µY is the representer of the expectation functional on G w.r.t. QY .

Given an i.i.d. sample {Yi}n i=1, having distribution QY , we can estimate the kernel mean embedding of QY with the empirical average (a.k.a. sample mean), that is

i=1 ℓ( , Yi). (5)

By the strong law of large numbers, bµn µY almost surely (Taylor, 1978). We refer to (Tolstikhin et al., 2017, Proposition A.1.) to quantify the convergence rate of this estimate:

Theorem 1 Let {Yi}n i=1 be an i.i.d. sample from distribution QY on a separable topological space Y. Assume that ℓ( , y) is continuous and there exists Cℓ> 0 such that ℓ(y, y) Cℓ for all y Y. Then, for all δ (0, 1) with probability at least 1 δ, we have

It is also proved that this rate is minimax for a broad class of distributions. Note that only the measurability of ℓ( , y) is used in the proof of Tolstikhin et al. (2017).

2.4 Recursive Estimation of Kernel Mean Embeddings

As in the scalar case, these empirical estimates can be computed recursively by

bµ1 .= ℓ( , Y1), and

n bµn 1 + 1

n ℓ( , Yn) = bµn 1 + 1

ℓ( , Yn) bµn 1 , (7)

for 2 n N. More generally, one can use some possibly random stepsize (or learning rate) sequence (an)n N, taking values in (0, 1), and apply the reﬁned recursion deﬁned by

bµ1 .= ℓ( , Y1), and

bµn .= bµn 1 an bµn 1 ℓ( , Yn) , (8)

for n 2. We prove the strong consistency of (bµn)n N by applying Theorem 1.9 of Ljung et al. (2012), which is restated (for convenience) as Theorem 24 in Appendix D.

Theorem 2 Let Y1, Y2, . . . be an i.i.d. sequence, Cℓ> 0 be a constant such that ℓ(y, y) Cℓ, for all y Y, and assume B1. Let (bµn)n N be the estimate sequence deﬁned in (8). If an 0, P n a2 n < and P n an = (a.s.), then bµn µY , as n , almost surely.

Proof It is suﬃcient to check the conditions of Theorem 24 for Λ(g) .= g µY , µ .= µY , µn .= bµn and Wn .= ℓ( , Yn) µY . The conditions on sequence (an)n N are assumed, hence we only need to verify (i), (ii) and (iii) of Theorem 24. Condition (i) holds with c .= max µY G , 1 . Condition (ii) is again straigthforward, because Λ(g), g µY G = g µY 2 G. For condition (iii) let Hn .= 0 and Vn .= Wn. Using the independence of {Yi}n i=1 yields

E Vn | bµ1, V1, . . . , bµn 1, Vn 1 = E ℓ( , Yn) µY | Y1, . . . , Yn 1

= E ℓ( , Yn) µY = 0, (9)

Tam as and Cs aji

hence (Vn)n N is a martingale diﬀerence sequence. Finally

E h ℓ( , Yn) 2 G i +

E h µY 2 G i

E h ℓ(Yn, Yn) i + µY G p

Cℓ+ µY G , (10)

thus P n a2 n E Vn 2 G < . Consequently, Theorem 2 follows from Theorem 24.

3. Conditional Kernel Mean Embeddings

If the kernel mean embedding of QY exists, then for a given σ-algebra one can take the conditional expectation of ℓ( , Y ). The conditional kernel mean embedding of Y given X in RKHS G is an X measurable random element of G deﬁned by

µY |X .= E ℓ( , Y ) | X . (11)

For more details on conditional expectations in Banach spaces, see (Pisier, 2016) and Appendix A. It is known that the conditional expectation is well-deﬁned if E[ ℓ( , Y ) ] exists, thus we assume that E[ p

ℓ(Y, Y ) ] < as above. By (Kallenberg, 2002, Lemma 1.13) there exists a measurable map µ : X G such that µY |X = µ (X) (a.s.). One can use the pointwise form of µ deﬁned QX-a.s. by µ (x) = E[ ℓ( , Y ) | X = x ]. This function plays a key role, and we call it the conditional kernel mean map. By equation (94) we have that

µY |X, g G = µ (X), g G = E ℓ( , Y ) | X , g G = E h ℓ( , Y ), g G | X i = E g(Y ) | X (12)

holds for all g G. In addition, by the law of total expectation we have E[ µY |X ] = µY .

Recovering the regression function Let us consider the special case when X .= Rd, Y .= R and ℓ(y1, y2) = y1 y2. Then, the RKHS generated by ℓcan be simply identiﬁed with R by mapping ℓ(y, a) .= y a to a and the conditional kernel mean map takes the form of (µ (X))(y) = E[ Y y | X ] = y E[ Y | X ]. Therefore, by linearity, µ is equivalent to the regression function, i.e., µ (X) is the linear function determined by E[Y |X ].

Recovering a family of real-valued regression For a function g G, the regression of g(Y ) given X is a real-valued regression problem for which the results of Gy orﬁet al. (2002) can be applied. One of the main advantages of dealing with CMKEs is to produce solutions for every g G simultaneously. Moreover, the information about the conditional distribution of the outputs, given an input, can be used for various types of inference.

3.1 Universal Consistency

The main challenge addressed by the paper is to construct an estimate sequence (µn)n N for µ with strong stochastic guarantees. Since the probability distribution of the given sample is unknown, in general our estimate will not be equal to µ in ﬁnite steps. In order to measure the loss of an estimate sequence, one may choose from several error criteria. In

Recursive Estimation of Conditional Kernel Mean Embeddings

this paper we use the vector valued L2 risk with respect to QX. The main reason leading to this choice of L2 criterion is that conditional expectation can be deﬁned as a projection in an L2 space. We strengthen our assumption to ensure that the examined functions are included in the appropriate Bochner spaces. Appendix A contains an overview on Bochner spaces and Bochner integrals, where our notations are precisely deﬁned.

A2 Kernel ℓis measurable, E ℓ(Y, Y ) < , and RKHS G is separable.

Lemma 3 If A2 holds, then µ L2(X, QX; G).

Proof We can see that from A2 it follows immediately that ℓ( , Y ) L2(Ω, P; G) because Z

Ω ℓ( , Y ) 2 G d P = E ℓ(Y, Y ) < . (13)

Moreover by (93), (Hyt onen et al., 2016, Prop. 2.6.31.), we have that Z

X µ (x) 2 G d QX(x) = E µ (X) 2 G = E E[ℓ( , Y ) | X] 2 G

E E[ ℓ( , Y ) 2 G | X] = E[ℓ(Y, Y )] < , (14)

hence µ L2(X, QX; G).

For scalar-valued regression, if E[Y 2] < holds, then we have that the conditional expectation is an orthogonal projection to L2(Ω, σ(X), P|σ(X)), where σ(X) is the σ-algebra generated by X and P|σ(X) is the restriction of P into σ(X). Similarly:

Lemma 4 Under A2, µ (X) is an orthogonal projection of ℓ( , Y ) to L2(Ω, σ(X), P|σ(X); G).

Proof For any µ L2(X, QX; G) it holds (Park and Muandet, 2020, Theorem 4.2) that

E(µ) .= E ℓ( , Y ) µ(X) 2 G = E ℓ( , Y ) µ (X) + µ (X) µ(X) 2 G

= E ℓ( , Y ) µ (X) 2 G + E µ (X) µ(X) 2 G

+ 2 E h E ℓ( , Y ) µ (X), µ (X) µ(X) G | X i

= E ℓ( , Y ) µ (X) 2 G + E µ (X) µ(X) 2 G

+ 2 E E[ℓ( , Y ) | X] µ (X), µ (X) µ(X) G

= E ℓ( , Y ) µ (X) 2 G + E µ (X) µ(X) 2 G .

Thus, µ minimizes E in L2(X, QX; G).

Note that E ℓ( , Y ) µ (X) 2 G is independent of µ, therefore it is reasonable to apply E µ (X) µ(X) 2 G as an error criterion. This argument motivates the following notions of L2(X, QX; G) consistency. Recall that QX is the marginal distribution of Q w.r.t. X.

Deﬁnition 5 Let (µn)n N be a (random) estimate sequence for µ . The (random) L2 error (or more precisely L2(X, QX; G) error) of µn is deﬁned by Z

X µ (x) µn(x) 2 G d QX(x). (16)

Tam as and Cs aji

We say that (µn)n N is weakly consistent for the distribution of (X, Y ), if as n ,

X µ (x) µn(x) 2 G d QX(x) 0. (17)

We say that (µn)n N is strongly consistent for the distribution of (X, Y ), if as n ,

X µ (x) µn(x) 2 G d QX(x) a.s. 0. (18)

One can see that strong consistency does not necessarily imply weak consistency, indeed neither of these two consistency notions imply the other. Nevertheless, knowing that an estimator is weakly consistent will be of great help to also prove its strong consistency. Note that it can happen that an estimate sequence is consistent for some distributions of (X, Y ) and inconsistent for other distributions of (X, Y ). We say that (µn)n N is universally weakly (strongly) consistent if it is weakly (strongly) consistent for all possible distributions of (X, Y ). It is a somewhat surprising fact that there exist such estimates for Rd R type regression functions, see (Stone, 1977), for example, local averaging adaptive methods such as k-Nearest Neighbors (k NN), several kernel-based methods and various partitioning rules were proved to be universally consistent, see the book of Gy orﬁet al. (2002). Recently Hanneke et al. (2020) proved that there exists a universally strongly consistent classiﬁer (Opti Net) for any separable metric space and Gy orﬁand Weiss (2021) constructed a universally consistent classiﬁcation rule called Proto-NN for multiclass classiﬁcation in metric spaces whenever the space admits a universally consistent estimator. For Banach space valued regression in semi-metric spaces Dabo-Niang and Rhomari (2009) constructed estimates with asymptotic bounds under some diﬀerentiation condition on the conditional expectation function and Forzani et al. (2012) proved consistency under the Besicovitch condition. Nevertheless, it is still an open question whether there exist universally consistent estimates under more general conditions (e.g., for Hilbert space valued regression).

3.2 Generalization of Stone s Theorem

In this section, we present a generalization of Stone s theorem (Stone, 1977) for conditional kernel mean map estimates for locally compact Polish1 input spaces. This theorem will serve as our main theoretical tool to prove consistency results for our new recursive kernel algorithm. We consider local averaging methods of the general form

i=1 Wn,i(x) ℓ( , Yi), (19)

for the conditional kernel mean map, E[ ℓ( , Y ) | X = x ], in a locally compact Polish space (X, d). A Polish space gives a convenient setup for probability theory and locally compactness is required because of Lemma 26. We use the fact that the set of compactly supported continuous functions is dense in L2(X, QX) for every probability measure QX when X is

1. A Polish space is a separable, completely metrizable topological space; as the speciﬁc choice of the metric is unimportant for the problems studied in this paper, we ﬁx an arbitrary metric d which is compatible with the topology of our space, and treat a Polish space as a (complete and separable) metric space.

Recursive Estimation of Conditional Kernel Mean Embeddings

locally compact, see Theorem 21. We deal with data-dependent probability weight functions Wn,i(x) for n N and i [n], i.e., we assume that for all n N, weight Wn,i(x) is nonnegative for all i [n] and Pn i=1 Wn,i(x) = 1 for all x X. These are somewhat stronger conditions compare to the original theorem of Stone, however, in practice they are usually satisﬁed and simplify the proof. For several well-known methods (e.g, k NN, partitioning rules, Nadaraya Watson estimates) these assumptions hold immediately.

Theorem 6 Let (X, d) be a locally compact Polish space and let X be the Borel σ-algebra on X. Assume A1 and A2. Let µ (x) .= E ℓ( , Y ) | X = x and

i=1 Wn,i(x) ℓ( , Yi). (20)

Assume that:

(i) For all n N, i [n] and x X, we have

0 Wn,i(x) 1, and

i=1 Wn,i(x) = 1. (21)

(ii) There exists c > 0 such that for all f : X R+ with E[f(X)] < , where R+ denotes the nonnegative real numbers, we have

i=1 Wn,i(X)f(Xi) c E[f(X)]. (22)

(iii) For all δ > 0, we have

lim n E n X

i=1 Wn,i(X) I d(Xi, X) > δ = 0. (23)

(iv) The following convergence holds for the weight functions:

lim n E n X

i=1 W 2 n,i(X) = 0. (24)

Then, the estimate sequence (µn) is weakly consistent, that is

X µn(x) µ (x) 2 G d QX(x) = 0. (25)

Condition (i) ensures that {Wn,i}n i=1 are probability weights for all n N. Assumption (ii) intuitively says that the L1 norm of the local averaging estimate is at most some positive constant times the L1 norm of the estimated function whenever there is no noise on the observations, i.e., when Yi = f(Xi). Condition (iii) guarantees that asymptotically only those inputs aﬀect the estimate in a point x which are close to x. Condition (iv) ensures that the individual weights vanish as the sample size goes to inﬁnity.

Tam as and Cs aji

Proof The proof is based on the proof of Stone (1977) presented in (Gy orﬁet al., 2002). By ﬁrst using (a + b)2 2a2 + 2b2, we have

E h µn(X) µ (X) 2 G i (26)

i=1 Wn,i(X) ℓ( , Yi) µ (Xi)

i=1 Wn,i(X) µ (Xi) µ (X)

Then, the application of Lemma 7 and Lemma 8 completes the proof.

Lemma 7 Under the conditions of Theorem 6, we have that, as n ,

i=1 Wn,i(X) µ (Xi) µ (X)

Proof Let us use the following notation

i=1 Wn,i(X) µ (Xi) µ (X)

By Theorem 21 there exists a function µ which is bounded and continuous and

E h µ (X) µ (X) 2 G i < ε. (29)

Recall that by the theorem of Heine and Cantor, see Theorem 27 in Appendix D, a continuous function with compact support is always uniformly continuous. Using the convexity of f(x) = x G and also the elementary fact that (a + b + c)2 3a2 + 3b2 + 3c2 yield

i=1 Wn,i(X) µ (Xi) µ (X)

i=1 Wn,i(X) µ (Xi) µ (X) G

i=1 Wn,i(X) µ (Xi) µ (X) 2 G

i=1 Wn,i(X) µ (Xi) µ (Xi) 2 G

i=1 Wn,i(X) µ (Xi) µ (X) 2 G

i=1 Wn,i(X) µ (X) µ (X) 2 G

.= 3 J(1) n + 3 J(2) n + 3 J(3) n .

Recursive Estimation of Conditional Kernel Mean Embeddings

We bound these three terms separately. For any δ > 0 one can see that

i=1 Wn,i(X) µ (Xi) µ (X) 2 G I d(Xi, X) > δ #

i=1 Wn,i(X) µ (Xi) µ (X) 2 G I d(Xi, X) δ #

i=1 Wn,i(X)2 µ (Xi) 2 G + µ (X) 2 G I d(Xi, X) > δ #

i=1 Wn,i(X) µ (Xi) µ (X) 2 G I d(Xi, X) δ #

4 sup u X µ(u) 2 G E n X

i=1 Wn,i(X) I d(Xi, X) > δ

+ sup u,v X : d(u,v) δ µ (u) µ (v) 2 G .

holds. We know that function µ is uniformly continuous, hence for any ε > 0 one can choose δ > 0 such that

sup u,v X : d(u,v) δ µ (u) µ (v) 2 G < ε

hence the second term is less than ε/2. In addition, by condition (iii), for a given δ if n is large enough, then the ﬁrst term is also less than ε/2. In conclusion J(2) n converges to 0. For the third term we have that

|J(3) n | = E n X

i=1 Wn,i(X) µ (X) µ (X) 2 G

= E µ (X) µ (X) 2 G < ε.

For J(1) n the application of condition (ii) to function f(x) .= µ (x) µ (x) 2 G yields

i=1 Wn,i(X) µ (Xi) µ (Xi) 2 G

c E h µ (X) µ (X) 2 G i = c ε.

Therefore, we can conclude that Jn tends to 0, as n .

Lemma 8 Under the conditions of Theorem 6, we have that

i=1 Wn,i(X) ℓ( , Yi) µ (Xi)

Tam as and Cs aji

Proof For simplicity let

i=1 Wn,i(X) ℓ( , Yi) µ (Xi)

j=1 Wn,i(X)Wn,j(X) ℓ( , Yi) µ (Xi), ℓ( , Yj) µ (Xj) G

For i = j we have that

E Wn,i(X)Wn,j(X) ℓ( , Yi) µ (Xi), ℓ( , Yj) µ (Xj) G

= E E h Wn,i(X)Wn,j(X) ℓ( , Yi) µ (Xi), ℓ( , Yj) µ (Xj) G | X, X1, . . . Xn, Yi i

= E Wn,i(X)Wn,j(X) ℓ( , Yi) µ (Xi), E h ℓ( , Yj) | X, X1, . . . Xn, Yi i µ (Xj)

= E Wn,i(X)Wn,j(X) ℓ( , Yi) µ (Xi), E ℓ( , Yj) | Xj µ (Xj)

Furthermore, by E[ℓ(Y, Y )] < we have

E σ2(X) < (38)

for σ2(X) .= E ℓ( , Y ) µ (X) 2 G | X . By (37) we have

i=1 W 2 n,i(X) ℓ( , Yi) µ (Xi) 2 G

= E E h n X

i=1 W 2 n,i(X) ℓ( , Yi) µ (Xi) 2 G X, X1, . . . , Xn i

i=1 W 2 n,i(X)E h ℓ( , Yi) µ (Xi) 2 G X, X1, . . . , Xn i

i=1 W 2 n,i(X)E h ℓ( , Yi) µ (Xi) 2 G Xi i

i=1 W 2 n,i(X)σ2(Xi) i .

If σ2 is bounded, then In tends to zero almost surely by condition (iv). Otherwise let σ2 L be such that E | σ2(X) σ2(X)| < ε, (40)

see (Gy orﬁet al., 2002, Theorem A.1). Then

i=1 W 2 n,i(X) σ2(Xi) + E h n X

i=1 W 2 n,i(X) σ2(Xi) σ2(Xi) i

W 2 n,i(X) i + E h n X

i=1 Wn,i(X) σ2(Xi) σ2(Xi) i ,

Recursive Estimation of Conditional Kernel Mean Embeddings

where the ﬁrst term tends to zero by condition (iv) and for the second term we have

i=1 Wn,i(X) σ2(Xi) σ2(Xi) i c E σ2(X) σ2(X) c ε (42)

by condition (ii). In conclusion, In converges to 0.

3.3 Recursive Estimation of Conditional Kernel Mean Embeddings

Let (X, d) be a locally compact Polish space, as before. In this section, we apply local averaging smoothers or kernels on the input space (Gy orﬁet al., 2002). These kernels are not necessarily positive deﬁnite, therefore, to avoid misunderstanding, we call them smoothers. Nevertheless, many positive deﬁnite functions are included in the deﬁnition.

Deﬁnition 9 (smoother) A function k : X X R is called a smoother, if it is measurable, non-negative, symmetric, and bounded.

Now, we introduce a recursive local averaging estimator. Let (kn)n N be a smoother sequence and (an)n N be a stepsize (learning rate or gain) sequence of positive numbers. Let us consider the following recursive algorithm:

µ1(x) .= ℓ( , Y1), and

µn+1(x) .= 1 an+1kn+1(x, Xn+1) µn(x) + an+1kn+1(x, Xn+1)ℓ( , Yn+1), (43)

for n 1. By induction, one can see that µn is of the form deﬁned in (19). It is clear that one only needs to store µn, because in each iteration, when a new observation pair is available, one can immediately update µn by the recursion deﬁned in (43).

In practice, we often need to evaluate µ (x) on a function g G, i.e., the conditional expectation E[ g(Y ) | X = x] is sought. Then, our recursion simpliﬁes to

µ1(x), g G = ℓ( , Y1), g G = g(Y1),

µn+1(x), g G = 1 an+1kn+1(x, Xn+1) µn(x) + an+1kn+1(x, Xn+1)ℓ( , Yn+1), g G

= 1 an+1kn+1(x, Xn+1) µn(x), g G + an+1kn+1(x, Xn+1)g(Yn+1),

which can be computed in linear time with constant memory. The real-valued consistency of this method can be established, e.g., by (Gy orﬁet al., 2002, Theorem 25.1), however, we usually do not know in advance which function g G is of our interest, hence estimating the CMKE is very useful. Moreover, knowing the whole conditional distribution could also be used for various inference tasks, e.g., hypothesis testing or building prediction regions.

Theorem 10 presents suﬃcient conditions for the weak and strong consistency of our recursive scheme. It is a generalization of Theorem 25.1 of Gy orﬁet al. (2002). There are two novelties w.r.t. the result in (Gy orﬁet al., 2002). We prove consistency for Hilbert space valued regression and we cover locally compact Polish input spaces. The smoother functions are controlled on an intuitive and ﬂexible manner. They do not need to admit a particular form or be monotonous. Intuitively the smoother functions should shrink around the observed input points as n goes to inﬁnity. The rate of this decrease is controlled by stepsizes (hn) and (rn). The proposed conditions are w.r.t. a general majorant function H.

Tam as and Cs aji

Theorem 10 Let (X, d) be a locally compact Polish space. Assume that A1, A2 and the following conditions are satisﬁed.

(i) There exist sequences (hn)n N and (rn)n N of positive numbers tending to 0 and a non-negative, nonincreasing function H on [0, ) with 1/rn H(s/hn) 0 (n ) for all s (0, ).

(ii) For all n N it holds that

rn H d(x, z)/hn . (45)

(iii) The supremum of additive weights is bounded by 1, i.e.,

sup x,z X,n N ankn(x, z) 1. (46)

(iv) Measure QX does not diminish w.r.t. the smoother functions, that is

kn(x, t) d QX(t) > 0 for QX-almost all x. (47)

(v) The positive learning rates satisfy

n=1 an = , and

a2 n r2n < . (48)

Then, the estimate sequence (µn)n N deﬁned by (43) is weakly and strongly consistent.

Conditions (i) and (ii) ensure that kn(x, z) vanishes for x = z as n . The weight functions are probability weights because of assumption (iii). Condition (iv) is the strongest technical assumption in a general metric measure space, however, in several cases it is easy to verify. Condition (v) can be usually guaranteed by the choice of stepsizes.

Proof For the weak consistency, we apply Theorem 6, hence our main task is to verify its assumptions. Observe that with W1,1(x) = 1 and

1 alkl(x, Xl) aiki(x, Xi), (49)

for n 2 and i [n] we have µn(x) = Pn i=1 Wn,i(x)ℓ( , Yi). By induction, it is easy to see that the weights are probability weights. We start the procedure with a constant 1 function, then an update clearly does not change the sum of the weights for any input point x X. Condition (22) is veriﬁed in the proof of Gy orﬁet al. (2002, Theorem 25.1). Note that (22) only regards the weight functions, which are identical for real-valued regression. Condition (24) again follows from the scalar-valued case. In order to prove that condition (23) holds, we follow the proof in (Gy orﬁet al., 2002, Theorem 25.1) with necessary modiﬁcations. By using independence of {Xi}n i=1 and the inequality (1 x) e x, we have for all i n that

Recursive Estimation of Conditional Kernel Mean Embeddings

E Wn,i(x) =

l=i+1 E (1 alkl(x, Xl)) E aiki(x, Xi)

ri e Pn l=i+1 al E kl(x,Xl) n 0,

because Pn l=i+1 al E kl(x, Xl) n by condition (iv) and P n=1 an = . Let us ﬁx δ > 0. By the dominated convergence theorem, it is suﬃcient to prove that for almost all x it holds that E n X

i=1 Wn,i(x)I d(x, Xi) > δ 0. (51)

Let pn(x) .= E[kn(x, X)]. By condition (iv) we have that

2 lim inf n pn(x) > 0 (52)

holds QX-almost everywhere implying that for QX-almost all x there exists n0(x) such that, for n > n0(x) we have pn(x) p(x). Because of (50), for such x X it is suﬃcient to prove

i=n0(x) Wn,i(x)I d(x, Xi) > δ n 0. (53)

By independence one can observe that

E h Wn,i(x)I d(x, Xi) > δ i = E n Y

l=i+1 (1 alkl(x, Xl))aiki(x, Xi)I d(x, Xi) > δ

l=i+1 (1 alkl(x, Xl)) E aiki(x, Xi)I d(x, Xi) > δ

= E Wn,i(x)

E aiki(x, Xi) E aiki(x, Xi)I d(x, Xi) > δ ,

therefore, we have

i=n0(x) Wn,i(x)I d(x, Xi) > δ

i=n0(x) E Wn,i(x) E aiki(x, Xi)I d(x, Xi) > δ

E aiki(x, Xi)

i=n0(x) E Wn,i(x) H(δ/hi)

Since Pn i=n0(x) E Wn,i(x) 1, E Wn,i(x) 0 and 1/rn H(δ/hn) 0 as n the application of Lemma 29 yields

i=n0(x) E Wn,i(x) H(δ/hi)

ri p(x) 0. (56)

Tam as and Cs aji

Consequently, the weak consistency of (µn) follows from Theorem 6. The strong consistency of (µn)n N can be proved based on the almost supermartingale convergence theorem of Robbins and Siegmund (1971), see Theorem 30. The technical details of the proof are presented in Appendix C.

We presented our recursive algorithm in a general setup. It remains to construct smoother sequences and stepsizes that satisfy the proposed conditions. In the theorem that follows we generate the smoothers with the help of an R R type mother kernel K. In the next section we present several corollaries of this theorem to design consistent algorithms for many important machine learning applications.

Theorem 11 Let (X, d) be a locally compact Polish space and K : R R be a measurable, nonnegative and bounded function. Assume A1 and A2. Let (hn)n N and (rn)n N be positive sequences with zero limit. Let us deﬁne the following smoothers

kn(x, z) .= 1

rn K d(x, z)

for n N. Let (µn)n N be deﬁned as in (43). Assume that:

(i) There are positive constants R and b such that K(t) b I(t B(0, R)) for all t R,

where B(0, R) denotes the open ball with center 0 and radius R.

(ii) There is a measurable, nonnegative and nonincreasing function H on [0, ) such that 1/rn H(s/hn) 0 as n and K(s) H(s) for all s (0, ).

(iii) The nonnegative learning rates satisfy

n=1 an = , and

a2 n r2n < . (58)

(iv) The weights of the update term is bounded by 1, i.e.,

sup t R K(t) sup n N

(v) The probability of small balls is bounded from below by

lim inf n QX(B(x, R hn))

rn > 0 QX - a.s. (60)

Then, (µn)n N is weakly and strongly consistent.

Proof It is suﬃcient to verify the conditions of Theorem 10. We assumed that

1/rn H(s/hn) 0, (61)

for all s (0, ). From (57) and condition (ii) it follows that

rnkn(x, z) = K(d(x, z)/hn) H(d(x, z)/hn). (62)

Recursive Estimation of Conditional Kernel Mean Embeddings

Condition (iv) implies that

sup x,z X,n N ankn(x, z) = sup x,z X,n N

an rn K d(x, z)

sup t R K(t) sup n N

holds. Finally by condition (i) we have Z

X kn(x, t) d QX(t) =

X 1 rn K d(x, t)

X b rn I d(x, t)

hn B(0, R) d QX(t)

X b rn I t B(x, R hn) d QX(t) = b QX(B(x, R hn))

Taking the lim inf on both sides and applying condition (v) yield (47). The conditions on the stepsize sequence are assumed. Therefore, by applying Theorem 10, we can conclude that the estimate sequence, (µn)n N, is weakly and strongly consistent.

4. Demonstrative Applications

In this section we demonstrate the applicability of Theorem 11 through some characteristic model classes important for statistical learning. We consider three general setups. The main diﬀerence of these applications lies in the input spaces. We show that the requirement on the probability of small balls is satisﬁed in these cases and provide constructions for the learning rates, (an), (rn) and (hn), such that they satisfy the conditions of Theorem 11. First, Euclidean input spaces are considered, for which we prove the weak and strong universal consistency of the recursive estimator of the conditional kernel mean map. The main novelty of this corollary compared to the results of Gy orﬁet al. (2002, Chapter 25) is that our method is applicable for special Hilbert space valued regression problems. Second, we study abstract Riemannian manifolds as input spaces for which an embedding into an ambient Euclidean space may not be explicitly given. One of the motivations of this example comes from manifold learning (Tenenbaum et al., 2000; Lin and Zha, 2008), which is an actively studied area in machine learning, whose theoretical study is full of challenges. The presented recursive framework can oﬀer a new technical tool for this direction. Finally, we investigate the case of functional inputs (Ferraty and Vieu, 2006). This example is motivated by signal processing problems and methods for continuous-time systems in time-series analysis. We consider a wavelet-like approach leading to an input space that is a locally compact subset of L2(R) induced by translations of a suitable mother function.

4.1 Euclidean Spaces

When X is an Euclidean space, the weak and strong universal consistency of the recursive estimator of the conditional kernel mean map is a consequence of Theorem 11.

Corollary 12 Let X = Rp be a p-dimensional Euclidean space, with the Euclidean metric, and K : R R be a measurable, nonnegative and bounded function. Let the stepsizes be

Tam as and Cs aji

deﬁned as an .= 1/n, and let rn .= 1/n(1 ε)/2, where ε (0, 1), and hn .= p rn and kn be as in (57) for n N. Let (µn)n N be deﬁned as in (43). Assume A1, A2 and conditions (i) and (ii) from Theorem 11, then (µn)n N is weakly and strongly universally consistent.

Proof We can assume (w.l.o.g.) that K is bounded by 1. A Euclidean space is a locally compact Polish space, therefore, we can apply Theorem 11. Conditions (i) and (ii) are assumed. Our choice of stepsizes clearly satisﬁes (58) and (59). For (60) we use (Devroye, 1981, Lemma 2.2) which proves that for every probability measure QX on Rp there exists a nonnegative function g satisfying g(x) < QX-almost surely, such that

hp n QX(B(x, R hn)) g(x) QX a.s. (65)

for all R > 0. Therefore

lim inf n QX(B(x, R hn))

rn = 1 g(x) > 0 QX a.s. (66)

and then the statement of corollary is proved by applying Theorem 11.

Remark 13 Mother kernel K can be chosen based on prior knowledge on the problem. In the table that follows we list some of the possible candidates.

Name Formula

Box kernel I(|x| < B), with B > 0

Gaussian kernel exp x2

, with σ > 0

Laplace kernel exp |x|

, with σ > 0

Epanechnikov kernel 3/4(1 x2) I(|x| < 1)

One can easily verify that all of these functions are measurable, nonnegative and bounded by 1. They also satisfy conditions (i) and (ii) from Theorem 11 with H = K|[0, ).

4.2 Riemannian Manifolds

In our recursive scheme the power of hn w.r.t. rn is (at least) p for a p-dimensional Euclidean space. Manifolds can often be embedded in a high-dimensional Euclidean space, however, they are locally isomorphic to a lower dimensional one. This local isometry allows us to use the lower manifold dimension as the power of hn w.r.t. rn. Moreover, we also cover abstract p-dimensional manifolds, hence an ambient Euclidean space is not required.

Corollary 14 Let X be a p-dimensional connected, geodesically complete Riemannian manifold, with geodesical distance ϱ and K : R R be a measurable, nonnegative and bounded function. Let an .= 1/n, rn .= 1/n(1 ε)/2, where ε (0, 1), and hn = p rn and kn be as in (57) for n N. Let (µn)n N be deﬁned as in (43). Assume A1, A2 and conditions (i) and (ii) from Theorem 11, then (µn)n N is weakly and strongly universally consistent.

Recursive Estimation of Conditional Kernel Mean Embeddings

Proof Assume (w.l.o.g.) that K is bounded by 1. Every topological manifold is locally compact (Lee, 2013, Proposition 1.12). Connected Riemannian manifolds are metrizable, henceforth paracompact2 (Stone, 1948). A connected paracompact metric space is second countable, thus it is separable and admits a countable atlas. Moreover, by the Hopf-Rinow theorem a geodesically complete Riemannian manifold is complete as a metric space. Consider a countable smooth atlas A on X. We are going to prove the assumption about small ball probabilities for every chart in the atlas almost surely, hence the claim for the whole space will follow almost surely. Consider a chart (U, f) in A. If QX(U) = 0, then the condition is satisﬁed automatically, otherwise consider the probability measure QX/QX(U) on U. We push measure QX/QX(U) to Rp, i.e., consider the probability measure µ on Rp, deﬁned by µ(A) .= QX f 1[f(U) A]/QX(U) for every Borel set A in Rp. We have already seen that

lim inf n µ(B2(s, R hn))

hp n > 0 µ almost everywhere, (67)

that is there exists a measurable set Ω1 Rp such that µ(Ω1) = 1 and for all s Ω1 one has (67). Notice that one can assume that Ω1 f(U) (otherwise we can take the intersection). Let q Ω1. We know that f 1 is a smooth diﬀeomorphism, therefore there exists C > 0 such that Df 1(s) C for s B2(q, 1/C hn), where D is the diﬀerential operator and denotes the operator norm. In the next step, we prove that Bϱ(f 1(q), hn) f 1[B2(q, 1/C hn) f(U)] for hn small enough. Let s B2(q, 1/C hn) f(U). Then, there are x = f 1(q) and x .= f 1(s). By Lemma 32, we have

ϱ(x, x) = ϱ(f 1(q), f 1(s)) C q s 2 hn. (68)

It follows that

QX(Bϱ(f 1(q), hn)) QX f 1(B2(q, 1/C hn) f(U)) = µ(B2(q, 1/C hn)) QX(U). (69)

Let Ω0 = f 1[Ω1] X, for which QX(Ω0)/QX(U) = 1. Moreover, for x Ω0 we have that

lim inf n QX(Bϱ(x, hn))

hdn lim inf n QX(U)µ(B2(f(x), 1/C hn))

hdn > 0. (70)

holds for every x Ω0. Hence (70) holds QX/QX(U)-almost surely and QX-almost everywhere. The rest of the proof is similar to that of the Euclidean case.

4.3 Locally Compact Subset of a Hilbert Space

Now, we consider functional inputs, which can be useful, e.g., in signal processing. The analyzed function set is motivated by wavelets. Our model class can be seen as how a signal is encoded in a computer by storing a ﬁnite number of coeﬃcients for some basis functions at given knots, where the locations of these knots are ﬂexible and part of the representation.

2. A topological space is paracompact, if every open cover has an open reﬁnement that is locally ﬁnite.

Tam as and Cs aji

Let us consider the Hilbert space L2(R). Let ψ be a function in a Sobolev space W 1,2(R) (Berlinet and Thomas-Agnan, 2004), and τx be the translation operator on L2(R) deﬁned pointwise by τxf(t) .= f(t x). For given (constant) hyper-parameters m N, M > 0, let

M .= f : R R f =

i=1 λi τxiψ for some λ1, . . . , λm, x1, . . . , xm [ M, M] . (71)

Consider the canonical (measurable) mapping φ from [ M, M]2m to M given by

i=1 λi τxiψ, (72)

where λ = [λ1, . . . , λm]T and x = [x1, . . . , xm]T. Clearly, φ is surjective, however, it is not injective. We are going to consider probability measures of the form QX .= PX φ 1 for some PX on [ M, M]2m, i.e., we assume that QX is the pushforward of some measure on the bounded (locally compact) Polish space [ M, M]2m with the Euclidean metric.

Corollary 15 Let ψ W 1,2(R), X = M with the L2 metric, K : R R be a measurable, nonnegative and bounded function. Let an .= 1/n, rn .= 1/n(1 ε)/2, for an ε (0, 1), and hn = 2m rn and kn be as in (57) for n N. Let (µn)n N be deﬁned as in (43). Assume (i)-(ii) from Theorem 11, then (µn)n N is weakly and strongly consistent for measures of the form QX .= PX φ 1 for any distribution PX on [ M, M]2m, where φ is given by (72).

Proof We need to prove that M is locally compact and Polish. First, we show that M is complete with the L2 metric. It is known that L2 is complete. Hence, for any Cauchy sequence (fn)n N in M we have a limit function f L2(R). We show that f is also in M. Let fn = Pm i=1 λn,i τxn,iψ. We know that [ M, M] is a compact set hence for i = 1 there is a subsequence (nk)k N of N such that λnk,1 λ1. One can iteratively ﬁnd subsequences of the resulting subsequences such that for all i = 1, . . . , m the convergences of λsk,i λi and xsk,i xi hold true for a subsequence (sk)k N of N. Because of ψ W 1,2(R) function ψ is (Lebesque) almost everywhere continuous. Hence, fsk ˆf holds almost everywhere for ˆf .= Pn i=1 λiτxiψ. It is known, see (Rudin, 1987, Theorem 3.12), that this is suﬃcient for f = ˆf almost everywhere. A similar argument can be used to show that M is sequentially compact. Then, since M is a metric space, it is also compact; not just locally, but globally.

We also need to prove that for all f M

lim inf n QX(B(f, R hn))

h2m n > 0 QX-a.s. (73)

We show that for f .= Pm i=1 λiτ xiψ from

# 2 < h (74)

it follows that f f 2 < C h. Recall that on ﬁnite dimensional vector spaces all norms are equivalent, thus one can use instead of 2, i.e., there exists c > 0 such that

Recursive Estimation of Conditional Kernel Mean Embeddings

holds. In conclusion, if (74) holds, then

# c h. (76)

We use the notion of moduli of continuity to prove that f f 2 < C h. Recall that

ω(2) ψ (h) .= sup |x| h

Z |ψ(t + x) ψ(t)|2 dt. (77)

It is known (Evans, 2010, Section 5.8,Theorem 3) that for ψ W 1,2(R) there is a constant Mψ with ω(2) ψ (h) Mψh2. (78)

Using the triangle inequality and (78) one can gain

i=1 λiτ xiψ

i=1 λiτxiψ +

i=1 λiτ xiψ

i=1 λiτ xiψ

i=1 |λi λi| τxiψ 2 +

i=1 | λi| τxiψ τ xiψ 2

h c m ψ 2 + M

R |ψ(t xi) ψ(t xi)|2 dt 1/2

h c m ψ 2 + M

R |ψ(s) ψ(s + (xi xi))|2 dt 1/2

h c m ψ 2 + M

ω(2) ψ (c h)

h c m ψ 2 + M q

hence our claim is satisﬁed with C = m c ψ 2 + M p Mψ . We showed that B(f, C hn))) φ(B([λT, x T]T, hn)), therefore

QX(B(f, R hn))) PX φ 1 φ(B([x T, λT]T, R/C hn))

PX(B([λT, x T]T, R/C hn)). (80)

By (Devroye, 1981, Lemma 2.2) we conclude that

lim inf n PX(B([λT, x T]T, R/C hn))

h2m n = 1 g([λT, x T]T) > 0 PX a.s. (81)

Tam as and Cs aji

lim inf n QX(B(f, R hn))

h2m n > 0 QX a.s. (82)

By Theorem 11 the corollary follows.

5. Discussion

In this paper, we have introduced a new recursive estimator for the conditional kernel mean map, which is a key object to represent conditional distributions in Hilbert spaces. We have proved the weak and strong L2 consistency of the introduced scheme in the Bochner sense, under general structural conditions. We have considered a locally compact Polish input space and assumed only the knowledge of a kernel on the (measurable) output space. Our recursive algorithm uses a smoother sequence on the metric input space and three learning sequences, each of these having a particular role in the iterative settings. Learning rate (an) adjusts the weights of the current estimate term and the update term in each step. Sequences (hn) and (rn) are used to adaptively shrink the smoother functions w.r.t. the intrinsic measuretic structure of the metric space. Our main assumption concerns the probabilities of small balls, i.e., we assume that the measure of small balls with radius hn are comparable to rn when n goes to inﬁnity. We have generalized a result of Gy orﬁ et al. (2002, Theorem 25.1) for locally compact Polish input spaces and Hilbert space valued regressions. We have considered three important application domains of our recursive estimation scheme. Our consistency results are universal in case of Euclidean input spaces and complete Riemannian manifolds, because the small ball probability assumption can be satisﬁed by (hn) and (rn) for all probability distributions. A Hilbert space valued input set (containing functions) was also considered to illustrate the generality of the framework.

Acknowledgements

The authors warmly thank L aszl o Gerencs er for his many comments and suggestions regarding the paper. His advises helped us to shape the ﬁnal manuscript in many ways. The research was supported by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artiﬁcial Intelligence National Laboratory. This work has also been supported by the TKP2021-NKTA-01 grant of the National Research, Development and Innovation Oﬃce (NRDIO), Hungary. The authors acknowledge the professional support of the Doctoral Student Scholarship Program of the Cooperative Doctoral Program of the Ministry of Innovation and Technology, as well, ﬁnanced from an NRDIO Fund.

Appendix A.

The Bochner Integral

Bochner integral is an extended integral notion for Banach space valued functions. Let (G, G) be a Banach space with the Borel σ-ﬁeld of G and (Ω, A, P) be a measure space with a ﬁnite measure P. Similarly to the Lebesgue integral, the Bochner integral is introduced ﬁrst for simple Ω G type functions of the form f(ω) = Pn j=1 I(ω Aj) gj, where Aj A

Recursive Estimation of Conditional Kernel Mean Embeddings

and gj G for all j [n], by Z

j=1 P(Aj) gj. (83)

Then, the integral is extended to the abstract completion of these simple functions with respect to the norm deﬁned by the following Lebesgue integral

f L1(P;G) .= Z

Ω f G d P. (84)

Deﬁnition 16 A function f : Ω G is called strongly measurable (Bochner-measurable) if there exists a sequence (fn)n N of simple functions converging to f almost everywhere.

By the Pettis theorem (Hyt onen et al., 2016, Theorem 1.1.20), this is equivalent to requiring the weak measurability of f and that f(Ω) G is separable.

Deﬁnition 17 Let 1 p < . Then, the vector valued Lp(Ω, A, P; G)-space is deﬁned as

Lp(Ω, A, P; G)= L(P; G) .= n f : Ω G f is strongly measurable and Z f p G d P < o . (85)

A semi-norm on Lp(Ω, A, P; G) is deﬁned by

f Lp(P;G) .= Z f p G d P 1

p, for f Lp(Ω, A, P; G). (86)

Taking equivalence classes on Lp(Ω, A, P; G) w.r.t. semi-norm Lp yields the vector valued Lp space (Bochner space) denoted by Lp(Ω, A, P; G), or simply by Lp(Ω, P; G).

Proposition 18 Space (Lp(Ω, A, P; G), Lp(P;G)) is a Banach space for 1 p < .

Several of the well-known theorems about Lebesgue integrals can be extended to Bochner integrals, see (Dinculeanu, 2000; Hyt onen et al., 2016; Pisier, 2016).

Proposition 19 (Bochner s inequality) For all f L1(Ω, A, P; G), we have

Z f d P G Z f G d P. (87)

Moreover, f is Bochner integrable if and only if the right hand side is ﬁnite, however, recall that f is required to be strongly measurable to be included in L1(Ω, A, P; G). Furthermore, observe that R f G d P = f L1(P;G), hence the Bochner integral is a continuous operator from L1(Ω, A, P; G) to Banach space G. By deﬁnition, the Bochner integral is also linear.

Proposition 20 Let E and F be Banach spaces and T : E F be a bounded linear operator. Then, for all f L1(P; E), we have that T f L1(P; F) and

T Z f d P = Z T(f) d P. (88)

In particular, for a bounded linear functional x : G R, we have D Z

Ω f(ω) d P(ω), x E = Z

Ω f(ω), x d P(ω), (89)

where g, x .= x (g).

Tam as and Cs aji

Now, let (G, , G) be a Hilbert space. By Proposition 20 for all g G, we have

Ω f(ω) d P(ω), g E

Ω µ(ω), g G d P(ω). (90)

Moreover, it can be proved that L2(Ω, A, P; G) is also a Hilbert space with the following inner product

µ1, µ2 L2(Ω,P;G) .= Z µ1(ω), µ2(ω) G d P(ω). (91)

Conditional Expectation in Banach Spaces

Let P be a probability measure and B a sub-σ-algebra of A. The conditional expectation with respect to B is deﬁned for any f L1(Ω, A, P; G) as the P-a.s. unique element E f | B .= ν L1(Ω, B, P; G) satisfying Z

B ν d P = Z

B f d P, for all B B. (92)

It can be shown that the conditional expectation is a (well-deﬁned) bounded operator from L1(Ω, A, P; G) to L1(Ω, B, P; G) (Hyt onen et al., 2016, Theorem 2.6.23), (Pisier, 2016, Proposition 1.4). The nonexpanding property also holds in the Lp(P; G) sense for all 1 p < , see (Hyt onen et al., 2016, Corollary 2.6.30), that is for all f Lp(P; G) we have

E f | B p G E f p G | B , (93)

and by Bochner s inequality E f | B Lp(Ω;G) f Lp(Ω;G). Furthermore, by (Hyt onen et al., 2016, Prop. 2.6.31.) we can take out the B-measurable term from the conditional expectation. In particular, if G is a Hilbert space, then for all g G and f L2(Ω, A, P; G) we (a.s.), have that E µ | B , g G = E µ, g G | B . (94)

Appendix B.

In the proof of the generalized Stone s theorem a continuous approximation of the conditional kernel mean map is needed. The existence of such approximation is ensured by the following theorem.

Theorem 21 Let X be a locally compact Hausdorﬀspace equipped with a Radon measure 3

QX and let G be a separable Hilbert space. The space of compactly supported continuous functions, Cc(X; G), is dense in L2(X, QX; G), that is for all µ L2(X, QX; G) and ε > 0, there exists a µ Cc(X; G) such that Z

X µ(x) µ(x) 2 G d QX(x) < ε. (95)

A similar result is proved for Rd in (Hyt onen et al., 2016, Lemma 1.2.31). They mention in a remark that their lemma can be generalized for locally compact spaces, but the complete

3. A Radon measure is a regular Borel measure.

Recursive Estimation of Conditional Kernel Mean Embeddings

proof of their claim were not found by us in the literature. Nevertheless, the presented argument has a similar ﬂavor as the ones for scalar valued Lp spaces.

Proof First, we prove that we can approximate µ with simple functions in L2(X, QX; G). Let µ L2(X, QX; G) and ε > 0. There exists a sequence of simple functions (µn)n N such that µn µ a.s., i.e., µn(x) .= Pkn i=1 gi I(x Ai) where gi G and Ai is measurable with QX(Ai) < for all i [kn]. Let Bn .= {x X : µn(x) G 2 µ(x) G} and

hn(x) .= I(x Bn) µn(x). (96)

One can see that the (Bn)n N sequence consists of measurable sets and hn µ almost everywhere. We show that hn µ in L2(X, QX; G). Because of

hn 2 L2(X;G) = Z

Bn µn(x) 2 G d QX(x) 22 Z

µ(x) 2 G d QX(x), (97)

we have hn L2(X, QX; G) for n N. In addition, almost everywhere we have

µ(x) hn(x) 2 G (3 µ(x) G)2, (98)

therefore, by the dominated convergence theorem

lim n µ hn 2 L2(X,QX;G) = lim n

X µ(x) hn(x) 2 G d QX(x)

X lim n µ(x) hn(x) 2 G d QX(x) = 0. (99)

In conclusion, for any ε > 0 there exists a simple function h L2(X, QX; G) such that µ h L2(X,QX;G) < ε. Therefore, it is suﬃcient to approximate h with a continuous function on a compact support. Let h L2(X, QX; G) be

i=1 gi I(x Ai), (100)

where gi G and QX(Ai) < for i [N]. The main idea is to approximate the indicator functions, {gi I(x Ai)}n i=1, separately. Let us ﬁx i = 1 and ε > 0. Since QX is Radon and A1 is measurable there exists a compact set K X such that K A1 and

QX(A1 \ K) < ε2

2N2 g1 2 G , (101)

and there exists an open set U such that A1 U with

QX(U \ A1) < ε2

2N2 g1 2 G . (102)

The application of Lemma 25 yields that there exists an open set E such that E is compact and K E E U. Because of Lemma 26 there is a continuous function f1 such

Tam as and Cs aji

that f1|K = 1 and f1|X\E = 0, from which it follows that f1 has a compact support. For h1(x) .= g1 I(x A1) and h1(x) .= g1 f1(x) we have Z

h1(x) h1(x) 2

G d QX(x) = Z

X g1 I(x A1) g1 f1(x) 2 G d QX(x)

I(x A1) f1(x) 2 d QX(x) g1 2 G

E\K 1 d QX(x)

= g1 2 G QX(E \ K) g1 2 G QX(U \ K) < ε2

Similarly, one can construct hi which is close to hi in L2(X, QX; G) for all i [N]. Let µ .= PN i=1 hi. Obviously, µ is continuous and has a compact support. Furthermore, by the triangle inequality we have

X h(x) µ(x) 2 G d QX(x) =

i=1 gi I(x Ai) fi(x) 2

gi I(x Ai) fi(x) 2 G d QX(x)

which ﬁnishes the proof of the theorem.

Lemma 22 Let {(Xi, Xi)}3 i=1 be measurable spaces, let Xi, for i {1, 2, 3}, be Xi-valued independent random elements on a probability space (Ω, A, P) with push-forward measure Qi. Let f : X1 X2 X3 R and g : X1 X2 R be measurable functions such that f(X1, X2, X3) and g(X1, X2) are in L1(Ω, P). If for Q1-almost all x X1 it holds that

E f(x1, X2, X3) | X2 g(x1, X2), (105)

then almost surely

X1 f(x1, X2, X3) d Q1(x1) X2

X1 g(x1, X2) d Q1(x1). (106)

Proof Integrating out both sides in (105) w.r.t. Q1 yields Z

X1 E h f(x1, X2, X3) | X2 i d Q1(x1) Z

X1 g(x1, X2) d Q1(x1) = E[g(X1, X2)]. (107)

In addition, the left hand side is

X f(x1, X2, X3) d Q1(x1) X2 i = Z

X1 f(x1, X2, x3) d Q1(x1) d Q3(x3)

X3 f(x1, X2, x3) d Q3(x3) d Q1(x1) = Z

X1 E h f(x1, X2, X3) | X2 i d Q1(x1), (108)

because of Fubini s theorem.

Recursive Estimation of Conditional Kernel Mean Embeddings

The details of the proof for strong consistency in Theorem 10 are presented in this section.

Proof By the weak consistency of (µn)n N, Fubini s theorem and (93) we have Z E[µn(x)] µ (x) 2 G d QX(x) E Z µn(x) µ (x) 2 G d QX(x) 0. (109)

Because of (a + b)2 2a2 + 2b2, we have

E Z µn(x) E[µn(x)] 2 G d QX(x)

2 E Z µn(x) µ (x) 2 G d QX(x) + Z µ (x) E[µn(x)] 2 G d QX(x) 0. (110)

Since Z µn(x) µ (x) 2 G d QX(x)

2 Z µn(x) E[µn(x)] 2 G d QX(x) + Z µ (x) E[µn(x)] 2 G d QX(x) (111)

holds, for the strong consistency of (µn) it is suﬃcient to prove that Z µn(x) E[µn(x)] 2 G d QX(x) 0 (112)

almost surely. By (110) the integral admits a ﬁnite limit which must agree with the weak limit (which is zero). Let us consider the following expansion:

µn+1(x) E[µn+1(x)] = µn(x) E[µn(x)]

an+1 µn(x)kn+1(x, Xn+1) E µn(x)kn+1(x, Xn+1)

+ an+1 ℓ( , Yn+1)kn+1(x, Xn+1) E ℓ( , Yn+1)kn+1(x, Xn+1) .

Let Fn .= σ(X1, Y1, . . . , Xn, Yn) for n N. Taking the conditional expectation of the norm square of (113) w.r.t. Fn yields

E h µn+1(x) E[µn+1(x)] 2 G | Fn i = µn(x) E[µn(x)] 2 G

+ a2 n+1E h µn(x)kn+1(x, Xn+1) E µn(x)kn+1(x, Xn+1) 2 G | Fn i

+ a2 n+1E h ℓ( , Yn+1)kn+1(x, Xn+1) E ℓ( , Yn+1)kn+1(x, Xn+1) 2 G | Fn i

2an+1E h µn(x) E[µn(x)], µn(x)kn+1(x, Xn+1) E[µn(x)kn+1(x, Xn+1)]

+ 2an+1E h µn(x) E[µn(x)], ℓ( , Yn+1)kn+1(x, Xn+1) E[ℓ( , Yn+1)kn+1(x, Xn+1)]

2a2 n+1E h µn(x)kn+1(x, Xn+1) E[µn(x)kn+1(x, Xn+1)],

ℓ( , Yn+1)kn+1(x, Xn+1) E[ℓ( , Yn+1)kn+1(x, Xn+1)]

Tam as and Cs aji

= I1 + I2 + I3 + I4 + I5 + I6, (114)

where Ii are deﬁned respectively for i {1, . . . , 6}. We bound these terms separately using the measurability condition, the independence of the sample and condition (ii). Our main tool to prove almost sure convergence uses almost supermartingales, Theorem 30 from (Robbins and Siegmund, 1971). Therefore, our main task in this section is to bound each Ii for i {1, . . . , 6} by (1 + αn) µn(x) E[µn(x)] 2 G + βn where αn and βn are small enough, i.e., X

n N E[αn] < and X

n N E[βn] < (115)

holds. Clearly, I1 is bounded by itself. For the second term we have

I2 .= a2 n+1E h µn(x)kn+1(x, Xn+1) E µn(x)kn+1(x, Xn+1) 2 G | Fn i

E h µn(x)kn+1(x, Xn+1) 2 G | Fn i + E[µn(x)]E[kn+1(x, Xn+1)] 2 G

2 µn(x)E[kn+1(x, Xn+1)], E[µn(x)]E[kn+1(x, Xn+1)] G

µn(x) 2 G E[k2 n+1(x, Xn+1)] E[kn+1(x, Xn+1)] 2

+ µn(x) 2 G E[kn+1(x, Xn+1)] 2 2 µn(x), E[µn(x)] G E[kn+1(x, Xn+1)] 2

+ E[µn(x)] 2 G E[kn+1(x, Xn+1)] 2

µn(x) 2 G E[k2 n+1(x, Xn+1)] E[kn+1(x, Xn+1)] 2

+ µn(x) E[µn(x)] 2 G E[kn+1(x, Xn+1)] 2

H2(0)a2 n+1 r2 n+1

µn(x) 2 G + µn(x) E[µn(x)] 2 G

because recall that rn+1k(x, Xn+1) H(0). We also used that (Xn+1, Yn+1) is independent of Fn and µn(x) is measurable w.r.t. Fn. The third term can be bounded as follows

I3 .= a2 n+1E h ℓ( , Yn+1)kn+1(x, Xn+1) E ℓ( , Yn+1)kn+1(x, Xn+1) 2 G | Fn i

E h ℓ( , Yn+1)kn+1(x, Xn+1) 2 G i + E ℓ( , Yn+1)kn+1(x, Xn+1) 2 G

2E h ℓ( , Yn+1)kn+1(x, Xn+1), E ℓ( , Yn+1)kn+1(x, Xn+1) G i

= a2 n+1 E h ℓ(Yn+1, Yn+1)k2 n+1(x, Xn+1) i E ℓ( , Yn+1)kn+1(x, Xn+1) 2 G

H2(0)a2 n+1 r2 n+1 E[ℓ(Y, Y )].

The fourth term is less than 0 because

I4 = 2an+1E h µn(x) E[µn(x)], µn(x)kn+1(x, Xn+1) E[µn(x)kn+1(x, Xn+1)]

Recursive Estimation of Conditional Kernel Mean Embeddings

= 2an+1 µn(x) E[µn(x)], µn(x)E[kn+1(x, Xn+1)] E[µn(x)]E[kn+1(x, Xn+1)]

G = 2an+1 µn(x) E[µn(x)] 2 G E[kn+1(x, Xn+1)] 0. (118)

It is easy to see that I5 = 0, because of independence we have

E h µn(x) E[µn(x)], ℓ( , Yn+1)kn+1(x, Xn+1) E[ℓ( , Yn+1)kn+1(x, Xn+1)]

= µn(x) E[µn(x)], E[ℓ( , Yn+1)kn+1(x, Xn+1)] E[ℓ( , Yn+1)kn+1(x, Xn+1)]

For the last term we use the Cauchy-Schwartz inequality and 2ab a2 + b2 to show that

I6 = 2a2 n+1E h µn(x)kn+1(x, Xn+1) E[µn(x)kn+1(x, Xn+1)],

ℓ( , Yn+1)kn+1(x, Xn+1) E[ℓ( , Yn+1)kn+1(x, Xn+1)]

E h µn(x)kn+1(x, Xn+1), ℓ( , Yn+1)kn+1(x, Xn+1)

E[µn(x)kn+1(x, Xn+1)], E[ℓ( , Yn+1)kn+1(x, Xn+1) | Fn]

G E[µn(x)kn+1(x, Xn+1) | Fn], E[ℓ( , Yn+1)kn+1(x, Xn+1)]

+ E[µn(x)kn+1(x, Xn+1)], E[ℓ( , Yn+1)kn+1(x, Xn+1) | Fn]

= 2a2 n+1E k2 n+1(x, Xn+1) µn(x), ℓ( , Yn+1)

+ 2a2 n+1E[kn+1(x, Xn+1)] µn(x), E[ℓ( , Yn+1)kn+1(x, Xn+1)]

2a2 n+1E[k2 n+1(x, Xn+1) µn(x) G

ℓ(Yn+1, Yn+1) | Fn]

+ 2a2 n+1E[kn+1(x, Xn+1)] µn(x) G E[ℓ( , Yn+1)kn+1(x, Xn+1)] G

2a2 n+1 µn(x) G E H2(0)

ℓ(Yn+1, Yn+1) | Fn

+ 2a2 n+1 µn(x) G H2(0)

r2 n+1 E[ ℓ( , Yn+1) G]

= 4H2(0)a2 n+1 r2 n+1 µn(x) G E hq

ℓ(Y, Y ) i 2H2(0)a2 n+1 r2 n+1

µn(x) 2 G + E[ℓ(Y, Y )] .

By summarizing these upper bounds we have almost surely that

E h µn+1(x) E[µn+1(x)] 2 G | Fn i (121)

1 + H2(0)a2 n+1 r2 n+1

µn(x) E[µn(x)] 2 G + 3H2(0)a2 n+1 r2 n+1

µn(x) 2 G + E[ℓ(Y, Y )]

for all x X. By Lemma 22 one can integrate w.r.t. QX to prove that

E h Z µn+1(x) E[µn+1(x)] 2 G d QX(x) | Fn i

1 + H2(0)a2 n+1 r2 n+1

Z µn(x) E[µn(x)] 2 G d QX(x)

+ 3H2(0)a2 n+1 r2 n+1

Z µn(x) 2 G d QX(x) + E[ℓ(Y, Y )] .

Tam as and Cs aji

Observe that E[ℓ(Y, Y )] < and E R µn(x) 2 G d QX(x) is convergent, hence also bounded. Consequently, from X

H2(0)a2 n+1 r2 n+1 <

it follows that R µn+1(x) E[µn+1(x)] 2 G d QX(x) converges almost surely by Theorem 30. The almost surely constant limit value is zero because of our previous argument.

In this appendix, for convenience, we state the main theorems that are crucial for our proofs. An important result is the generalized SLLN for Hilbert space valued random elements. The following theorem is from the book of Taylor (1978, Theorem 3.2.4).

Theorem 23 If (Xn)n N is a sequence of independent random elements in a separable Hilbert space such that X

n2 < , (123)

where var (Xn) .= E h Xn EXn 2 i , then

i=1 (Xi EXi)

a.s. 0. (124)

We referred to (Ljung et al., 2012, Theorem 1.9) to prove the consistency of our recursive (unconditional) kernel mean embedding estimate of a marginal distribution.

Theorem 24 Let G be a real separable Hilbert space endowed with the Borel σ-algebra and Λ : G G be measurable and µ G (usually but not necessarily Λ(µ) = 0). Let further µn, Wn, Hn, Vn be G-valued random elements for n N with µ0 = 0,

µn+1 .= µn an(Λ(µn) Wn) and Wn = Hn + Vn, (125)

where an 0, P n a2 n < and P n an = . Assume that

(i) There exists c > 0 such that for all g G we have Λ(g) G c(1 + g G).

(ii) For all K [1, ) we have inf{ Λ(g), g µ G | g G with 1

K g µ G K} > 0.

(iii) For all n N random elements Hn and Vn are square integrable with

n an E Hn G < , X

n a2 n E[ Hn 2 G] < ,

E Vn | µ1, H1, V1, . . . , Hn 1, Vn 1 = 0, and X

n a2 n E[ Vn 2 G] < . (126)

Then, µn µ (n ) almost surely.

Recursive Estimation of Conditional Kernel Mean Embeddings

We applied the following forms of the fundamental topological results from (Nagy, 2001, Lemma 5.1) and (Nagy, 2001, Theorem 5.1) to prove Theorem 21.

Lemma 25 Let (X, T ) be a locally compact Hausdorﬀspace. Let K be a compact subset of X, and let U be an open set, with K U. Then, there exists an open set E such that E is compact and K E s E U.

Lemma 26 (Urysohn) Let (X, T ) be a locally compact Hausdorﬀspace. Let K and F be disjoint subsets of X, where K is compact and F is closed. Then, there exists a continuous function f : X [0, 1] such that f|K = 1 and f|F = 0.

We also used the following well-known facts from real analysis, see (Dudley, 2002, Corollary 2.4.6), (Rudin, 1976, Theorem 4.15) and (Ruder, 1966, Theorem 2)

Theorem 27 (Heine-Cantor) A continuous function from a compact metric space to any metric space is uniformly continuous.

Theorem 28 (Weierstrass) If f is a continuous mapping of a compact metric space X into R, then f is bounded.

Lemma 29 (Toeplitz) Let (an,i)n,i N be a doubly indexed array of real numbers such that

i=1 |an,i| < , and lim n an,i = 0, (127)

for any i N. If xn x as n , then limn P i=1 an,ixi = 0.

We applied a simpliﬁed version of the theorem of Robbins and Siegmund (1971, Theorem 1) to prove the strong consistency of our recursive algorithm.

Theorem 30 (almost supermartingale convergence) Let F1 F2 F be a ﬁltration, i.e., a nondecreasing sequence of σ-algebras. For every n N let Vn, αn and βn be non-negative Fn measurable random variables such that

E[ Vn+1 | Fn ] (1 + αn)Vn + βn (128)

holds almost surely. If P n=1 E[αn] < and P n=1 E[βn] < hold almost surely, then (Vn)n N converges almost surely to a ﬁnite limit.

The proof of Corollary 14 used the following theorem (Lee, 2006, Theorem 6.13).

Theorem 31 (Hopf-Rinow) A connected Riemannian manifold is geodesically complete if and only if it is complete as a metric space.

We used the Lipschitzness of smooth maps with bounded derivatives between Riemannian manifolds w.r.t. the geodesical distance. A published version of the following proof was not found by us, however, the argument is straightforward.

Tam as and Cs aji

Lemma 32 Let (M, g) and (N, h) be Riemannian manifolds and f a smooth M N map. If there exists C > 0 such that Df(x) C, for all x M, where denotes the operator norm, then for all x, y M one has

d N(f(x), f(y)) C d M(x, y), (129)

where d M and d N are the geodesical distances.

Proof Let x, y M. For all ε > 0 there exists a smooth curve γ : [0, 1] M such that γ(0) = x and γ(1) = y and

γ (t) g d M(x, y) + ε. (130)

Observe that f γ is a smooth curve from f(x) to f(y) in N, hence

d N(f(x), f(y)) Z 1

(f γ) (t) h dt = Z 1

Df(γ(t))(γ (t)) h dt

0 Df(γ(t)) γ (t) g dt C (d M(x, y) + ε) (131)

holds. It is true for all ε > 0 thus d N(f(x), f(y)) C d M(x, y) also holds.

Nachman Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68(3):337 404, 1950.

Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2004.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientiﬁc, 1996.

Denis Bosq and Michel Delecroix. Nonparametric prediction of a Hilbert space valued random variable. Stochastic Processes and their Applications, 19(2):271 280, 1985.

Luc Brogat-Motte, Alessandro Rudi, C eline Brouard, Juho Rousu, and Florence d Alch e Buc. Vector-valued least-squares regression under output regularity assumptions. Journal of Machine Learning Research, 23(344):1 50, 2022.

Bal azs Csan ad Cs aji and Ambrus Tam as. Semi-parametric uncertainty bounds for binary classiﬁcation. IEEE Conference on Decision and Control (CDC), pages 4427 4432, 2019.

Sophie Dabo-Niang and Noureddine Rhomari. Kernel regression estimation in a Banach space. Journal of Statistical Planning and Inference, 139(4):1421 1434, 2009.

Luc Devroye. On the almost everywhere convergence of nonparametric regression function estimates. Annals of Statistics, pages 1310 1319, 1981.

Recursive Estimation of Conditional Kernel Mean Embeddings

Nicolae Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces, volume 48. John Wiley & Sons, 2000.

Richard M Dudley. Real Analysis and Probability. Cambridge University Press, 2002.

Lawrence C Evans. Partial Diﬀerential Equations, volume 19. American Mathematical Soc., 2010.

Fr ed eric Ferraty and Philippe Vieu. Nonparametric Functional Data Analysis: Theory and Practice, volume 76. Springer, 2006.

Liliana Forzani, Ricardo Fraiman, and Pamela Llop. Consistent nonparametric regression for functional data under the Stone Besicovitch conditions. IEEE Transactions on Information Theory, 58(11):6697 6708, 2012.

Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Sch olkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems, pages 489 496, 2007.

Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes rule: Bayesian inference with positive deﬁnite kernels. Journal of Machine Learning Research, 14(1):3753 3783, 2013.

Arthur Gretton and L aszl o Gy orﬁ. Nonparametric independence tests: Space partitioning and kernel approaches. International Conference on Algorithmic Learning Theory, pages 183 198, 2008.

Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Sch olkopf, and Alex Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems, pages 585 592, 2008.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723 773, 2012.

Steﬀen Gr unew alder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimilano Pontil. Conditional mean embeddings as regressors. International Conference on Machine Learning, pages 1803 1810, 2012.

Steﬀen Gr unew alder, Guy Lever, Luca Baldassarre, Massi Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. International Conference on Machine Learning, pages 535 542, 2012.

L aszl o Gy orﬁand Roi Weiss. Universal consistency and rates of convergence of multiclass prototype algorithms in metric spaces. Journal of Machine Learning Research, 22(151): 1 25, 2021.

L aszl o Gy orﬁ, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

Tam as and Cs aji

Steve Hanneke, Aryeh Kontorovich, Sivan Sabato, and Roi Weiss. Universal Bayes consistency in metric spaces. Information Theory and Applications Workshop, pages 1 33, 2020.

Tuomas Hyt onen, Jan Van Neerven, Mark Veraar, and Lutz Weis. Analysis in Banach Spaces, volume 12. Springer, 2016.

Olav Kallenberg. Foundations of Modern Probability. Springer Science & Business Media, second edition, 2002.

Ilja Klebanov, Ingmar Schuster, and Timothy John Sullivan. A rigorous theory of conditional mean embeddings. SIAM Journal on Mathematics of Data Science, 2(3):583 606, 2020.

Harold J Kushner and G George Yin. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, second edition, 2003.

Peter D Lax. Functional Analysis, volume 55. John Wiley & Sons, 2002.

John M Lee. Riemannian Manifolds: An Introduction to Curvature. Springer Science & Business Media, 2006.

John M Lee. Introduction to Smooth Manifolds. Springer Science & Business Media, 2013.

Tong Lin and Hongbin Zha. Riemannian manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):796 809, 2008.

Lennart Ljung, Georg Pﬂug, and Harro Walk. Stochastic Approximation and Optimization of Random Systems, volume 17. Birkh auser, 2012.

Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177 204, 2005.

Gabriel Nagy. Real analysis, 2001. Electronic textbook at Kansas State University.

Junhyung Park and Krikamol Muandet. A measure-theoretic approach to kernel conditional mean embeddings. Advances in Neural Information Processing Systems, pages 21247 21259, 2020.

Gilles Pisier. Martingales in Banach Spaces, volume 155. Cambridge University Press, 2016.

James O Ramsay and Bernhard W Silverman. Functional Data Analysis. Springer, 2005.

Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics, pages 233 257. Elsevier, 1971.

Brian Ruder. The Silverman-Toeplitz theorem. Kansas State University, 1966.

Walter Rudin. Principles of Mathematical Analysis. Mc Graw-Hill Book Company, New York, 1976.

Recursive Estimation of Conditional Kernel Mean Embeddings

Walter Rudin. Real and Complex Analysis. Mc Graw-Hill Book Company, New York, 1987.

Bernhard Sch olkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Bernhard Sch olkopf, Krikamol Muandet, Kenji Fukumizu, Stefan Harmeling, and Jonas Peters. Computing functions of random variables via reproducing kernel Hilbert space representations. Statistics and Computing, 25(4):755 766, 2015.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Alex Smola, Arthur Gretton, Le Song, and Bernhard Sch olkopf. A Hilbert space embedding for distributions. International Conference on Algorithmic Learning Theory, pages 13 31, 2007.

Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. International Conference on Machine Learning, pages 961 968, 2009.

Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A uniﬁed kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98 111, 2013.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Arthur H Stone. Paracompactness and product spaces. Bulletin of the American Mathematical Society, 54(10):977 982, 1948.

Charles J Stone. Consistent nonparametric regression. Annals of Statistics, pages 595 620, 1977.

Ambrus Tam as and Bal azs Csan ad Cs aji. Exact distribution-free hypothesis tests for the regression function of binary classiﬁcation via conditional kernel mean embeddings. IEEE Control Systems Letters, 6:860 865, 2022.

Robert L Taylor. Stochastic Convergence of Weighted Sums of Random Elements in Linear Spaces, volume 672. Springer, 1978.

Joshua B Tenenbaum, Vin de Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, 2000.

Ilya Tolstikhin, Bharath K Sriperumbudur, and Krikamol Muandet. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18(1):3002 3048, 2017.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Sch olkopf. Kernel-based conditional independence test and application in causal discovery. Conference on Uncertainty in Artiﬁcial Intelligence, pages 804 813, 2011.