# Consistent Interpolating Ensembles via the Manifold-Hilbert Kernel

Yutong Wang, University of Michigan, yutongw@umich.edu
Clayton D. Scott, University of Michigan, clayscot@umich.edu

Recent research in the theory of overparametrized learning has sought to establish generalization guarantees in the interpolating regime. Such results have been established for a few common classes of methods, but so far not for ensemble methods. We devise an ensemble classification method that simultaneously interpolates the training data and is consistent for a broad class of data distributions. To this end, we define the manifold-Hilbert kernel for data distributed on a Riemannian manifold. We prove that kernel smoothing regression and classification using the manifold-Hilbert kernel are weakly consistent in the setting of Devroye et al. [22]. For the sphere, we show that the manifold-Hilbert kernel can be realized as a weighted random partition kernel, which arises as an infinite ensemble of partition-based classifiers.

1 Introduction

Ensemble methods are among the most often applied learning algorithms, yet their theoretical properties have not been fully understood [12]. Based on empirical evidence, Wyner et al. [42] conjectured that interpolation of the training data plays a key role in explaining the success of AdaBoost and random forests. However, while a few classes of learning methods have been analyzed in the interpolating regime [6, 4], ensembles have not. Towards developing the theory of interpolating ensembles, we examine an ensemble classification method for data distributed on the sphere, and show that this classifier interpolates the training data and is consistent for a broad class of data distributions.

To show this result, we develop two additional contributions that may be of independent interest. First, for data distributed on a Riemannian manifold $M$, we introduce the manifold-Hilbert kernel $K^H_M$, a manifold extension of the Hilbert kernel [39]. Under the same setting as Devroye et al. [22], we prove that kernel smoothing regression with $K^H_M$ is weakly consistent while interpolating the training data. Consequently, the classifier obtained by taking the sign of the kernel smoothing estimate has zero training error and is consistent. Second, we introduce a class of kernels called weighted random partition kernels. These are kernels that can be realized as an infinite, weighted ensemble of partition-based histogram classifiers.

Our main result is established by showing that when $M = S^d$, the $d$-dimensional sphere, the manifold-Hilbert kernel is a weighted random partition kernel. In particular, we show that on the sphere, the manifold-Hilbert kernel is a weighted ensemble based on random hyperplane arrangements. This implies that the kernel smoothing classifier is a consistent, interpolating ensemble on $S^d$. To our knowledge, this is the first demonstration of an interpolating ensemble method that is consistent for a broad class of distributions in arbitrary dimensions.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: An example of the weighted infinite-ensemble $\hat u(x \mid D_n, K^{\mathsf{rp}}_M)$ (defined in Eqn. 4) of histogram classifiers $\hat h(x \mid D_n, \mathcal P)$ (defined in Eqn. 2). The partitions $\mathcal P_1, \mathcal P_2$ are induced by hyperplane arrangements (denoted by dotted lines).

1.1 Problem statement

Consider the problem of binary classification on a Riemannian manifold $M$.
Let $(X, Y)$ be random variables jointly distributed on $M \times \{\pm 1\}$. Let $D_n := \{(X_i, Y_i)\}_{i=1}^n$ be the (random) training data consisting of $n$ i.i.d. copies of $(X, Y)$. A classifier, i.e., a mapping from $D_n$ to a function $\hat f(\cdot \mid D_n) : M \to \{\pm 1\}$, has the interpolating-consistent property if, when $X$ has a continuous distribution, both of the following hold: 1) $\hat f(X_i \mid D_n) = Y_i$ for all $i \in \{1, \dots, n\}$, and 2)
$$\Pr\{\hat f(X \mid D_n) \ne Y\} \to \inf_{f : M \to \{\pm 1\}\ \text{measurable}} \Pr\{f(X) \ne Y\} \quad \text{in probability as } n \to \infty. \quad (1)$$
Our goal is to find an interpolating-consistent ensemble of histogram classifiers, to be defined below.

A partition of $M$, denoted by $\mathcal P$, is a set of subsets of $M$ such that $P \cap P' = \emptyset$ for all distinct $P, P' \in \mathcal P$ and $M = \bigcup_{P \in \mathcal P} P$. Given $x \in M$, let $\mathcal P[x]$ denote the unique element $P \in \mathcal P$ such that $x \in P$. The set of all partitions of $M$ is denoted $\operatorname{Part}(M)$. The histogram classifier with respect to $D_n$ over $\mathcal P$ is the sign of the function $\hat h(\cdot \mid D_n, \mathcal P) : M \to \mathbb R$ given by
$$\hat h(x \mid D_n, \mathcal P) := \sum_{i=1}^n Y_i\, \mathbb I\{x \in \mathcal P[X_i]\}, \quad (2)$$
where $\mathbb I$ is the indicator function. See Figure 1, left panels.

Definition 1.1. A weighted random partition (WRP) over $M$ is a 3-tuple $(\Theta, P, \alpha)$ consisting of (i) a parameter space of partitions: a set $\Theta$ with $\mathcal P_\theta \in \operatorname{Part}(M)$ for each $\theta \in \Theta$; (ii) random partitions: a probability measure $P$ on $\Theta$; and (iii) weights: a nonnegative function $\alpha : \Theta \to \mathbb R_{\ge 0}$ that is integrable with respect to the measure $P$.

Example 1.2 (Regular partition of the $d$-cube). Let $M = [0,1]^d$ and $\Theta = \{1, 2, \dots\} =: \mathbb N_+$. For each $n \in \mathbb N_+$, denote by $\mathcal P_n$ the regular partition of $M$ into $n^d$ $d$-cubes of side length $1/n$. For any probability mass function $P$ on $\mathbb N_+$ and weights $\alpha : \mathbb N_+ \to \mathbb R_{\ge 0}$, the 3-tuple $(\Theta, P, \alpha)$ is a WRP.

Below, WRPs will be denoted with two-letter names in sans-serif font, e.g., $\mathsf{rp}$ for a generic WRP, and $\mathsf{ha}$ for the weighted hyperplane arrangement random partition (Definition 5.1). The weighted random partition kernel associated to $\mathsf{rp} = (\Theta, P, \alpha)$ is defined as
$$K^{\mathsf{rp}}_M : M \times M \to \mathbb R_{\ge 0} \cup \{\infty\}, \qquad K^{\mathsf{rp}}_M(x, z) := \mathbb E_{\theta \sim P}[\alpha(\theta)\, \mathbb I\{x \in \mathcal P_\theta[z]\}]. \quad (3)$$
When $\alpha \equiv 1$, we recover the notion of unweighted random partition kernel introduced in [21]. Note that the kernel is symmetric since $\mathbb I\{x \in \mathcal P_\theta[z]\} = \mathbb I\{z \in \mathcal P_\theta[x]\}$. If $K^{\mathsf{rp}}_M < \infty$ everywhere, then $K^{\mathsf{rp}}_M$ is a positive definite (PD) kernel. When $K^{\mathsf{rp}}_M$ can evaluate to $\infty$, the definition of a PD kernel is not applicable, since positive definiteness is defined only for kernels taking finite values [10].

Let $\operatorname{sgn} : \mathbb R \cup \{\pm\infty\} \to \{\pm 1\}$ be the sign function. For a WRP, define the weighted infinite-ensemble
$$\hat u(x \mid D_n, K^{\mathsf{rp}}_M) := \sum_{i=1}^n Y_i\, K^{\mathsf{rp}}_M(x, X_i) = \mathbb E_{\theta \sim P}[\alpha(\theta)\, \hat h(x \mid D_n, \mathcal P_\theta)]. \quad (4)$$
Note that the equality on the right follows immediately from linearity of expectation and the definition of $\hat h(\cdot \mid D_n, \mathcal P_\theta)$ in Equation (2). See Figure 1, right panel.

Main problem. Find a WRP such that $\operatorname{sgn}(\hat u(\cdot \mid D_n, K^{\mathsf{rp}}_M))$ has the interpolating-consistent property.

1.2 Outline of approach and contributions

In the regression setting, we have $(X, Y)$ jointly distributed on $M \times \mathbb R$. Let $m(x) := \mathbb E[Y \mid X = x]$. Recall from Belkin et al. [8, Equation (7)] the definition of the kernel smoothing estimator with a so-called singular kernel $K : M \times M \to [0, +\infty]$ (the modifier "singular" refers to the fact that $K(x, x) = +\infty$ for all $x \in M$):
$$\hat m(x \mid D_n, K) := \begin{cases} Y_i & \text{if } x = X_i \text{ for some } i \in [n], \\[4pt] \dfrac{\sum_{i=1}^n Y_i K(x, X_i)}{\sum_{j=1}^n K(x, X_j)} & \text{if } \sum_{j=1}^n K(x, X_j) > 0, \\[4pt] 0 & \text{otherwise.} \end{cases} \quad (5)$$
We note that Equation (5) is referred to as the Nadaraya-Watson estimate in [8]. When there is no ambiguity, we simply write $\hat m_n(x)$ instead of $\hat m(x \mid D_n, K)$, and similarly $\hat u_n(x)$ instead of $\hat u(x \mid D_n, K)$ from earlier. Note that $\operatorname{sgn}(\hat m_n(x)) = \operatorname{sgn}(\hat u_n(x))$ if $\sum_{j=1}^n K(x, X_j) > 0$. Observe that $\hat m_n$ is interpolating by construction. Let $\mu_X$ denote the marginal distribution of $X$.
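As a concrete illustration of Equation (5), the following minimal numpy sketch (ours, not from the paper; the function name and toy data are illustrative only) implements the estimator for the Euclidean Hilbert kernel $K^H_{\mathbb R^d}(x, z) = \|x - z\|^{-d}$ discussed next and checks the interpolation property:

```python
import numpy as np

def hilbert_nw(x, X, Y, d):
    """Kernel smoothing estimate (Eq. 5) with the Hilbert kernel K(x, z) = ||x - z||^(-d)."""
    dists = np.linalg.norm(X - x, axis=1)
    hit = np.flatnonzero(dists == 0.0)
    if hit.size > 0:                      # first case of Eq. (5): x is a training point
        return float(Y[hit[0]])
    w = dists ** (-d)                     # singular kernel weights; all positive here,
    return float(w @ Y / w.sum())         # so the third case of Eq. (5) never triggers

# toy check of the interpolation property on R^2
rng = np.random.default_rng(0)
d, n = 2, 50
X = rng.normal(size=(n, d))
Y = rng.choice([-1.0, 1.0], size=n)
assert all(hilbert_nw(X[i], X, Y, d) == Y[i] for i in range(n))
```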
The $L^1$-error of $\hat m_n$ in approximating $m$ is $J_n := \int_M |\hat m_n(x) - m(x)|\, \mu_X(dx)$. For $M = \mathbb R^d$ and the Hilbert kernel defined by $K^H_{\mathbb R^d}(x, z) := \|x - z\|^{-d}$, Devroye et al. [22] proved $L^1$-consistency for regression: $J_n \to 0$ in probability when $Y$ is bounded and $X$ is continuously distributed.

Our contributions. Our primary contribution is to demonstrate an ensemble method with the interpolating-consistent property. Toward this end, in Section 3 we introduce the manifold-Hilbert kernel $K^H_M$ on a Riemannian manifold $M$. We show that when $M$ is complete, connected, and smooth, kernel smoothing regression with $K^H_M$ has the same consistency guarantee (Theorem 3.2) as $K^H_{\mathbb R^d}$ mentioned in the preceding paragraph. In Section 5, we consider the case $M = S^d$ and show that the manifold-Hilbert kernel $K^H_{S^d}$ is a weighted random partition kernel (Theorem 5.2). Devroye et al. [22, Section 7] observed that the $L^1$-consistency of $\hat m_n$ for regression implies the consistency for classification of $\operatorname{sgn} \hat u_n$. Furthermore, that $\hat m_n$ is interpolating for regression implies that $\operatorname{sgn} \hat u_n$ is interpolating for classification. These observations, together with our results, demonstrate the existence of a weighted infinite-ensemble classifier with the interpolating-consistent property.

1.3 Related work

Kernel regression. Kernel smoothing regression, or simply kernel regression, is an interpolator when the kernel used is singular, a fact known to Shepard [39] in 1968. Devroye et al. [22] showed that kernel regression with the Hilbert kernel is interpolating and weakly consistent for data with a density and bounded labels. Using singular kernels with compact support, Belkin et al. [8] showed that minimax optimality can be achieved under additional distributional assumptions.

Random forests. Wyner et al. [42] proposed that interpolation may be a key mechanism for the success of random forests and gave a compelling intuitive rationale. Belkin et al. [6] studied empirically the double descent phenomenon in random forests by considering the generalization performance past the interpolation threshold. The PERT variant of random forests, introduced by Cutler and Zhao [20], provably interpolates in one dimension. Belkin et al. [7] pose as an interesting question whether the result of Cutler and Zhao [20] extends to higher dimensions. Many works have established consistency of random forests and their variants under different settings [15, 11, 38]. However, none of these works addresses interpolation.

Boosting. For classification in the noiseless setting (i.e., when the Bayes error is zero), AdaBoost is interpolating and consistent (see Freund and Schapire [26, first paragraph of Chapter 12]). However, this setting is too restrictive, and the result does not settle whether consistency is possible when fitting the noise. Bartlett and Traskin [5] proved that AdaBoost with early stopping is universally consistent, though without the interpolation guarantee. To the best of our knowledge, whether AdaBoost or any other variant of boosting can be both interpolating and consistent remains open.

Random partition kernels. Breiman [16] and Geurts et al. [28] studied infinite ensembles of simplified variants of random forests and connections to certain kernels. Davies and Ghahramani [21] formalized this connection and coined the term random partition kernel. Scornet [37] further developed the theory of random forest kernels and obtained upper bounds on the rate of convergence. However, it is not clear whether these variants of random forests are interpolating. Previously defined (unweighted) random partition kernels are bounded and thus cannot be singular. On the other hand, the manifold-Hilbert kernel is always singular. To bridge between ensemble methods and the theory of interpolating kernel smoothing regression, we propose weighted random partitions (Definition 1.1), whose associated kernel (Equation 3) can be singular.
Learning on Riemannian manifolds. Strong consistency of a kernel-based classification method on manifolds has been established by Loubes and Pelletier [32]. However, the result requires the kernel to be bounded, and thus the method is not guaranteed to be interpolating. See Feragen and Hauberg [25] for a review of theoretical results regarding kernels on Riemannian manifolds. Beyond kernel methods, other classical methods for Euclidean data have been extended to Riemannian manifolds, e.g., regression [40], classification [43], and dimensionality reduction and clustering [44, 34]. To the best of our knowledge, no previous work has demonstrated an interpolating-consistent classifier on manifolds other than $\mathbb R^d$.

In many applications, the data naturally belong to a Riemannian manifold. Spherical data arise in a range of disciplines in the natural sciences; see the influential textbook by Mardia and Jupp [33, Ch. 1-4]. For applications of the Grassmannian manifold in computer vision, see Jayasumana et al. [30] and the references therein. Topological data analysis [41] presents another interesting setting of manifold-valued data in the form of persistence diagrams [3, 31].

2 Background on Riemannian Manifolds

We give an intuitive overview of the necessary concepts and results on Riemannian manifolds. A longer, more precise version of this overview is in the Supplemental Materials, Section A.1.

A smooth $d$-dimensional manifold $M$ is a topological space that is locally diffeomorphic (a diffeomorphism is a smooth bijection whose inverse is also smooth) to open subsets of $\mathbb R^d$. For simplicity, suppose that $M$ is embedded in $\mathbb R^N$ for some $N \ge d$, e.g., $S^d \subseteq \mathbb R^{d+1}$. Let $x \in M$ be a point. The tangent space at $x$, denoted $T_xM$, is the set of vectors tangent to $M$ at $x$. Since linear combinations of tangent vectors are again tangent, the tangent space $T_xM$ is a vector space. Tangent vectors can also be viewed as time derivatives of smooth curves: if $\epsilon > 0$ and $\gamma : (-\epsilon, \epsilon) \to M$ is a smooth curve with $\gamma(0) = x$, then $\frac{d\gamma}{dt}(0) \in T_xM$.

A Riemannian metric on $M$ is a choice of inner product $\langle \cdot, \cdot \rangle_x$ on $T_xM$ for each $x$ such that $\langle \cdot, \cdot \rangle_x$ varies smoothly with $x$. Naturally, $\|z\|_x := \sqrt{\langle z, z \rangle_x}$ defines a norm on $T_xM$. The length of a piecewise smooth curve $\gamma : [a, b] \to M$ is defined by $\operatorname{len}(\gamma) := \int_a^b \|\gamma'(t)\|_{\gamma(t)}\, dt$. Define $\operatorname{dist}_M(x, \xi) := \inf\{\operatorname{len}(\gamma) : \gamma \text{ is a piecewise smooth curve from } x \text{ to } \xi\}$, which is a metric on $M$ in the sense of metric spaces (see Sakai [36, Proposition 1.1]). For $x \in M$ and $r \in (0, \infty)$, the open metric ball centered at $x$ of radius $r$ is denoted $B_x(r, M) := \{\xi \in M : \operatorname{dist}_M(x, \xi) < r\}$.

A curve $\gamma : [a, b] \to M$ is a geodesic if $\gamma$ is locally distance minimizing and has constant speed, i.e., $\|\frac{d\gamma}{dt}(\tau)\|_{\gamma(\tau)}$ is constant. Now, suppose $x \in M$ and $v \in T_xM$ are such that there exists a geodesic $\gamma : [0, 1] \to M$ with $\gamma(0) = x$ and $\frac{d\gamma}{dt}(0) = v$. Define $\exp_x(v) := \gamma(1)$, the point reached by traveling along $\gamma$ up to time $1$. See Figure 2 for the case $M = S^2$. For a fixed $x \in M$, the function $\exp_x$, called the exponential map, can be defined on an open subset of $T_xM$ containing the origin. The Hopf-Rinow theorem [23, Ch. 8, Theorem 2.8] states that if $M$ is connected and complete with respect to the metric $\operatorname{dist}_M$, then $\exp_x$ can be defined on all of $T_xM$.
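For the round sphere $S^d \subseteq \mathbb R^{d+1}$, which is the running example of Figure 2 and the setting of Section 5, these objects have simple closed forms. The following numpy sketch is our illustration (function names are ours), assuming the standard embedding of $S^d$ in $\mathbb R^{d+1}$:

```python
import numpy as np

def dist_sphere(x, xi):
    """Geodesic (arc-length) distance on the unit sphere S^d embedded in R^(d+1)."""
    return float(np.arccos(np.clip(np.dot(x, xi), -1.0, 1.0)))

def exp_sphere(x, v):
    """Exponential map exp_x(v) on S^d for a tangent vector v (orthogonal to x)."""
    t = np.linalg.norm(v)                 # arc length traveled along the geodesic
    if t == 0.0:
        return x.copy()
    return np.cos(t) * x + np.sin(t) * (v / t)

# sanity check: dist(x, exp_x(v)) = ||v||_x whenever ||v|| < pi
rng = np.random.default_rng(0)
x = rng.normal(size=3); x /= np.linalg.norm(x)
v = rng.normal(size=3); v -= (v @ x) * x          # project onto the tangent space T_x S^2
v *= 1.5 / np.linalg.norm(v)                      # give v norm 1.5 < pi
assert abs(dist_sphere(x, exp_sphere(x, v)) - 1.5) < 1e-9
```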
3 The Manifold-Hilbert Kernel

Throughout the remainder of this work, we assume that $M$ is a complete, connected, and smooth Riemannian manifold of dimension $d$.

Definition 3.1. The manifold-Hilbert kernel $K^H_M : M \times M \to [0, \infty]$ is defined for $x, \xi \in M$ by $K^H_M(x, \xi) := \operatorname{dist}_M(x, \xi)^{-d}$ if $x \ne \xi$, and $K^H_M(x, x) := +\infty$.

Figure 2: An illustration of the exponential map $\exp_x$ for the manifold $M = S^2$, where $x$ is the north pole (blue) and $-x$ the south pole (orange). The logarithm map $\log_x$, discussed in Section 4.1, is a right inverse to $\exp_x$, i.e., $\exp_x \circ \log_x$ is the identity. Panel 1: the tangent space $T_xS^2$ visualized as $\mathbb R^2$; the dashed circle encloses the disc $B_0(\pi, \mathbb R^2)$ of radius $\pi$. Panel 2: the tangent space realized as the hyperplane tangent to the sphere at $x$. Panels 3-5: animation showing $\exp_x$ as a bijection from the open disc of radius $\pi$ to $S^2 \setminus \{-x\}$. The entire dashed circle is mapped to the south pole $-x$. Thus, $\log_x$ maps the south pole $-x$ to a point $z$ on the dashed circle.

Let $\lambda_M$ be the Riemann-Lebesgue volume measure of $M$. Integration with respect to this measure is denoted $\int_M f\, d\lambda_M$ for a function $f : M \to \mathbb R$. For details of the construction of $\lambda_M$, see Amann and Escher [1, Proposition 1.5]. When $M = \mathbb R^d$, $\lambda_M$ is the ordinary Lebesgue measure and $\int_{\mathbb R^d} f\, d\lambda_{\mathbb R^d}$ is the ordinary Lebesgue integral. In this case, we simply write $\lambda$ instead of $\lambda_{\mathbb R^d}$.

We now state our first main result, a manifold-theoretic extension of Devroye et al. [22, Theorem 1].

Theorem 3.2. Suppose that $X$ has a density $f_X$ with respect to $\lambda_M$ and that $Y$ is bounded. Let $P_{Y|X}$ be a conditional distribution of $Y$ given $X$ and $m_{Y|X}$ its conditional expectation. Let $\hat m_n(x) := \hat m(x \mid D_n, K^H_M)$. Then

1. at almost every $x \in M$ with $f_X(x) > 0$, we have $\hat m_n(x) \to m_{Y|X}(x)$ in probability;

2. $J_n := \int_M |\hat m_n(x) - m_{Y|X}(x)|\, f_X(x)\, d\lambda_M(x) \to 0$ in probability.

In words, the kernel smoothing regression estimate $\hat m_n$ based on the manifold-Hilbert kernel is consistent and interpolates the training data, provided $X$ has a density and $Y$ is bounded. As a consequence, following the same logic as in Devroye et al. [22], the associated classifier $\operatorname{sgn} \hat u_n$ has the interpolating-consistent property. Before proving Theorem 3.2, we first review key concepts of probability theory on Riemannian manifolds.

3.1 Probability on Riemannian manifolds

Let $\mathcal B_M$ be the Borel $\sigma$-algebra of $M$, i.e., the smallest $\sigma$-algebra containing all open subsets of $M$. We recall the definition of $M$-valued random variables, following Pennec [35, Definition 2]:

Definition 3.3. Let $(\Omega, P, \mathcal A)$ be a probability space with measure $P$ and $\sigma$-algebra $\mathcal A$. An $M$-valued random variable $X$ is a Borel-measurable function $\Omega \to M$, i.e., $X^{-1}(B) \in \mathcal A$ for all $B \in \mathcal B_M$.

Definition 3.4 (Density). A random variable $X$ taking values in $M$ has a density if there exists a nonnegative Borel-measurable function $f : M \to [0, \infty]$ such that for all Borel sets $B$ in $M$ we have $\Pr(X \in B) = \int_B f\, d\lambda_M$. The function $f$ is said to be a probability density function (PDF) of $X$.

Next, we recall the definition of conditional distributions, following Dudley [24, Ch. 10.2]:

Definition 3.5 (Conditional distribution, also known as a disintegration; see Chang and Pollard [18]). Let $(X, Y)$ be a random variable jointly distributed on $M \times \mathbb R$. Let $P_X(\cdot)$ be the probability measure corresponding to the marginal distribution of $X$.
A conditional distribution for $Y$ given $X$ is a collection of probability measures $P_{Y|X}(\cdot \mid x)$ on $\mathbb R$, indexed by $x \in M$, satisfying the following:

1. For all Borel sets $A \subseteq \mathbb R$, the function $M \ni x \mapsto P_{Y|X}(A \mid x) \in [0, 1]$ is Borel-measurable.

2. For all Borel sets $A \subseteq \mathbb R$ and $B \subseteq M$, $\Pr(Y \in A, X \in B) = \int_B P_{Y|X}(A \mid x)\, P_X(dx)$.

The conditional expectation is defined as $m_{Y|X}(x) := \int_{\mathbb R} y\, P_{Y|X}(dy \mid x)$; it is more often denoted $\mathbb E[Y \mid X = x]$, but our notation is more convenient for function composition and compatible with that of [22]. The existence of a conditional distribution for a joint distribution $(X, Y)$ is guaranteed by Dudley [24, Theorem 10.2.2]. When $(X, Y)$ has a joint density $f_{XY}$ and marginal density $f_X$, the above definition gives the classical formula $P_{Y|X}(A \mid x) = \int_A f_{XY}(x, y)/f_X(x)\, dy$ when $\infty > f_X(x) > 0$. See the first example in Dudley [24, Ch. 10.2].

3.2 Lebesgue points on manifolds

Devroye et al. [22] proved Theorem 3.2 for $M = \mathbb R^d$ and, moreover, showed that part 1 holds at the so-called Lebesgue points, whose definition we now recall.

Definition 3.6. Let $f : M \to \mathbb R$ be an absolutely integrable function and $x \in M$. We say that $x$ is a Lebesgue point of $f$ if $f(x) = \lim_{r \to 0} \frac{1}{\lambda_M(B_x(r, M))} \int_{B_x(r, M)} f\, d\lambda_M$.

The following result states that, for an integrable function, almost all points are its Lebesgue points. For the proof, see Fukuoka [27, Remark 2.4].

Theorem 3.7 (Lebesgue differentiation). Let $f : M \to \mathbb R$ be an absolutely integrable function. Then there exists a set $A \subseteq M$ such that $\lambda_M(A) = 0$ and every $x \in M \setminus A$ is a Lebesgue point of $f$.

Next, for the reader's convenience, we restate Devroye et al. [22, Theorem 1], emphasizing the connection to Lebesgue points. The result will be used in our proof of Theorem 3.2 in the next section.

Theorem 3.8 (Devroye et al. [22]). Let $M = \mathbb R^d$ be the flat Euclidean space. Then Theorem 3.2 holds. Moreover, part 1 holds at every $x$ that is a Lebesgue point of both $f_X$ and $m_{Y|X} f_X$.

4 Proof of Theorem 3.2

The focal point of the first subsection is Lemma 4.1, which shows the Borel measurability of extensions of the so-called Riemannian logarithm. The second subsection contains two key results regarding densities of $M$-valued random variables transformed by the Riemannian logarithm. The final subsection proves Theorem 3.2 by leveraging results from the preceding two subsections.

4.1 The Riemannian logarithm

Throughout, $x$ is an arbitrary point of $M$. Let $U_xM = \{v \in T_xM : \|v\|_x = 1\} \subseteq T_xM$ denote the set of unit tangent vectors. Define a function $\tau_x : U_xM \to (0, \infty]$ (positivity of $\tau_x$ is asserted in Sakai [36, eq. (4.1)]) as follows:
$$\tau_x(u) := \sup\{t > 0 : t = \operatorname{dist}_M(x, \exp_x(tu))\}.$$
The tangent cut locus is the set $\widetilde C_x \subseteq T_xM$ defined by $\widetilde C_x := \{\tau_x(u)\, u : u \in U_xM,\ \tau_x(u) < \infty\}$. Note that it is possible that $\tau_x(u) = \infty$ for all $u \in U_xM$, in which case $\widetilde C_x$ is empty. The cut locus is the set $C_x := \exp_x(\widetilde C_x) \subseteq M$. The tangent interior set is $\widetilde I_x := \{tu : 0 \le t < \tau_x(u),\ u \in U_xM\}$ and the interior set is $I_x := \exp_x(\widetilde I_x)$. Finally, define $\widetilde D_x := \widetilde I_x \cup \widetilde C_x$. Note that for each $z = tu \in \widetilde I_x$, we have
$$\|z\|_x = t = \operatorname{dist}_M(x, \exp_x(tu)) = \operatorname{dist}_M(x, \exp_x(z)). \quad (6)$$
Consider the example $M = S^2$ as in Figure 2. Then $\tau_x(u) = \pi$ for all $u \in U_xM$. Thus, the tangent interior set is $\widetilde I_x = B_0(\pi, \mathbb R^2)$, the open disc of radius $\pi$ centered at the origin. When restricted to $\widetilde I_x$, the exponential map $\exp_x|_{\widetilde I_x} : \widetilde I_x \to I_x$ is a diffeomorphism. Its functional inverse, denoted by $\log_x|_{I_x}$, is called the Riemannian logarithm [9, 45]. In previous works, $\log_x|_{I_x}$ is only defined from $I_x$ to $\widetilde I_x$.
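On the sphere, both the logarithm on $I_x$ and a measurable extension to all of $S^d$ can be written down explicitly. The following numpy sketch is our illustration (the function name and tolerances are ours) and anticipates the extension result stated next:

```python
import numpy as np

def log_sphere(x, xi):
    """A right inverse of exp_x on S^d (cf. Lemma 4.1 below).

    For xi outside the cut locus {-x} this is the usual Riemannian logarithm
    log_x restricted to I_x; the antipode -x is sent to one fixed point of norm pi
    on the tangent cut locus, mimicking the measurable-selection step of the proof.
    """
    c = np.clip(np.dot(x, xi), -1.0, 1.0)
    theta = np.arccos(c)                         # theta = dist(x, xi), so ||log_x(xi)|| = theta
    w = xi - c * x                               # component of xi orthogonal to x
    nw = np.linalg.norm(w)
    if nw > 1e-12:                               # generic case: xi distinct from x and -x
        return theta * w / nw
    if theta < 1.0:                              # xi = x
        return np.zeros_like(x)
    e = np.eye(x.size)[np.argmin(np.abs(x))]     # basis vector least aligned with x
    u = e - (e @ x) * x                          # an arbitrary but fixed unit tangent direction
    return np.pi * u / np.linalg.norm(u)         # a point on the tangent cut locus

# exp_x(log_x(xi)) = xi away from -x, and ||log_x(xi)|| = dist(x, xi) for every xi
```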
The next result shows that the domain of $\log_x|_{I_x} : I_x \to \widetilde I_x$ can be extended to all of $M$ while remaining Borel-measurable.

Lemma 4.1. For all $x \in M$, there exists a Borel-measurable map $\log_x : M \to T_xM$ such that $\log_x(M) \subseteq \widetilde D_x$ and $\exp_x \circ \log_x$ is the identity on $M$. Furthermore, for all $x, \xi \in M$, we have $\operatorname{dist}_M(x, \xi) = \|\log_x(\xi)\|_x$.

Proof sketch. The full proof of the lemma is provided in Section A.2 of the Supplemental Materials; an anonymous reviewer has provided a shorter, alternative proof (see https://openreview.net/forum?id=zqQKGaNI4lp&noteId=VYOugBMOil). Below, we illustrate the idea of the proof using the example $M = S^2$ as in Figure 2. Let $x \in S^2$ be the north pole (the blue point). The tangent cut locus $\widetilde C_x$ is the dashed circle in the left panel of Figure 2. The exponential map $\exp_x$ is one-to-one on $\widetilde D_x$ except on the dashed circle, all of which gets mapped to $-x$, the south pole (the orange point). A consequence of the Kuratowski-Ryll-Nardzewski measurable selection theorem (see [14, Theorem 6.9.3]) is that $\log_x$ can be extended to a Borel-measurable right inverse of $\exp_x$ by selecting a point $z$ on $\widetilde C_x$ and setting $\log_x(-x) := z$. Thus, $\log_x : M \to T_xM$ is Borel-measurable.

Now, recall that $T_xM$ is equipped with the inner product $\langle \cdot, \cdot \rangle_x$, i.e., the Riemannian metric. Below, for each $x \in M$ we choose an orthonormal basis of $T_xM$ with respect to $\langle \cdot, \cdot \rangle_x$. Then $T_xM$ is isomorphic as an inner product space to $\mathbb R^d$ with the usual dot product. Our next two results are change-of-variables formulas for computing the densities and conditional distributions of $M$-valued random variables after the $\log_x$ transform. Recall that $\lambda_M$ is the Riemann-Lebesgue measure on $M$ and $\lambda$ is the ordinary Lebesgue measure on $\mathbb R^d = T_xM$.

Proposition 4.2. Let $x \in M$ be fixed. There exists a Borel-measurable function $\nu_x : M \to \mathbb R$ with the following properties: (i) Let $X$ be a random variable on $M$ with density $f_X$ and let $Z := \log_x(X)$. Then $Z$ is a random variable on $T_xM$ with density $f_Z(z) := f_X(\exp_x(z))\, \nu_x(\exp_x(z))$. (ii) Let $f : M \to \mathbb R$ be an absolutely integrable function such that $x$ is a Lebesgue point of $f$. Define $h : T_xM \to \mathbb R$ by $h(z) := f(\exp_x(z))\, \nu_x(\exp_x(z))$. Then $0 \in T_xM$ is a Lebesgue point of $h$.

Proposition 4.3. Let $(X, Y)$ have a joint distribution on $M \times \mathbb R$ such that the marginal of $X$ has a density $f_X$ on $M$. Let $P_{Y|X}(\cdot \mid \cdot)$ be a conditional distribution for $Y$ given $X$. Let $x \in M$. Define $Z := \log_x(X)$ and consider the joint distribution of $(Z, Y)$ on $T_xM \times \mathbb R$. Then $P_{Y|Z}(\cdot \mid \cdot) := P_{Y|X}(\cdot \mid \exp_x(\cdot))$ is a conditional distribution for $Y$ given $Z$. Consequently, $m_{Y|X} \circ \exp_x = m_{Y|Z}$.

The above propositions are straightforward manifold-theoretic extensions of well-known results on Euclidean spaces. For completeness, the full proofs are in Supplemental Materials Section A.4. An anonymous reviewer brought to our attention that Proposition 4.2 is a consequence of a well-known formula from geometric measure theory, called the area formula [2, pp. 44-45].

4.2 Proof of Theorem 3.2

Fix $x \in M$ such that $x$ is a Lebesgue point of both $f_X$ and $m_{Y|X} f_X$. Note that by Theorem 3.7, almost every $x \in M$ has this property. Next, let $Z = \log_x(X)$ and let $f_Z$ be as in Proposition 4.2(i). Then

1. $f_Z = (f_X \circ \exp_x) \cdot (\nu_x \circ \exp_x)$, and

2. $(m_{Y|X} \circ \exp_x)\, f_Z = (m_{Y|X} \circ \exp_x) \cdot (f_X \circ \exp_x) \cdot (\nu_x \circ \exp_x)$.

Now, Proposition 4.2(ii) implies that $0$ is a Lebesgue point of both $f_Z$ and $(m_{Y|X} \circ \exp_x)\, f_Z$. Furthermore, by Proposition 4.3, we have $m_{Y|X} \circ \exp_x = m_{Y|Z}$. Thus, $0$ is a Lebesgue point of both $f_Z$ and $m_{Y|Z} f_Z$. Now, let $D_n := \{(X_i, Y_i)\}_{i \in [n]}$. Define $Z_i := \log_x(X_i)$, which are i.i.d. copies of the random variable $Z := \log_x(X)$, and let $D_n' := \{(Z_i, Y_i)\}_{i \in [n]}$.
Then we have
$$\hat m(x \mid D_n, K^H_M) \overset{(a)}{=} \frac{\sum_{i=1}^n Y_i \operatorname{dist}_M(x, X_i)^{-d}}{\sum_{j=1}^n \operatorname{dist}_M(x, X_j)^{-d}} \overset{(b)}{=} \frac{\sum_{i=1}^n Y_i \|Z_i\|_x^{-d}}{\sum_{j=1}^n \|Z_j\|_x^{-d}} \overset{(c)}{=} \frac{\sum_{i=1}^n Y_i \operatorname{dist}_{\mathbb R^d}(0, Z_i)^{-d}}{\sum_{j=1}^n \operatorname{dist}_{\mathbb R^d}(0, Z_j)^{-d}} \overset{(d)}{=} \hat m(0 \mid D_n', K^H_{\mathbb R^d}),$$
where the equalities marked (a) and (d) follow from Equation (5), (b) from Lemma 4.1, and (c) from the fact that the inner product space $T_xM$ with $\langle \cdot, \cdot \rangle_x$ is isomorphic to $\mathbb R^d$ with the usual dot product. By Theorem 3.8, we have $\hat m(0 \mid D_n', K^H_{\mathbb R^d}) \to m_{Y|Z}(0)$ in probability. In other words, for all $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr\{|\hat m(0 \mid D_n', K^H_{\mathbb R^d}) - m_{Y|Z}(0)| > \epsilon\} = 0.$$
By Proposition 4.3, we have $m_{Y|Z}(0) = m_{Y|X}(\exp_x(0)) = m_{Y|X}(x)$. Therefore,
$$\{|\hat m(0 \mid D_n', K^H_{\mathbb R^d}) - m_{Y|Z}(0)| > \epsilon\} = \{|\hat m(x \mid D_n, K^H_M) - m_{Y|X}(x)| > \epsilon\}$$
as events. Thus, $\hat m(x \mid D_n, K^H_M) \to m_{Y|X}(x)$ in probability, proving part 1 of Theorem 3.2. As noted in Devroye et al. [22, Section 2], part 2 of Theorem 3.2 is an immediate consequence of part 1.

5 Application to the d-Sphere

The $d$-dimensional round sphere is $S^d := \{x \in \mathbb R^{d+1} : x_1^2 + \cdots + x_{d+1}^2 = 1\}$. Here, "round" means that $S^d$ is equipped with the arc-length metric:
$$\operatorname{dist}_{S^d}(x, z) = \angle(x, z) = \cos^{-1}(x^\top z) \in [0, \pi]. \quad (7)$$
Let $S$ be a set and $\sigma : M \to S$ a function. The partition induced by $\sigma$ is defined by $\{\sigma^{-1}(s) : s \in \operatorname{Range}(\sigma)\}$. For example, when $M = S^d$ and $W \in \mathbb R^{(d+1) \times h}$, the function $\sigma_W : S^d \to \{\pm 1\}^h$ defined by $\sigma_W(x) = \operatorname{sgn}(W^\top x)$ induces a hyperplane arrangement partition. Let $\mathbb N = \{1, 2, \dots\}$ and $\mathbb N_0 = \mathbb N \cup \{0\}$ denote the positive and non-negative integers, respectively.

Definition 5.1 (Random hyperplane arrangement partition). Let $d \in \mathbb N$ and $M = S^d$. Let $q < 0$ be a negative number, and let $H$ be a random variable with probability mass function $p_H : \mathbb N_0 \to [0, 1]$ such that $p_H(h) > 0$ for all $h$. Define the following weighted random partition $\mathsf{ha} := (\Theta, P, \alpha)$:

1. The parameter space $\Theta = \bigsqcup_{h=0}^\infty \mathbb R^{(d+1) \times h}$ is the disjoint union of all $(d+1) \times h$ matrices. Elements of $\Theta$ are matrices $\theta = W \in \mathbb R^{(d+1) \times h}$, where the number of columns $h \in \{0, 1, 2, \dots\}$ varies. By convention, if $h = 0$, the partition $\mathcal P_\theta = \mathcal P_W$ is the trivial partition $\{S^d\}$. If $h > 0$, $\mathcal P_W$ is the partition induced by $x \mapsto \operatorname{sgn}(W^\top x)$.

2. The probability measure $P$ is constructed by first sampling $h \sim p_H$ and then sampling the entries of $W \in \mathbb R^{(d+1) \times h}$ i.i.d. according to $\mathrm{Gaussian}(0, 1)$.

3. For $\theta \in \Theta$ with $h$ columns, define $\alpha(\theta) := \pi^q\, p_H(h)^{-1} (-1)^h \binom{q}{h}$, where $\binom{q}{h} := \frac{1}{h!} \prod_{j=0}^{h-1} (q - j)$. Note that $(-1)^h \binom{q}{h} = \frac{1}{h!} \prod_{j=0}^{h-1} (-q + j) > 0$ when $q < 0$.

Theorem 5.2. Let $\mathsf{ha} = (\Theta, P, \alpha)$ be as in Definition 5.1. Then
$$K^{\mathsf{ha}}_{S^d}(x, z) = \begin{cases} \angle(x, z)^q & \text{if } \angle(x, z) \ne 0, \\ +\infty & \text{otherwise.} \end{cases}$$
When $q = -d$, we have $K^{\mathsf{ha}}_{S^d} = K^H_{S^d}$, where the right-hand side is the manifold-Hilbert kernel.

Proof of Theorem 5.2. Before proceeding, we record the following useful lemma:

Lemma 5.3. Let $\mathsf{rp} = (\Theta, P, \alpha)$ be a WRP and let $H$ be a random variable. Let $\theta \sim P$. Suppose that for all $x, z \in M$, the random variables $\alpha(\theta)$ and $\mathbb I\{x \in \mathcal P_\theta[z]\}$ are conditionally independent given $H$. Then
$$K^{\mathsf{rp}}_M(x, z) = \mathbb E_H\big[\bar\alpha(H)\, \mathbb E_{\theta \sim P}[\mathbb I\{x \in \mathcal P_\theta[z]\} \mid H]\big],$$
where $\bar\alpha(h) := \mathbb E_{\theta \sim P}[\alpha(\theta) \mid H = h]$ for a realization $h$ of $H$. The lemma follows immediately from the definition of $K^{\mathsf{rp}}_M(x, z)$ in Equation (3), the tower property of conditional expectation, and the conditional independence assumption.

Now, we proceed with the proof of Theorem 5.2. Let $\phi := \angle(x, z)/\pi$, and let $H \sim p_H$ and $\theta \sim P$ be the random variables in Definition 5.1.
Note that, by construction, the following condition is satisfied: for all $x, z \in M$, the random variables $\alpha(\theta)$ and $\mathbb I\{x \in \mathcal P_\theta[z]\}$ are conditionally independent given $H$. In fact, $\alpha(\theta) = \pi^q\, p_H(h)^{-1} (-1)^h \binom{q}{h}$ is constant given $H = h$. Hence, applying Lemma 5.3, we have
$$K^{\mathsf{ha}}_{S^d}(x, z) = \mathbb E_H\big[\bar\alpha(H)\, \mathbb E_{\theta \sim P}[\mathbb I\{x \in \mathcal P_\theta[z]\} \mid H]\big] = \sum_{h=0}^\infty \pi^q (-1)^h \binom{q}{h}\, \mathbb E_{\theta \sim P}[\mathbb I\{x \in \mathcal P_\theta[z]\} \mid H = h] = \sum_{h=0}^\infty \pi^q (-1)^h \binom{q}{h} \Pr\{x \in \mathcal P_\theta[z] \mid H = h\}.$$
Next, we claim that $\Pr\{x \in \mathcal P_\theta[z] \mid H = h\} = (1 - \phi)^h$. When $h = 0$, the event $x \in \mathcal P_\theta[z]$ always holds since $\mathcal P_\theta = \{S^d\}$ is the trivial partition, so $\Pr\{x \in \mathcal P_\theta[z] \mid H = h\} = 1 = (1 - \phi)^0$. When $h > 0$, we recall an identity involving the angle:

Lemma 5.4 (Charikar [19]). Let $x, z \in S^d$, and let $w \in \mathbb R^{d+1}$ be a random vector whose entries are sampled i.i.d. according to $\mathrm{Gaussian}(0, 1)$. Then $\Pr\{\operatorname{sgn}(w^\top x) = \operatorname{sgn}(w^\top z)\} = 1 - \angle(x, z)/\pi$.

Let $W = [w_1, \dots, w_h]$ be as in Definition 5.1, where $w_j$ denotes the $j$-th column of $W$. Then, by construction, $w_j$ is distributed identically to $w$ in Lemma 5.4. Furthermore, $w_j$ and $w_{j'}$ are independent for $j, j' \in [h]$ with $j \ne j'$. Thus, the claim follows from
$$\Pr\{x \in \mathcal P_\theta[z] \mid H = h\} \overset{(a)}{=} \Pr\{\operatorname{sgn}(W^\top x) = \operatorname{sgn}(W^\top z) \mid H = h\} \overset{(b)}{=} \prod_{j=1}^h \Pr\{\operatorname{sgn}(w_j^\top x) = \operatorname{sgn}(w_j^\top z)\} \overset{(c)}{=} \prod_{j=1}^h (1 - \phi) = (1 - \phi)^h,$$
where equality (a) follows from Definition 5.1, (b) from $W \in \mathbb R^{(d+1) \times h}$ having i.i.d. standard Gaussian entries given $H = h$, and (c) from Lemma 5.4. Putting it all together, we have
$$K^{\mathsf{ha}}_{S^d}(x, z) = \sum_{h=0}^\infty \pi^q (-1)^h \binom{q}{h} (1 - \phi)^h = \pi^q \sum_{h=0}^\infty \binom{q}{h} (\phi - 1)^h = \pi^q \phi^q = \angle(x, z)^q.$$
For the last step, we used the fact that for all $q \in \mathbb R$ the binomial series $(1 + t)^q = \sum_{h=0}^\infty \binom{q}{h} t^h$ converges absolutely for $|t| < 1$ (the case $\phi \in (0, 1]$) and, since $q < 0$, diverges to $+\infty$ at $t = -1$ (the case $\phi = 0$).

Corollary 5.5. Let $q := -d$ and let $K^{\mathsf{ha}}_{S^d}$ be as in Theorem 5.2. The infinite-ensemble classifier $\operatorname{sgn}(\hat u(\cdot \mid D_n, K^{\mathsf{ha}}_{S^d}))$ (see Equation (4) for the definition) has the interpolating-consistent property.

Proof. As observed in Devroye et al. [22, Section 7], for an arbitrary kernel $K$, the $L^1$-consistency of $\hat m(\cdot \mid D_n, K)$ for regression implies the consistency for classification of $\operatorname{sgn}(\hat u(\cdot \mid D_n, K))$. Furthermore, that $\hat m(\cdot \mid D_n, K)$ is interpolating for regression implies that $\operatorname{sgn}(\hat u(\cdot \mid D_n, K))$ is interpolating for classification. While the argument there is presented for the $\mathbb R^d$ case, it holds in the more general manifold case mutatis mutandis. Thus, by Theorem 3.2, $\operatorname{sgn}(\hat u(\cdot \mid D_n, K^H_{S^d}))$ is consistent for classification, i.e., Equation (1) holds. It is also interpolating since $\hat m(\cdot \mid D_n, K^H_{S^d})$ is interpolating. By Theorem 5.2, we have $K^{\mathsf{ha}}_{S^d} = K^H_{S^d}$ when $q = -d$. Thus $\operatorname{sgn}(\hat u(\cdot \mid D_n, K^{\mathsf{ha}}_{S^d}))$ is an ensemble method having the interpolating-consistent property.
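As a numerical sanity check on Theorem 5.2, the series in the proof can be evaluated on $S^2$ by estimating $\Pr\{x \in \mathcal P_\theta[z] \mid H = h\}$ with random Gaussian hyperplane arrangements and comparing the result with $\angle(x, z)^q$. The following sketch is our illustration and not part of the paper's development; the truncation level and Monte Carlo sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                      # the sphere S^2 sits in R^3
q = -d                     # q = -d recovers the manifold-Hilbert kernel (Theorem 5.2)

# two points on the sphere; phi = angle(x, z) / pi as in the proof
x = rng.normal(size=d + 1); x /= np.linalg.norm(x)
z = rng.normal(size=d + 1); z /= np.linalg.norm(z)
angle = np.arccos(np.clip(x @ z, -1.0, 1.0))
phi = angle / np.pi

def same_cell_prob(h, n_mc=20000):
    """Monte Carlo estimate of Pr{x in P_theta[z] | H = h}; close to (1 - phi)^h by Lemma 5.4."""
    if h == 0:
        return 1.0                                   # trivial partition {S^d}
    W = rng.normal(size=(n_mc, d + 1, h))            # i.i.d. Gaussian hyperplane arrangements
    sx = np.sign(np.einsum("nij,i->nj", W, x))
    sz = np.sign(np.einsum("nij,i->nj", W, z))
    return float(np.all(sx == sz, axis=1).mean())

# truncated series: sum_h  pi^q * (-1)^h * binom(q, h) * Pr{same cell | H = h}
kernel_est, coeff = 0.0, 1.0                         # coeff = (-1)^h binom(q, h), built recursively
for h in range(40):
    kernel_est += np.pi**q * coeff * same_cell_prob(h)
    coeff *= (-q + h) / (h + 1)

print("weighted random partition kernel (estimate):", kernel_est)
print("angle(x, z)**q                              :", angle**q)
```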
6 Discussion

We have shown that using the manifold-Hilbert kernel in kernel smoothing regression, also known as Nadaraya-Watson regression, results in a consistent estimator that interpolates the training data on a Riemannian manifold $M$. We proposed weighted random partition kernels, a generalization of the unweighted notion of Davies and Ghahramani [21], which provides a framework for analyzing ensemble methods such as random forests via kernels. When $M = S^d$ is the sphere, we showed that the manifold-Hilbert kernel is a weighted random partition kernel whose random partitions are induced by random hyperplane arrangements. This demonstrates an ensemble method that has the interpolating-consistent property.

One limitation of this work is the lack of a rate of convergence for the ensemble method. The analogous result for Nadaraya-Watson regression has been obtained by Belkin et al. [8]; however, it is not clear whether the kernels used in [8] are weighted random partition kernels. Resolving this is an interesting future direction. Similar to PERT [20], another limitation of this work is that the base classifiers in the ensemble are data-independent. Ensemble methods in this line of work (including ours) are easier to analyze than the data-dependent ensemble methods used in practice; see Biau and Scornet [12, 13] for an in-depth discussion. We believe our work offers one theoretical basis towards understanding generalization in the interpolation regime for ensembles of histogram classifiers over data-dependent partitions, e.g., decision trees à la CART [17].

Acknowledgements

The authors were supported in part by the National Science Foundation under awards 1838179 and 2008074. The authors would like to thank Vidya Muthukumar for helpful discussions, as well as an anonymous reviewer for bringing to our attention facts from geometric measure theory that provided context for the ad-hoc technical result Proposition 4.2.

References

[1] Herbert Amann and Joachim Escher. Analysis III. Springer, 2009, pp. 389-455.
[2] Luigi Ambrosio and Paolo Tilli. Topics on Analysis in Metric Spaces. Vol. 25. Oxford University Press, 2004.
[3] Rushil Anirudh, Vinay Venkataraman, Karthikeyan Natesan Ramamurthy, and Pavan Turaga. "A Riemannian framework for statistical analysis of topological persistence diagrams". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016, pp. 68-76.
[4] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. "Benign overfitting in linear regression". In: Proceedings of the National Academy of Sciences 117.48 (2020), pp. 30063-30070.
[5] Peter L. Bartlett and Mikhail Traskin. "AdaBoost is consistent". In: Journal of Machine Learning Research 8.Oct (2007), pp. 2347-2368.
[6] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. "Reconciling modern machine-learning practice and the classical bias-variance trade-off". In: Proceedings of the National Academy of Sciences 116.32 (2019), pp. 15849-15854.
[7] Mikhail Belkin, Daniel J. Hsu, and Partha Mitra. "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate". In: Advances in Neural Information Processing Systems 31 (2018), pp. 2300-2311.
[8] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. "Does data interpolation contradict statistical optimality?" In: The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1611-1619.
[9] Thomas Bendokat, Ralf Zimmermann, and P.-A. Absil. "A Grassmann manifold handbook: Basic geometry and computational aspects". In: arXiv preprint arXiv:2011.13699 (2020).
[10] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
[11] Gérard Biau, Luc Devroye, and Gábor Lugosi. "Consistency of random forests and other averaging classifiers". In: Journal of Machine Learning Research 9.9 (2008).
[12] Gérard Biau and Erwan Scornet. "A random forest guided tour". In: TEST 25.2 (2016), pp. 197-227.
[13] Gérard Biau and Erwan Scornet. "Rejoinder on: A random forest guided tour". In: TEST 25.2 (2016), pp. 264-268.
[14] Vladimir Igorevich Bogachev and Maria Aparecida Soares Ruas. Measure Theory. Vol. 1. Springer, 2007.
[15] Leo Breiman. "Consistency for a simple model of random forests". 2004.
[16] Leo Breiman. Some Infinity Theory for Predictor Ensembles. Technical Report. Department of Statistics, 2000.
[17] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, 1984.
[18] Joseph T. Chang and David Pollard. "Conditioning as disintegration". In: Statistica Neerlandica 51.3 (1997), pp. 287-317.
[19] Moses S. Charikar. "Similarity estimation techniques from rounding algorithms". In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. 2002, pp. 380-388.
[20] Adele Cutler and Guohua Zhao. "PERT - perfect random tree ensembles". In: Computing Science and Statistics 33 (2001), pp. 490-497.
[21] Alex Davies and Zoubin Ghahramani. "The random forest kernel and other kernels for big data from random partitions". In: arXiv preprint arXiv:1402.4293 (2014).
[22] Luc Devroye, László Györfi, and Adam Krzyżak. "The Hilbert kernel regression estimate". In: Journal of Multivariate Analysis 65.2 (1998), pp. 209-227.
[23] Manfredo Perdigão do Carmo. Riemannian Geometry. Vol. 6. Springer, 1992.
[24] Richard M. Dudley. Real Analysis and Probability. CRC Press, 2018.
[25] Aasa Feragen and Søren Hauberg. "Open Problem: Kernel methods on manifolds and metric spaces. What is the probability of a positive definite geodesic exponential kernel?" In: Conference on Learning Theory. PMLR, 2016, pp. 1647-1650.
[26] Yoav Freund and Robert E. Schapire. Boosting: Foundations and Algorithms. MIT Press, 2012.
[27] Ryuichi Fukuoka. "Mollifier smoothing of tensor fields on differentiable manifolds and applications to Riemannian geometry". In: arXiv preprint math/0608230 (2006).
[28] Pierre Geurts, Damien Ernst, and Louis Wehenkel. "Extremely randomized trees". In: Machine Learning 63.1 (2006), pp. 3-42.
[29] James J. Hebda. "Parallel translation of curvature along geodesics". In: Transactions of the American Mathematical Society 299.2 (1987), pp. 559-572.
[30] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. "Kernel methods on Riemannian manifolds with Gaussian RBF kernels". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37.12 (2015), pp. 2464-2477.
[31] Tam Le and Makoto Yamada. "Persistence Fisher kernel: A Riemannian manifold kernel for persistence diagrams". In: Advances in Neural Information Processing Systems 31 (2018).
[32] Jean-Michel Loubes and Bruno Pelletier. "A kernel-based classifier on a Riemannian manifold". In: Statistics & Decisions 26.1 (2008), pp. 35-51.
[33] Kanti V. Mardia and Peter E. Jupp. Directional Statistics. Vol. 2. Wiley Online Library, 2000.
[34] Kanti V. Mardia, Henrik Wiechers, Benjamin Eltzner, and Stephan F. Huckemann. "Principal component analysis and clustering on manifolds". In: Journal of Multivariate Analysis 188 (2022), p. 104862.
[35] Xavier Pennec. "Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements". In: Journal of Mathematical Imaging and Vision 25.1 (2006), pp. 127-154.
[36] Takashi Sakai. Riemannian Geometry. Vol. 149. American Mathematical Society, 1996.
[37] Erwan Scornet. "Random forests and kernel methods". In: IEEE Transactions on Information Theory 62.3 (2016), pp. 1485-1500.
[38] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. "Consistency of random forests". In: The Annals of Statistics 43.4 (2015), pp. 1716-1741.
[39] Donald Shepard. "A two-dimensional interpolation function for irregularly-spaced data". In: Proceedings of the 1968 23rd ACM National Conference. 1968, pp. 517-524.
[40] P. Thomas Fletcher. "Geodesic regression and the theory of least squares on Riemannian manifolds".
In: International Journal of Computer Vision 105.2 (2013), pp. 171-185.
[41] Larry Wasserman. "Topological data analysis". In: Annual Review of Statistics and Its Application 5 (2018), pp. 501-532.
[42] Abraham J. Wyner, Matthew Olson, Justin Bleich, and David Mease. "Explaining the success of AdaBoost and random forests as interpolating classifiers". In: The Journal of Machine Learning Research 18.1 (2017), pp. 1558-1590.
[43] Zhigang Yao and Zhenyue Zhang. "Principal boundary on Riemannian manifolds". In: Journal of the American Statistical Association 115.531 (2020), pp. 1435-1448.
[44] Zhenyue Zhang and Hongyuan Zha. "Principal manifolds and nonlinear dimensionality reduction via tangent space alignment". In: SIAM Journal on Scientific Computing 26.1 (2004), pp. 313-338.
[45] Ralf Zimmermann. "A matrix-algebraic algorithm for the Riemannian logarithm on the Stiefel manifold under the canonical metric". In: SIAM Journal on Matrix Analysis and Applications 38.2 (2017), pp. 322-342.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6.
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] This is the opening sentence of our first main technical section, Section 3.
(b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]