Journal of Machine Learning Research 26 (2025) 1-99    Submitted 3/24; Revised 8/25; Published 8/25

Geometry and Stability of Supervised Learning Problems

Facundo Mémoli  facundo.memoli@gmail.com
Department of Mathematics, Rutgers University, Piscataway, NJ 08854

Brantley Vose  vose.5@osu.edu
Department of Mathematics, The Ohio State University, Columbus, OH 43210

Robert C. Williamson  Bob.Williamson@uni-tuebingen.de
University of Tübingen and Tübingen AI Center, 72076 Tübingen, Germany

Editor: Pierre Alquier

Abstract

We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This distance, inspired by optimal transport, facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem's predictors, and one that is more sensitive to the contours of a problem's risk landscape.

Keywords: supervised learning, stability, metric geometry, optimal transport, risk landscape

1. Introduction

In machine learning, even before beginning work on a problem, we are often forced to accept discrepancies between the problem we want to solve and the problem we actually get to work with. We may put up with noise or bias in the data collection process which distorts our view of the true distribution of observations. We may replace a loss function with a surrogate loss whose computational cost or optimization properties are preferable.
We may have access to only a small sample of observations. Such compromises are a necessary reality. This brings us to our primary motivating questions.

Question 1: How much can such a compromise change our problem and its descriptive features?

Question 2: How much effect can multiple compromises have in conjunction? Can we guarantee that a sequence of small changes will not have drastic effects on the problem to be solved?

© 2025 Facundo Mémoli, Brantley Vose, Robert C. Williamson. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0322.html.

1.1 Overview of our approach.

In this paper, we provide a comprehensive framework with which to answer such questions. While many frameworks exist to answer Question 1, these methods are concerned with quantifying changes to one or two aspects of a problem at a time, limiting their ability to answer Question 2. In real-world problems, multiple simultaneous corruptions and substitutions are unavoidable. Our framework is broad enough to handle simultaneous changes to many aspects of a problem.

To begin, we give a precise definition of supervised learning problem and define a notion of distance, dubbed the Risk distance and denoted d_R, by which to compare pairs of problems. This gives rise to the (pseudo)metric space of supervised learning problems. The Risk distance lets us make geometric sense of Questions 1 and 2; we can measure how much a compromise affects a problem by seeing how far the problem moves under the Risk distance.

To actually define the Risk distance d_R, we draw on the wisdom of metric geometry. In 1975, Edwards constructed a metric on the set of isomorphism classes of compact metric spaces which came to be known as the Gromov-Hausdorff distance (Edwards, 1975; Gromov, 1981).
Similarly, the Gromov-Wasserstein distance, introduced by Mémoli (2007, 2011), provides an optimal-transport-based metric on the collection of isomorphism classes of metric spaces equipped with probability measures. The Gromov-Hausdorff and Gromov-Wasserstein distances have become integral to the theory of metric geometry by facilitating a geometric understanding of spaces in which the points themselves are spaces. Inspired by this tradition, we craft the Risk distance in the image of the Gromov-Wasserstein distance. This sets up the following analogy between supervised learning and metric geometry.

    Supervised Learning          Metric Geometry
    Problem                 ↔    Metric measure space
    Loss function           ↔    Metric
    Risk distance           ↔    Gromov-Wasserstein distance

1.2 Previous approaches.

Question 1 has previously been explored for various specific kinds of changes to a problem. In our work, a supervised learning problem (or just problem, for short) is modeled as a 5-tuple P = (X, Y, η, ℓ, H) where X and Y are the input space and response space respectively, η is the joint law: a probability measure on X × Y, ℓ is the loss function, and H is the predictor set: a collection of functions h : X → Y.

Joint law η. A common assumption in machine learning is that the data is drawn according to an unknown underlying probability measure. Concepts such as noise and bias in data collection, or data shift phenomena such as covariate shift, label shift, or concept drift (Huyen, 2022, Ch 8), can be described as changes to this underlying measure. The effects of various kinds of noise (Zhu and Wu, 2004; Natarajan et al., 2013; Menon et al., 2015, 2018; Rooyen and Williamson, 2018; Iacovissi et al., 2023) and data shifts (Shimodaira, 2000) on supervised learning problems are a longstanding area of research.
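To fix ideas, the 5-tuple model can be made concrete on finite spaces. The following is a minimal sketch, not taken from the paper: the names `risk` and `constrained_bayes_risk` are ours, and we assume the standard definitions of the risk of a predictor as its expected loss under η and of the constrained Bayes risk as the infimum of risk over H (the paper's formal definitions appear in Section 3).

```python
# A toy finite instance of a supervised learning problem P = (X, Y, eta, loss, H).
# Representation choices are ours, for illustration only: eta is a dict
# mapping pairs (x, y) to probabilities, and H is a list of predictors h : X -> Y.

def risk(h, eta, loss):
    """Expected loss of predictor h under the joint law eta."""
    return sum(p * loss(h(x), y) for (x, y), p in eta.items())

def constrained_bayes_risk(H, eta, loss):
    """Best achievable expected loss over the predictor set H."""
    return min(risk(h, eta, loss) for h in H)

# Binary classification with y = x, corrupted by 10% label noise.
eta = {(0, 0): 0.45, (0, 1): 0.05, (1, 1): 0.45, (1, 0): 0.05}
zero_one = lambda y_hat, y: float(y_hat != y)
H = [lambda x: x, lambda x: 1 - x, lambda x: 0, lambda x: 1]

print(constrained_bayes_risk(H, eta, zero_one))  # best predictor is h(x)=x, with risk 0.1
```

Modifying any of the five components (the joint law, the loss, or the predictor set) changes the value of such descriptors; the stability results of Section 4 bound how much.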
A geometric framework for understanding changes in probability measures exists as well; information geometry seeks to understand spaces of probability measures from the viewpoint of Riemannian geometry (Amari, 2016; Ay et al., 2017).

Loss function ℓ. A theoretically attractive loss may have poor optimization properties, prompting one to replace it with a so-called surrogate loss. Alternatively, an attractive loss could be expensive to compute exactly, making it a candidate for approximation, like replacing a Wasserstein-based loss with a Sinkhorn-based loss in probability estimation (Cuturi, 2013). Replacing the loss represents a tradeoff between theoretical properties and computational efficiency, and the quantitative details of this tradeoff have been explored in many contexts (Lin, 2004; Zhang, 2004; Bartlett et al., 2006; Steinwart, 2007; Awasthi et al., 2022; Mao et al., 2023).

Predictor set H. Universal approximation theorems, popular in deep learning, establish that certain classes of models can approximate large classes of predictors arbitrarily well (Pinkus, 1999). These can be seen as theorems which compare large, intractable predictor sets to those that can be produced by a given model, aiming to show that there is effectively no difference between selecting a predictor from either set. Approximation theory more broadly is similarly concerned with the approximation power of function classes or transforms (Trefethen, 2019).

Input and response spaces X and Y. Modifications of the input and response spaces are implicit in many of the modifications listed above. Additionally, the process of feature engineering (Huyen, 2022, Ch 5), for which there is little established theory, can be seen as a modification of the input space.
Examples of output space modification include the common relaxation of classification to class probability estimation or, less commonly, the discretization of the response space into a finite set of labels (Langford and Zadrozny, 2005), which served as a motivation for the exact results by Reid and Williamson (2011).

Combinations. Our notion of a learning problem comprises five separate components. There are some existing results that explore how changes to some components affect the others. Some of this work goes under the name of Machine Learning Reductions (Langford, 2009). There are results on how changes to the distribution of examples η (due to noise) have the effect of changing the loss function ℓ (with label noise) (Rooyen and Williamson, 2018) or the model class H (with attribute noise) (Bishop, 1995; Dhifallah and Lu, 2021; Rothfuss et al., 2019; Smilkov et al., 2017; Williamson and Cranko, 2024).

There is precedent for geometric frameworks regarding statistical learning. We have mentioned the rise of information geometry as a way to understand statistical manifolds. Also, much as we aim to do with supervised learning problems, Le Cam developed a notion of distance (Le Cam, 1964; Mariucci, 2016) between the statistical experiments of Blackwell (Blackwell, 1951). A modified discrepancy incorporating a loss function was introduced by Torgersen (1991, Complement 37). We will discuss the relationship between supervised learning problems and statistical experiments in Section 3.2.

While much research has sought to quantify and control the effects of various specific corruptions on supervised learning problems, no framework so far has been comprehensive enough to consider changes to all aspects of a problem. That is, while many have given answers to Question 1, to the best of our knowledge no attempt has been made at a unified answer to Question 2.

Figure 1: A simple depiction of how a sequence of changes can push a particular problem through the space of problems. An idealized problem P is modified by a sequence of four changes (e.g., approximating the loss, restricting the predictors), each change moving the problem some distance in the space of problems. Once the sequence of corruptions has been applied, we call the resulting problem P′. The original and corrupted problems are connected by a dashed line representing the Risk distance d_R(P, P′) between them.

1.3 Our contributions.

We provide a geometric framework with which to answer both Questions 1 and 2 in the form of the Risk distance. Our primary tool for answering Question 1 with the Risk distance is the central topic of Section 4: stability results. A stability result is an upper bound on how much a certain change to a problem can move that problem under the Risk distance. That is, if we apply a certain change to a problem P to obtain a new problem P′, such as a change to the loss function, the data collection process, or the set of potential predictors, a stability result is an upper bound on d_R(P, P′). Such a result acts as a quantitative guarantee that a small change to some aspect of the problem will not drastically change the problem as a whole.

Comparing problems via a metric also presents advantages over more general kinds of dissimilarity scores: the triangle inequality lets one string together upper bounds on distances. To wit, let x_0 be a point in a metric space (X, d_X). Suppose we create a sequence of points x_0, x_1, ..., x_k by pushing each point x_{i-1} a distance of at most ε to get the next point x_i in the sequence. Then the triangle inequality assures us that

    d_X(x_0, x_k) ≤ Σ_{i=1}^{k} d_X(x_{i-1}, x_i) ≤ kε.

Figure 2: The space P of all supervised learning problems and a descriptor. A descriptor is modeled as a map s : P → S whose codomain is some metric space (S, d_S).
A descriptor s is regarded as stable if there is some constant C > 0 such that

    C · d_R(P, P′) ≥ d_S(s(P), s(P′))

for all P, P′ ∈ P. The figure also depicts the case when S is the set of non-negative reals and s(P) is the constrained Bayes risk B(P) of P. In that case, one of our stability results (Theorem 37) establishes that d_R(P, P′) ≥ |B(P) − B(P′)| for all P, P′ ∈ P.

In other words, in a metric space, a short sequence of small pushes cannot add up to a drastic change. In the context of the Risk distance, this feature lets us handle multiple simultaneous changes to a problem by applying the triangle inequality and stringing together stability results. This gives an answer to Question 2. See Figure 1 for an illustration. By design, the Risk distance will be a pseudometric (that is, a metric for which distinct points can be distance zero apart), but the above virtues of metric spaces apply to pseudometric spaces as well.

Along with stability of the Risk distance under various problem modifications, Section 4 will also demonstrate stability of certain descriptors of learning problems with respect to the Risk distance. In particular, we prove in Section 4.1 that a descriptor of a problem called its constrained Bayes risk is stable under the Risk distance. In other words, if the Risk distance between two problems P and P′ is less than ε, then the optimal expected loss across all predictors for P is within ε of that of P′; see Figure 2 for an illustration. We obtain this stability result by first proving stability under the Risk distance of a stronger descriptor, called the loss profile set of a problem, which describes the possible distributions of losses incurred by a random observation. We also demonstrate stability under the Risk distance of a descriptor from statistical learning theory known as the Rademacher complexity, for arbitrarily large sample sizes.
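The chaining argument above is easy to check numerically. The sketch below (ours, not from the paper) pushes a point along the real line, where d(x, y) = |x − y|, k times by at most ε, and verifies the kε bound:

```python
import random

# A sequence of k pushes of size at most eps on the real line:
# the triangle inequality guarantees total displacement <= k * eps.
random.seed(0)
eps, k = 0.1, 20

xs = [0.0]
for _ in range(k):
    xs.append(xs[-1] + random.uniform(-eps, eps))  # one push of size <= eps

displacement = abs(xs[-1] - xs[0])                          # d(x_0, x_k)
path_length = sum(abs(b - a) for a, b in zip(xs, xs[1:]))   # sum of step sizes
assert displacement <= path_length <= k * eps
```

The same inequality, applied in the (pseudo)metric space of problems with the individual stability results supplying the per-step bounds, is what turns answers to Question 1 into an answer to Question 2.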
In addition to facilitating stability results, the Risk distance gives us a notion of convergence for supervised learning problems. Just as it is fruitful to discuss the convergence of sequences of functions, random variables, or probability distributions, so too can it be useful to discuss the convergence of a sequence of problems. In Section 5.4, we will demonstrate an analog of the Glivenko-Cantelli theorem by proving that the empirical problem, a problem constructed using a finite random sample of observations instead of the true distribution from which it is sampled, converges to the true problem under the Risk distance with probability 1.

Lastly, the Risk distance lets us consider the geometry of the space of problems and consequences thereof. In Sections 5.1 and 5.2, we connect the geometric concept of a geodesic between two problems with the probabilistic notion of couplings of distributions associated with those problems. This connection lets us write down some explicit geodesics in the space of problems. In Section 5.3, we show that classification problems, meaning those problems with a finite output space, are dense in a larger space of problems under the Risk distance. This density result proves useful when establishing the convergence of empirical problems.

In the final two sections, we also study two variants of the Risk distance. In Section 6, we define a parameterized family of distances, each called the Lp-Risk distance for p ∈ [1, ∞]. The Lp-Risk distance incorporates weights in the form of a specified probability measure on the set of possible predictors. The use of this additional data results in improved stability when compared to the Risk distance. In Section 7, we also define a stronger version of the Risk distance, called the Connected Risk distance, which is designed to be more sensitive to changes in the risk landscape of a problem.
In particular, we show that a topological summary of the problem called its Reeb graph is stable under the Connected Risk distance.

1.4 Outlook

We regard the Risk distance as a tool to produce theoretical guarantees. To prove that problems are stable under a certain kind of change with respect to the Risk distance, or that certain descriptors are stable under the Risk distance, we need only bound the Risk distance above or below. The utility of the Risk distance lies not in explicit algorithms and associated computations, but in the theoretical landscape that it facilitates.

The current landscape of machine learning is rapidly expanding, particularly in the space of problem formulations. Given the staggering growth of machine learning research (e.g., 27,000 submissions to NeurIPS 2025 (NeurIPS, 2025)), there is clear value in providing a conceptual framework to help organize and interpret this proliferation of methods. This paper takes an initial step toward an organizing map of this space: namely, we endow the class of supervised learning problems with a metric that induces a geometric organization, so existing and emerging methods can be situated and compared within a common landscape. While this paper focuses on supervised learning, many generative models, such as GANs, are implicitly tied to supervised learning problems via divergence measures (e.g., f-divergences), which are in one-to-one correspondence with certain supervised objectives (Nowozin et al., 2016; Williamson and Cranko, 2024).

1.5 Other previous work.

Kleinberg's seminal paper (Kleinberg, 2002) initiated a thread of research into the formal study of clustering problems (Carlsson and Mémoli, 2010; Carlsson and Mémoli, 2013; Ackerman, 2012; Cohen-Addad et al., 2018). In a spirit similar to this thread, the present paper aims to formally examine supervised learning problems.
The Risk distance introduced in this paper participates in a growing thread of research into optimal-transport-based distances, a thread which begins with the metrics discussed in Section 2.3 and runs through more recent variants including optimal transport distances for graph-based data (Chowdhury and Mémoli, 2019; Chen et al., 2022), matrices (Peyré et al., 2016), partitioned data sets (Chowdhury et al., 2021), labeled data sets (Alvarez-Melis and Fusi, 2020), optimal transport of both data and features jointly (Vayer et al., 2020b), or of data with quite general structures (Vayer et al., 2020a; Patterson, 2020), as well as approximations (Scetbon et al., 2022; Séjourné et al., 2021; Vincent-Cuaz et al., 2021) and regularizations (Peyré et al., 2016) thereof.

1.6 Statement of contribution

The genesis of this project dates back to 2010-2011 when F.M. visited R.W. at NICTA (Canberra, Australia). The formulation of the results, finding the proofs, and the vast majority of the writing of the paper was done by F.M. and B.V. R.W. contributed the initial formulation of the question to be addressed, provided some context from the ML literature, and reviewed and slightly revised the final document.

Contents

1 Introduction
  1.1 Overview of our approach.
  1.2 Previous approaches.
  1.3 Our contributions.
  1.4 Outlook
  1.5 Other previous work.
  1.6 Statement of contribution
2 Background
  2.1 Preliminary Notation
  2.2 Pushforward Measures, Correspondences, and Couplings
  2.3 Metric Geometry and Optimal Transport
    2.3.1 Hausdorff Distance
    2.3.2 Gromov-Hausdorff Distance
    2.3.3 Wasserstein Distance
    2.3.4 Gromov-Wasserstein Distance
  2.4 Total Variation Distance
  2.5 Markov Kernels
  2.6 Empirical Measures and Glivenko-Cantelli Classes
3 The Risk Distance
  3.1 Problems
  3.2 Comparison with Blackwell's Statistical Experiments
  3.3 Motivating Properties
  3.4 The Risk Distance
  3.5 Connections to the Gromov-Wasserstein Distance
4 Stability
  4.1 Stability of the Constrained Bayes Risk and Loss Profile
  4.2 Stability of Rademacher Complexity
  4.3 Stability under Sampling Bias
  4.4 Stability under Sampling Noise
  4.5 Stability under Modification of Predictor Set
5 Geometry of the Space of Problems
  5.1 Geodesics
  5.2 Existence of Optimal Couplings and Correspondences
  5.3 Density of Classification Problems
  5.4 Convergence of Empirical Problems
6 The Lp-Risk Distance: Weighting Problems
  6.1 Weighting Predictor Sets and the Lp-Risk Distance
  6.2 Connection to the Gromov-Wasserstein Distance
  6.3 Stability of Loss Profile Distribution
  6.4 Improved Density
  6.5 Improved Stability
7 The Connected Risk Distance: Risk Landscapes
  7.1 Insensitivities of the Risk distance
  7.2 The Connected Risk distance
  7.3 Risk Landscapes and Reeb Graphs
8 Discussion
A Relegated Proofs
  A.1 Proofs and Details from Section 3
    A.1.1 Connecting the Risk Distance with the Gromov-Wasserstein Distance
  A.2 Proofs from Section 4
  A.3 Proofs from Section 5
  A.4 Proofs from Section 6
  A.5 Proofs from Section 7
B Weak Isomorphism with Classification Problems

2. Background

We begin with some necessary concepts and notation from measure theory, metric geometry, and optimal transport. We refer the reader to the standard reference of Villani (2003) for more on measure-theoretic and optimal transport concepts, and that of Burago, Burago, and Ivanov (2001) for metric geometry concepts.

2.1 Preliminary Notation

We use π_i to represent a projection map X_1 × X_2 × ... × X_n → X_i onto the i-th component of a product.
More generally, π_{i_1,...,i_k} : X_1 × ... × X_n → X_{i_1} × ... × X_{i_k} represents the projection onto coordinates i_1, ..., i_k.

When integrating a measurable function f against a measure µ, we will favor the notation

    ∫_{X×Y} f(x, y) µ(dx × dy)

if f is a function of two variables, and similarly for more variables. We often suppress the domain of integration under the integral symbol since it is implicit in the measure µ. Rarely, when many variables are necessary and horizontal space is at a premium, we may instead use the notation

    ∫ f(x, y) dµ(x, y).

Given a measurable space (Z, Σ_Z), we let Prob(Z) denote the space of probability measures on (Z, Σ_Z). For any z ∈ Z, we let δ_z ∈ Prob(Z) denote the Dirac probability measure at z defined, for A ∈ Σ_Z, as δ_z(A) = 1 whenever z ∈ A and as 0 otherwise. Given a probability measure α ∈ Prob(R), we let mean(α) := ∫_R t α(dt) denote its mean whenever it exists.

2.2 Pushforward Measures, Correspondences, and Couplings

We recall some basic ideas from measure theory and establish notation. Given a measure space (X, Σ_X, µ), a measurable space (Y, Σ_Y), and a measurable map f : X → Y, the pushforward of µ through f is the measure on (Y, Σ_Y) defined by

    f♯µ(A) := µ(f⁻¹(A))  for all A ∈ Σ_Y.

If µ is a probability measure, f♯µ will be a probability measure as well. Hence f♯ is a function Prob(X) → Prob(Y). A pushforward has a probabilistic interpretation: if X is a random variable in a measurable space X with law µ, and f : X → Y is measurable, then the random variable f(X) has law f♯µ.

Given two sets X and Y, a subset R ⊆ X × Y is called a correspondence if for all x ∈ X there is some y ∈ Y with (x, y) ∈ R, and for all y ∈ Y there is some x ∈ X with (x, y) ∈ R. We use C(X, Y) to denote the set of all correspondences between X and Y.

A coupling between two probability measure spaces (X, Σ_X, µ_X) and (Y, Σ_Y, µ_Y) is a probability measure on X × Y with marginals µ_X and µ_Y.
That is, a probability measure γ on X × Y is a coupling if the pushforward measures satisfy (π_1)♯γ = µ_X and (π_2)♯γ = µ_Y. We use Π(µ_X, µ_Y) to denote the set of all such measures. Couplings can be seen as a probabilistic counterpart to correspondences.

The following are standard methods for constructing correspondences between two sets X and Y.

- The entire product X × Y is always a correspondence. In particular, this shows that C(X, Y) cannot be empty.
- If X = Y, one can construct the diagonal correspondence {(x, x) | x ∈ X}.
- If Y = {∗} is the one-point set, then there is a unique correspondence X × {∗}.
- The graph of any surjection f : X → Y is a correspondence between X and Y.

Analogous constructions exist for couplings as well. Let (X, µ_X) and (Y, µ_Y) be probability spaces.

- The product measure µ_X ⊗ µ_Y is always a coupling, sometimes called the independent coupling. This shows that Π(µ_X, µ_Y) cannot be empty.
- Suppose (X, µ_X) = (Y, µ_Y). If ∆_X : X → X × X denotes the diagonal map ∆_X(x) = (x, x), then the diagonal coupling is the pushforward measure (∆_X)♯µ_X.
- If Y = {∗} is the one-point set, then forcibly µ_Y = δ_∗ and there is a unique coupling µ_X ⊗ δ_∗ in this case.
- Suppose f : X → Y is a measure-preserving map, meaning f♯µ_X = µ_Y. Let i_f : X → X × Y be the map whose image is the graph of f, meaning i_f(x) := (x, f(x)). Then the pushforward (i_f)♯µ_X is a coupling between µ_X and µ_Y.

When we discuss Markov kernels in Section 2.5, we will mention one more method for generating couplings, by combining a probability measure α ∈ Prob(X) and a Markov kernel β : X → Prob(Y) using reverse disintegration.

We will make use of the following standard coupling result, which can be found in any standard optimal transport text, such as that of Villani (2003, Lemma 7.6). Note that a topological space is called Polish if it is separable and metrizable by a complete metric.
Most familiar spaces are Polish, including R^n and open subsets thereof, manifolds with or without boundary, and finite discrete spaces. A Borel measurable space arising from a Polish topological space is called Polish as well.

Proposition 1 (The Gluing Lemma)  Let µ_1, µ_2, and µ_3 be probability measures on Polish spaces X_1, X_2, and X_3 respectively. Let γ_{1,2} ∈ Π(µ_1, µ_2) and γ_{2,3} ∈ Π(µ_2, µ_3). Then there exists a probability measure γ_{1,2,3} on X_1 × X_2 × X_3 with marginals γ_{1,2} on X_1 × X_2 and γ_{2,3} on X_2 × X_3.

We call such a γ_{1,2,3} a gluing of γ_{1,2} and γ_{2,3}. An analog of the gluing lemma for correspondences (see Burago et al., 2001, Exercise 7.3.26) instead of couplings is included below in order to highlight the parallels between couplings and correspondences.

Proposition 2 (The Gluing Lemma for Correspondences)  Let A_1, A_2, and A_3 be sets, and let R_{1,2} ∈ C(A_1, A_2) and R_{2,3} ∈ C(A_2, A_3). Then there exists a subset R_{1,2,3} ⊆ A_1 × A_2 × A_3 which projects to R_{1,2} on A_1 × A_2 and R_{2,3} on A_2 × A_3.

We call such an R_{1,2,3} a gluing of R_{1,2} and R_{2,3}.

Proof  Selecting R_{1,2,3} = {(x, y, z) ∈ A_1 × A_2 × A_3 | (x, y) ∈ R_{1,2}, (y, z) ∈ R_{2,3}} provides one such subset.

While some measurable spaces can be quite poorly behaved (for instance, those that fail to exhibit a gluing lemma), the measurable spaces in this paper will all arise from relatively well-behaved topological spaces.

Assumption 3  We assume that any measurable space that appears will arise from a Polish topological space by equipping it with the Borel σ-algebra. Therefore, when convenient, we will suppress the σ-algebra in our notation for measurable spaces. That is, instead of writing a measurable space as a pair (X, Σ_X) with X a Polish space and Σ_X its Borel σ-algebra, we will simply write X with the understanding that the Borel σ-algebra is being used. We will similarly write measure spaces (X, Σ_X, µ) as pairs (X, µ).
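On finite spaces the standard coupling constructions of Section 2.2 can be checked directly. The sketch below is ours, for illustration: a measure is a dict mapping points to probabilities, a coupling is a dict mapping pairs to probabilities, and `marginals` computes the two pushforwards under the projections.

```python
# Sanity-check the standard coupling constructions on finite spaces.
# Representation (dicts of probabilities) and names are ours, not the paper's.

def marginals(gamma):
    """Pushforwards of a coupling under the two projection maps."""
    mx, my = {}, {}
    for (x, y), p in gamma.items():
        mx[x] = mx.get(x, 0.0) + p
        my[y] = my.get(y, 0.0) + p
    return mx, my

mu = {"a": 0.5, "b": 0.5}
nu = {0: 0.25, 1: 0.75}

# 1. Independent coupling: the product measure mu (x) nu.
indep = {(x, y): p * q for x, p in mu.items() for y, q in nu.items()}
assert marginals(indep) == (mu, nu)

# 2. Diagonal coupling of mu with itself: pushforward under x -> (x, x).
diag = {(x, x): p for x, p in mu.items()}
assert marginals(diag) == (mu, mu)

# 3. Graph coupling from a measure-preserving map f: mass sits on (x, f(x)).
f = {"a": 1, "b": 0}                       # f pushes mu forward to {0: 0.5, 1: 0.5}
graph = {(x, f[x]): p for x, p in mu.items()}
assert marginals(graph) == (mu, {0: 0.5, 1: 0.5})
```

The gluing of Proposition 2 admits an equally direct finite construction: take all triples (x, y, z) with (x, y) in the first correspondence and (y, z) in the second, exactly as in the proof above.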
Bearing Assumption 3 in mind, we define the support of the measure µ as the closed set

    supp[µ] := {x ∈ X | µ(U) > 0 for every open U containing x}.

2.3 Metric Geometry and Optimal Transport

Here we present four metrics from the field of metric geometry. The following diagram (inspired by Mémoli, 2011, Fig. 2) illustrates the relationships between the four metrics.

    Hausdorff ------(remove ambient space)------> Gromov-Hausdorff
        |                                              |
    (introduce probability measure)        (introduce probability measure)
        |                                              |
        v                                              v
    Wasserstein ----(remove ambient space)----> Gromov-Wasserstein

We do not explicitly use the Gromov-Hausdorff or Gromov-Wasserstein distances in this paper. However, these distances provide context by illustrating common threads that run through all four metrics and are shared by the Risk distance. In all four of the metrics, objects are compared by optimally aligning the two objects, whether via correspondences or couplings, and numerically scoring the distortion incurred by an optimal alignment. The Risk distance (Definition 26) follows this same template. To emphasize the continuation of this thread, we will take a moment to compare the Risk distance to the Gromov-Wasserstein distance in Section 3.5. Note that we will often suppress the underlying metric in our notation, writing a metric space (X, d_X) as simply X.

2.3.1 Hausdorff Distance

The first metric is the Hausdorff distance. Introduced by Hausdorff (1914), it provides a notion of distance between compact subsets of a fixed metric space. Given two compact subsets X, Y of a metric space (Z, d_Z), the Hausdorff distance between them is typically defined to be

    d_H^Z(X, Y) := max{ sup_{x∈X} inf_{y∈Y} d_Z(x, y),  sup_{y∈Y} inf_{x∈X} d_Z(x, y) }.

To emphasize the commonalities among the four metrics of this section, we will use an alternative equivalent definition provided by Mémoli (2011, Prop. 2.1) in terms of correspondences.
Proposition 4  Given two compact subsets X and Y of a metric space (Z, d_Z), the Hausdorff distance between X and Y is equal to

    d_H^Z(X, Y) = inf_{R∈C(X,Y)} sup_{(x,y)∈R} d_Z(x, y).

2.3.2 Gromov-Hausdorff Distance

The second metric of this section is the Gromov-Hausdorff distance, an extension of the Hausdorff distance that removes the need for a common ambient metric space. While the Hausdorff distance compares compact subsets of a common metric space Z, the Gromov-Hausdorff distance compares pairs of compact metric spaces.

Definition 5  Given two compact metric spaces (X, d_X) and (Y, d_Y), the Gromov-Hausdorff distance between them is given by

    d_GH(X, Y) := (1/2) inf_{R∈C(X,Y)} sup_{(x,y),(x′,y′)∈R} |d_X(x, x′) − d_Y(y, y′)|.

The function d_GH was first discovered by Edwards (1975, Def. III.3) and later rediscovered by its namesake Gromov (1981), both with alternative (but equivalent) definitions different from the one given in Definition 5. The formulation above in terms of correspondences is due to Kalton and Ostrovskii (1997). This formula does indeed define a metric on the set of compact metric spaces up to isometry, meaning that d_GH(X, Y) = 0 if and only if X and Y are isometric (Burago et al., 2001, Thm. 7.3.30).

2.3.3 Wasserstein Distance

The third metric of this section is the Wasserstein distance.

Definition 6 (Kantorovich, 1942)  Let p ∈ [1, ∞). Given two probability measures µ, ν on a metric space (Z, d_Z), the p-Wasserstein distance between µ and ν is

    d_{W,p}^Z(µ, ν) := inf_{γ∈Π(µ,ν)} ( ∫_{Z×Z} d_Z^p(x, y) γ(dx × dy) )^{1/p}.

Notice that to construct the Wasserstein distance, we can begin with the Hausdorff metric and replace the correspondences with their probabilistic counterparts: couplings. We then replace the supremum with an Lp-style integral, and our job is done. Moving forward, since we primarily make use of the 1-Wasserstein distance, we will refer to d_{W,1} simply as the Wasserstein distance.
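For finite subsets, the equivalence in Proposition 4 can be verified by brute force: enumerate every subset of X × Y, keep those that are correspondences, and minimize the worst pair distance. The sketch below is ours (illustrative only, and exponential in |X × Y|, so suitable only for tiny examples):

```python
from itertools import combinations

# Brute-force check of the correspondence formulation of the Hausdorff
# distance (Proposition 4) on tiny finite subsets of the real line.

def hausdorff_classic(X, Y, d):
    """max of the two directed sup-inf distances."""
    return max(max(min(d(x, y) for y in Y) for x in X),
               max(min(d(x, y) for x in X) for y in Y))

def hausdorff_via_correspondences(X, Y, d):
    """inf over correspondences R of the sup of d over pairs in R."""
    pairs = [(x, y) for x in X for y in Y]
    best = float("inf")
    for r in range(1, len(pairs) + 1):
        for R in combinations(pairs, r):
            # R is a correspondence iff it hits every x in X and every y in Y.
            if {x for x, _ in R} == set(X) and {y for _, y in R} == set(Y):
                best = min(best, max(d(x, y) for x, y in R))
    return best

d = lambda a, b: abs(a - b)
X, Y = [0.0, 1.0, 5.0], [0.5, 4.0]
assert hausdorff_classic(X, Y, d) == hausdorff_via_correspondences(X, Y, d) == 1.0
```

The same alignment-and-score template, with correspondences traded for couplings and the supremum for an integral, produces the Wasserstein and Gromov-Wasserstein distances below.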
The Wasserstein distance is also called the earth mover's distance due to the following interpretation. Suppose $(X, d_X)$ represents the ground of a garden equipped with the 2-d Euclidean metric, and $\mu$ describes the distribution of a pile of soil on $X$. Suppose that $\nu$ is a desired distribution of soil. Then a coupling $\gamma \in \Pi(\mu, \nu)$ represents a plan for moving the soil from distribution $\mu$ to distribution $\nu$, and $d_{W,1}^X(\mu, \nu)$ is the amount of work that an optimal plan will require. Indeed, the Wasserstein metric seems to have first appeared in the work of the Soviet mathematician and economist L. V. Kantorovich, who was studying optimal transportation of resources. For an English translation of Kantorovich's 1942 paper, see Kantorovich (2006), and for a historical survey of this often-rediscovered metric, see Vershik (2006).

2.3.4 Gromov-Wasserstein Distance

The fourth metric of this section is the Gromov-Wasserstein distance. Just as the Gromov-Hausdorff metric is an ambient-space-free variant of the Hausdorff distance, the Gromov-Wasserstein distance is an ambient-space-free variant of the Wasserstein distance. It can be used to compare probability measures regardless of the underlying metric space.

Definition 7 A metric measure space is a triple $(X, d_X, \mu_X)$ such that $(X, d_X)$ is a metric space, and $\mu_X$ is a Borel probability measure on $X$ with full support. For more on metric measure spaces, we direct the reader to Mikhael Gromov's foundational book (Gromov, 2007, Chapter 3½).

Definition 8 (Mémoli (2007, 2011)) Let $p \in [1, \infty)$. The $p$-Gromov-Wasserstein distance between two metric measure spaces $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ is given by
$$d_{GW,p}(X, Y) := \inf_{\gamma \in \Pi(\mu_X,\mu_Y)} \frac{1}{2} \left( \iint \big| d_X(x, x') - d_Y(y, y') \big|^p\, \gamma(dx \times dy)\, \gamma(dx' \times dy') \right)^{1/p}.$$

As with the Wasserstein distance, we will typically take $p = 1$ and refer to $d_{GW,1}$ simply as the Gromov-Wasserstein distance. While preceded by Gromov's box distance and observable distance (Gromov, 1981, Ch.
3½), the first to formulate a Wasserstein-type adaptation of the Gromov-Hausdorff distance to the setting of metric measure spaces was Sturm (2006). The definition above was introduced and developed by Mémoli (2007, 2011) in order to apply methods from optimal transport to the computational problem of data comparison and matching. Sturm (2023) extensively explored the geometry of the resulting "space of spaces". The Gromov-Wasserstein distance has spawned variants in many domains (some of which are referenced in Section 1) and applications to various problems in Machine Learning and Data Science such as unsupervised natural language translation (Alvarez-Melis and Jaakkola, 2018), imitation learning (Fickinger et al., 2022), generative modeling (Bunne et al., 2019), and the study of Graph Neural Networks in the form of the Weisfeiler-Lehman distance and a corresponding version of the Gromov-Wasserstein distance for comparing labelled Markov processes (Chen et al., 2022). See also Peyré et al. (2019); Vayer et al. (2020a, 2019); Peyré et al. (2016); Scetbon et al. (2022); Chowdhury and Needham (2020); Séjourné et al. (2021); Delon et al. (2022); Beier et al. (2022); Le et al. (2022); Dumont et al. (2024); Arya et al. (2024) for other work related to the Gromov-Wasserstein distance.

2.4 Total Variation Distance

There is one additional metric that we will use. The simplest method of comparing two probability measures on a common measurable space is the total variation distance. See Levin et al. (2017, Sec. 4.1) for an introduction.

Definition 9 Let $\mu, \nu$ be probability measures on the measurable space $(\Omega, \mathcal{S})$. The total variation distance between $\mu$ and $\nu$ is given by
$$d_{TV}(\mu, \nu) := \sup_{B \in \mathcal{S}} \big| \mu(B) - \nu(B) \big|.$$

It is not actually necessary to take an absolute value in the definition of $d_{TV}$, since the value $|\mu(B) - \nu(B)|$ is achieved by either $\mu(B) - \nu(B)$ or $\mu(\Omega \setminus B) - \nu(\Omega \setminus B)$. Hence,
$$d_{TV}(\mu, \nu) = \sup_{B \in \mathcal{S}} \big( \mu(B) - \nu(B) \big).$$
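On a finite space, the supremum in Definition 9 is attained at the event $B = \{\omega \mid \mu(\omega) > \nu(\omega)\}$, so $d_{TV}$ is simply the total positive part of $\mu - \nu$. A small sketch (the function name is ours):

```python
def tv_distance(mu, nu):
    """Total variation distance between two probability mass functions given
    as dicts; the supremum over events B is attained at B = {w : mu(w) > nu(w)}."""
    support = set(mu) | set(nu)
    return sum(max(mu.get(w, 0.0) - nu.get(w, 0.0), 0.0) for w in support)
```

Although the formula looks one-sided, it is symmetric in its arguments, mirroring the remark above that the absolute value may be dropped.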
Additionally, the total variation distance can be seen as the Wasserstein distance when $\Omega$ is equipped with the discrete metric. In other words,
$$d_{TV}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \int \mathbf{1}_{x \neq y}\, \gamma(dx \times dy) = \inf_{\gamma \in \Pi(\mu,\nu)} \gamma\big( \{(x, y) \mid x \neq y\} \big).$$
Note that the above equalities only make sense if the diagonal $\Delta := \{(x, x) \mid x \in \Omega\}$ is measurable. This is true under Assumption 3; if $\Omega$ is a Polish space, then $\Delta$ is closed in $\Omega \times \Omega$.

2.5 Markov Kernels

Here we establish definitions and notation surrounding Markov kernels. See Klenke (2014, Ch. 14.2) for a comprehensive introduction. Let $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ be measurable spaces. A map $M : X \times \Sigma_Y \to \mathbb{R}$ is called a Markov kernel if $M(x, \cdot) \in \mathrm{Prob}(Y)$ for all $x \in X$, and $M(\cdot, A)$ is measurable for all $A \in \Sigma_Y$. One often writes $M(x)(A)$ instead of $M(x, A)$ and thinks of $M$ as a map $X \to \mathrm{Prob}(Y)$. One can use Markov kernels to represent functions whose output is random; given an input $x$, the random output is described by the probability measure $M(x)$. Markov kernels also go by the name "communication channel" in information theory since they can be used to model a channel in which random noise affects the output. We will use Markov kernels to model noise in Sections 4.4 and 6.5.

There are several standard methods for constructing Markov kernels. A probability measure can be seen as a simple Markov kernel: given $\mu \in \mathrm{Prob}(X)$, we can construct a Markov kernel from a singleton $\{\ast\} \to \mathrm{Prob}(X)$ mapping $\ast \mapsto \mu$. Measurable functions can be seen as Markov kernels as well: given a measurable $f : X \to Y$, we can construct the deterministic Markov kernel induced by $f$, denoted $\delta_f$, by setting $\delta_f(x)$ to be the point mass $\delta_{f(x)}$.

We can construct more interesting Markov kernels by the process of disintegration. Let $\mu \in \mathrm{Prob}(X \times Y)$ with $X$-marginal $\alpha$. Given that $X$ and $Y$ are Polish spaces, there exists a Markov kernel $\beta : X \to \mathrm{Prob}(Y)$ satisfying
$$\mu(A \times B) = \int_A \beta(x)(B)\, \alpha(dx). \qquad (1)$$
This characterizes $\beta$ up to an $\alpha$-null set.
We call such a $\beta$ the disintegration of $\mu$ along $X$. The existence of such a $\beta$ is shown in, for instance, (Bogachev, 2007, Theorem 10.4.8). Disintegration can be seen as a measure-theoretic formulation of conditional probability. Indeed, if $\mathsf{X}$ and $\mathsf{Y}$ are random variables in $X$ and $Y$ respectively with joint law $\mu$, then the disintegration $\beta$ of $\mu$ along $X$ gives the conditional law of $\mathsf{Y}$ with respect to $\mathsf{X}$. That is, $\beta(x)(B) = \mathbb{P}(\mathsf{Y} \in B \mid \mathsf{X} = x)$ for all $x \in X$ and measurable subsets $B \subseteq Y$.

One can also reverse the disintegration process. Just as a joint law of $(\mathsf{X}, \mathsf{Y})$ can be recovered from the law of $\mathsf{X}$ and the conditional law of $\mathsf{Y}$ given $\mathsf{X}$, so too can we recover a coupling in a similar manner. Indeed, Equation (1) shows that the coupling $\mu$ can also be recovered from $\beta$ and $\alpha$. This is another common method for producing couplings; one selects a probability measure $\alpha \in \mathrm{Prob}(X)$ and a Markov kernel $\beta : X \to \mathrm{Prob}(Y)$ and defines $\mu$ as in Equation (1).

Given two Markov kernels $M : X \to \mathrm{Prob}(Y)$ and $N : Y \to \mathrm{Prob}(Z)$, their Markov kernel composition $MN : X \to \mathrm{Prob}(Z)$ is given by
$$MN(x)(C) := \int_Y N(y)(C)\, M(x)(dy)$$
for all measurable $C \subseteq Z$. Given Markov kernels $M : X \to \mathrm{Prob}(Y)$ and $N : W \to \mathrm{Prob}(Z)$, the independent product of Markov kernels $M \otimes N : X \times W \to \mathrm{Prob}(Y \times Z)$ is given by
$$(M \otimes N)(x, w) := M(x) \otimes N(w),$$
where the $\otimes$ on the right hand side denotes the product of measures.

2.6 Empirical Measures and Glivenko-Cantelli Classes

If a data set is sampled from an underlying probability measure $\mu$, the sample itself can be modeled as a random measure called an empirical measure.

Definition 10 Given a probability space $(Z, \mu)$, the $n$th empirical measure of $\mu$ is the random measure given by
$$\mu_n := \frac{1}{n} \sum_{i=1}^n \delta_{Z_i},$$
where $Z_1, \ldots, Z_n$ is an i.i.d. sample with law $\mu$.

A practitioner, having access only to a random sample and not the true underlying law $\mu$ from which it is drawn, can nevertheless approximate $\mu$ with $\mu_n$. One hopes that for large $n$, $\mu_n$ is similar enough to $\mu$ for practical purposes.
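The approximation of $\mu$ by $\mu_n$ is easy to observe numerically: integrating a function against $\mu_n$ amounts to averaging it over the sample, and for large $n$ this approaches the integral against $\mu$. A small illustrative sketch (the uniform law and the test function are our choices, not from the paper): for $\mu$ uniform on $[0, 1]$ and $f(x) = x^2$, we have $\int f\, d\mu = 1/3$.

```python
import random

def integrate_empirical(f, sample):
    """Integral of f against the empirical measure mu_n = (1/n) * (sum of point masses)."""
    return sum(f(z) for z in sample) / len(sample)

random.seed(0)
sample = [random.random() for _ in range(100_000)]       # i.i.d. draws from Uniform[0, 1]
estimate = integrate_empirical(lambda x: x * x, sample)  # approximates the true value 1/3
```

A Glivenko-Cantelli class, introduced next, strengthens this pointwise statement to uniform convergence over a whole family of integrands.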
For instance, one would hope that integrating against $\mu$ and $\mu_n$ yields similar results when $n$ is large. One formalization of this hope, taken from statistical learning theory, is the notion of a Glivenko-Cantelli class.

Definition 11 Let $(X, \mu)$ be a probability space, and let $\mathcal{F}$ be a collection of integrable functions $X \to \mathbb{R}$. Let $\mu_n$ denote the $n$th empirical measure of $\mu$. We say that $\mathcal{F}$ is a Glivenko-Cantelli class with respect to $\mu$ if
$$\sup_{f \in \mathcal{F}} \left| \int f(x)\, \mu(dx) - \int f(x)\, \mu_n(dx) \right| \xrightarrow{\ \text{a.s.}\ } 0.$$
We say that $\mathcal{F}$ is a universal Glivenko-Cantelli class if it is a Glivenko-Cantelli class with respect to every $\mu \in \mathrm{Prob}(X)$.

See Van Der Vaart and Wellner (1996) for an overview of Glivenko-Cantelli classes and their relationships to other conditions in the theory of empirical processes. We will employ Glivenko-Cantelli classes in our analysis of the convergence of empirical problems (Section 5.4), and universal Glivenko-Cantelli classes in our discussion of Rademacher complexity (Section 4.2).

The Glivenko-Cantelli condition may be illuminated by the following sufficient condition, taken from the theory of empirical processes, involving the popular Vapnik-Chervonenkis (VC) dimension. Given a family of subsets $\mathcal{A}$ of some set $Z$, the VC dimension of $\mathcal{A}$ is a number which measures the flexibility of the family $\mathcal{A}$. Briefly, the VC dimension of $\mathcal{A}$ is the cardinality of the largest finite set $B \subseteq Z$ such that $\{A \cap B \mid A \in \mathcal{A}\} = \mathrm{Pow}(B)$, where $\mathrm{Pow}$ denotes the power set. If there is no such largest finite set, we say that $\mathcal{A}$ has infinite VC dimension. For an introduction to VC dimension and its place in the theory of statistical learning, see Devroye et al. (1996, Ch. 12).

Definition 12 Let $f : X \to \mathbb{R}$ be any function. The subgraph of $f$ is given by
$$\mathrm{sg}(f) := \{(x, t) \mid t < f(x)\} \subseteq X \times \mathbb{R}.$$

Proposition 13 Let $\mathcal{F}$ be a collection of measurable functions $X \to \mathbb{R}$ that is uniformly bounded below and above.
Suppose $\mathcal{F}$ is endowed with a Polish topology, and subsequently the Borel $\sigma$-algebra, such that the map $(f, x) \mapsto f(x)$ is jointly measurable in $f$ and $x$. If the collection of subgraphs $\{\mathrm{sg}(f) \mid f \in \mathcal{F}\}$ has finite VC dimension, then $\mathcal{F}$ is a universal Glivenko-Cantelli class.

Proposition 13 follows from standard results in the theory of empirical processes which can be found in, for instance, the reference by Van Der Vaart and Wellner (1996). Specifically, we use a combination of Theorems 2.6.7 and 2.4.3 along with Example 2.3.5 from that source.

3. The Risk Distance

In this section, we introduce the points (problems) and distance function (the Risk distance) which will together define our pseudometric space of supervised learning problems.

3.1 Problems

Before we proceed, we must solidify the informal notion of "supervised learning problem" into a formal definition. Supervised learning problems come in many forms, including regression, classification, and ranking problems. We would like a definition that is general enough to encompass all of these examples. In all of these problem types, the goal is to examine the relationship between a random variable in an input space $X$ and another in a response space $Y$. One then aims to select an appropriate function $h : X \to Y$ that predicts the output value given an input. The effectiveness of $h$ is measured using a chosen loss function. (As an aside, even unsupervised problems, such as generating samples from a distribution, can have a supervised problem "under the hood"; for example, f-GANs rely upon judging the suitability of a distribution by use of the f-divergence (Nock et al., 2017), which is intimately related to a Bayes risk (Reid and Williamson, 2011).)
Definition 14 A supervised learning problem, or simply problem, is a 5-tuple $P = (X, Y, \eta, \ell, H)$ where
- $X$ and $Y$ are Polish topological spaces called the input space and response space respectively,¹
- $\eta$ is a Borel probability measure on $X \times Y$ called the joint law,
- $\ell$ is a measurable function $Y \times Y \to \mathbb{R}_{\geq 0}$ called the loss function, and
- $H$ is a family of measurable functions $X \to Y$, called the predictors.²
We require each $h \in H$ to exhibit finite expected loss, meaning
$$\int_{X \times Y} \ell(h(x), y)\, \eta(dx \times dy) < \infty.$$

Notice that our definition does not explicitly mention the random variables in the input and output spaces. Only the most crucial information about the variables, the joint law $\eta$ that couples them, is recorded. This strengthens the analogy with metric geometry; a supervised learning problem is analogous to a metric measure space, where the metric corresponds to the loss function and the measure corresponds to the joint law. The expected loss mentioned in Definition 14 is more commonly known as the risk of a predictor.

1. The Polish condition is put in place to avoid pathologies. In all of the examples we have in mind, $X$ and $Y$ are either open, closed, or convex subsets of $\mathbb{R}^n$, or else discrete sets. Such spaces are Polish. This should also help assuage the reader with set-theoretic concerns. We will speak freely of "the collection of all problems". While this collection is technically too large to be a set, restricting ourselves to the above examples resolves this issue.

2. For greater generality, one could alternatively define a supervised learning problem such that the predictors $h \in H$ are functions into some decision space $D$ which is not necessarily the same as $Y$. That is, one could define a problem to be a 6-tuple $(X, Y, \eta, \ell, H, D)$, where $H$ is a collection of functions $h : X \to D$ and $\ell$ is a function $\ell : D \times Y \to \mathbb{R}_{\geq 0}$. Definition 14 would then cover the special case $D = Y$. The more general definition would, for instance, encompass class probability estimation problems by taking $D = \mathrm{Prob}(Y)$.
Many of the results in this paper would generalize to this more general setting. We leave this investigation to future work.

Definition 15 Given a problem $P = (X, Y, \eta, \ell, H)$, the risk of a predictor $h \in H$ is defined to be
$$R_P(h) := \int \ell(h(x), y)\, \eta(dx \times dy).$$
The constrained Bayes risk of $P$ is defined to be
$$B(P) := \inf_{h \in H} R_P(h) = \inf_{h \in H} \int \ell(h(x), y)\, \eta(dx \times dy).$$

The risk of a predictor is its expected loss on a randomly selected observation. If we measure the performance of a predictor by its risk, then the constrained Bayes risk of a problem represents the optimal possible performance when selecting a predictor from $H$. The constrained Bayes risk is so called by Williamson and Cranko (2024) since it is a modification of the classical notion of the Bayes risk of a predictor. Note that if the predictor set for a problem $P$ is the set $H_{X,Y}$ of all measurable functions $f : X \to Y$, then $B(P)$ agrees with the classical Bayes risk, which we will henceforth denote as
$$B^*(P) := \inf_{h \in H_{X,Y}} R_P(h).$$

We emphasize that a problem does not include a method for choosing a predictor. We think of a problem as something that a practitioner must solve by selecting, through some algorithmic procedure, a function from the predictor set that performs well according to some useful functional built on the loss function, such as the risk $R_P : H \to \mathbb{R}_{\geq 0}$. In this regard, in Section 7 we consider a certain compressed representation of $R_P : H \to \mathbb{R}_{\geq 0}$ which permits gaining insight into the complexity associated to this selection task.

Notation 16 To avoid tediously re-writing these 5-tuples, we establish some persistent notation. We assume, unless otherwise stated, that a problem with the name $P$ has components $(X, Y, \eta, \ell, H)$, a problem named $P'$ has components $(X', Y', \eta', \ell', H')$, and a problem named $P''$ has components $(X'', Y'', \eta'', \ell'', H'')$. Similarly, for any integer $i \geq 0$, the problem $P_i$ has components $(X_i, Y_i, \eta_i, \ell_i, H_i)$.
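When $\eta$ is an empirical measure and $H$ is finite, the quantities in Definition 15 become finite computations: the risk is a sample average of losses, and the constrained Bayes risk is the minimum risk over $H$. A minimal sketch with 0-1 loss and threshold predictors (all names and the toy data are our illustration, not from the paper):

```python
def risk(h, loss, sample):
    """R_P(h): average loss of predictor h over an empirical joint law."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def constrained_bayes_risk(H, loss, sample):
    """B(P): infimal (here, minimal) risk over a finite predictor family H."""
    return min(risk(h, loss, sample) for h in H)

zero_one = lambda y_hat, y: 0.0 if y_hat == y else 1.0
sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]          # (x, y) observations
H = [lambda x, t=t: int(x >= t) for t in (0.0, 0.5, 1.0)]  # threshold predictors
best = constrained_bayes_risk(H, zero_one, sample)         # threshold 0.5 is perfect here
```

With 0-1 loss, `risk` is exactly the empirical misclassification rate, in line with Example 18 below.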
Notation 17 Given any problem $P = (X, Y, \eta, \ell, H)$, we use the shorthand $\ell_h$ for the function $\ell_h : X \times Y \to \mathbb{R}_{\geq 0}$ given by $\ell_h(x, y) := \ell(h(x), y)$.

Example 18 Consider the problem of predicting whether a college student will successfully graduate given their high school GPA and ACT scores. We can model this problem as follows. We represent each student with a point in $\mathbb{R}^2 \times \{0, 1\}$, where the coordinates in $\mathbb{R}^2$ are determined by the student's GPA and ACT score, and their graduation status is represented with a 0 for failure and 1 for success. We think of students as being sampled from an unknown probability measure $\eta$ on $\mathbb{R}^2 \times \{0, 1\}$. Our aim is then to select a predictor $\mathbb{R}^2 \to \{0, 1\}$. We might restrict ourselves to predictors with a linear decision boundary:
$$h_{a,b,c}(x_1, x_2) := \begin{cases} 1 & a x_1 + b x_2 \geq c \\ 0 & a x_1 + b x_2 < c \end{cases}.$$
We may also decide that all incorrect predictions carry the same penalty: any incorrect guess incurs a loss of 1, while a correct guess has loss 0. We have now constructed a problem $P := (\mathbb{R}^2, \{0, 1\}, \eta, \ell, H)$, where $\ell(y, y') = \mathbf{1}_{y \neq y'}$ and $H = \{h_{a,b,c} \mid a, b, c \in \mathbb{R}\}$. The risk of an arbitrary predictor $h_{a,b,c}$ is then
$$R_P(h_{a,b,c}) = \int \ell(h_{a,b,c}(x), y)\, \eta(dx \times dy) = \int_{\mathbb{R}^2 \times \{0\}} h_{a,b,c}(x)\, \eta(dx \times dy) + \int_{\mathbb{R}^2 \times \{1\}} \big(1 - h_{a,b,c}(x)\big)\, \eta(dx \times dy) = \eta\big( \{x \in \mathbb{R}^2 \mid h_{a,b,c}(x) = 1\} \times \{0\} \big) + \eta\big( \{x \in \mathbb{R}^2 \mid h_{a,b,c}(x) = 0\} \times \{1\} \big).$$
That is, the risk of $h_{a,b,c}$ is its misclassification probability. The constrained Bayes risk $B(P)$ is then the infimal misclassification probability across all predictors with linear decision boundaries. Under additional normality conditions, the predictor achieving the risk $B(P)$ can be found with the Fisher linear discriminant (Fisher, 1936).

Example 19 Suppose we are faced with a linear regression problem. That is, we have a random vector $\mathsf{X}$ in $\mathbb{R}^n$ and a random number $\mathsf{Y}$ in $\mathbb{R}$ and are tasked with finding an affine linear functional $\mathbb{R}^n \to \mathbb{R}$ that most closely resembles their joint law. We can formalize this problem as follows.
We let $\eta \in \mathrm{Prob}(\mathbb{R}^n \times \mathbb{R})$ be the unknown joint law. We may select the squared difference $\ell(y, y') := (y - y')^2$ as our loss function. The resulting problem is then $P := (\mathbb{R}^n, \mathbb{R}, \eta, \ell, H)$, where $H$ is the collection of affine linear functions $\mathbb{R}^n \to \mathbb{R}$. The risk of an arbitrary predictor $h(x) = a^T x + b$ is the mean squared error of $h$:
$$R_P(h) = \int \ell(h(x), y)\, \eta(dx \times dy) = \int (a^T x + b - y)^2\, \eta(dx \times dy).$$
The constrained Bayes risk $B(P)$ is then the infimal mean squared error across all affine linear functionals $h \in H$.

Example 20 By making trivial choices for each component, one can define the one-point problems to be
$$P_\ast(c) := (\{\ast\}, \{\ast\}, \delta_{(\ast,\ast)}, c, \{\mathrm{Id}_\ast\})$$
for any $c \geq 0$. We will most often set $c = 0$ and use the notation $P_\ast := P_\ast(0)$. The constrained Bayes risk $B(P_\ast(c))$ is easily seen to be $c$, since the loss function of $P_\ast(c)$ is identically $c$.

3.2 Comparison with Blackwell's Statistical Experiments

Our definition of a supervised learning problem may seem superficially similar to the established notion of a statistical experiment, which is prevalent in mathematical statistics. Since there already exist established quantitative methods for comparing statistical experiments, such as Le Cam's deficiency distance (Le Cam, 1964; Mariucci, 2016), let us take a moment to contrast statistical experiments and supervised learning problems. Originally defined by Blackwell (1951), a statistical experiment consists of a measurable space $X$, a set of probability measures $\{P_\theta, \theta \in \Theta\}$ indexed by some set $\Theta$, and a distinguished $\theta \in \Theta$, thought to be the true parameter, or the parameter "chosen by nature". A practitioner is then tasked with estimating the true value of $\theta$ and hence the underlying probability measure. Supervised learning can be placed into Blackwell's framework (c.f.
Williamson and Cranko, 2024) by taking $X$ to be the input space of the problem, taking $\Theta$ to be the response space $Y$, and letting the map $y \mapsto P_y$ be the disintegration of the joint law $\eta \in \mathrm{Prob}(X \times Y)$ along $Y$.³ When predicting the label for a particular observation $x$, the true value $y_0$ represents the unknown true response, and we assume the observed input $x$ was drawn from the probability measure $P_{y_0}$. One's goal, of course, is to then estimate $y_0$ using an observed $x \in X$. A process for doing so defines a predictor $X \to Y$. Hence the supervised learning problem is encoded as a statistical experiment $E := (X, \{P_y \mid y \in Y\}, y_0)$.

In contrast, our supervised learning problems have three additional pieces of information: (1) the $Y$-marginal of $\eta$, (2) a loss function $\ell$ on the response space $Y$, and (3) an explicit family $H$ of functions $X \to Y$. Indeed, using the $Y$-marginal of $\eta$ and the disintegration $y \mapsto P_y$, we can recover the joint probability measure $\eta \in \mathrm{Prob}(X \times Y)$ via reverse-disintegration (see Section 2.5 for details). Hence from the statistical experiment and the listed additional information, one can recover all five components of the supervised learning problem $(X, Y, \eta, \ell, H)$.

3.3 Motivating Properties

Our goal is to quantitatively compare problems by developing a notion of distance on the collection of all problems. This notion will take the form of a pseudometric $d_R$ which we will call the Risk distance. To design such a distance, we might begin by deciding on some desirable properties. When we introduce $d_R$, we will see that it is characterized as the largest pseudometric satisfying these conditions. The first property that one needs to decide when designing a pseudometric is what its zero set should be. In other words, one needs to postulate a notion of isomorphism that

3. In the context of classification, the measure $P_y$ is called the class-conditional distribution.
we expect the notion of distance to be compatible with. We start by introducing one such notion.

Definition 21 Two problems $P$ and $P'$ are said to be strongly isomorphic if there exist measurable bijections
$$f_1 : X' \to X, \qquad f_2 : Y' \to Y,$$
and a bijection $\varphi : H' \to H$ between predictor sets such that the following conditions are satisfied:
1. $(f_1 \times f_2)_\sharp \eta' = \eta$,
2. For every $h' \in H'$, we have $\ell'_{h'} = \ell_{\varphi(h')} \circ (f_1 \times f_2)$.
Here the notation $f_1 \times f_2$ denotes the product map: $(f_1 \times f_2)(x', y') := (f_1(x'), f_2(y'))$. (We choose the notation $f_1 \times f_2$ over the alternative $(f_1, f_2)$ because the former emphasizes that each component of the output depends only on the corresponding component of the input.)

In other words, a strong isomorphism consists of bijections between the components of a problem that preserve the structure of the problem, and in particular preserve the performance of every predictor; if $f_1(x') = x$, $f_2(y') = y$ and $\varphi(h') = h$, then $\ell'_{h'}(x', y') = \ell_h(x, y)$.

Despite being natural, the notion of strong isomorphism is rather rigid. For instance, Condition 2 requires that an equality involving the predictors and loss hold for every point in $X' \times Y'$, even those outside of the support of $\eta'$. Such points could reasonably be considered irrelevant to the supervised learning problem, since an observation will appear there with probability 0. In order to motivate a more flexible notion of isomorphism which characterizes the pairs of problems that should be distance zero apart, we collect potential modifications to problems that one might reasonably consider immaterial.

Example 22 Consider the linear regression problem $P = (\mathbb{R}^n, \mathbb{R}, \eta, \ell, H)$ from Example 19. We chose $H$ to be the set of all affine linear functionals $\mathbb{R}^n \to \mathbb{R}$. As is often done, we could alternatively formulate the linear regression problem as follows. We place our data in $\mathbb{R}^{n+1}$ by setting the first coordinate of every observation to a constant 1.
Instead of an affine linear map, our task is now to select a (non-affine) linear map $\mathbb{R}^{n+1} \to \mathbb{R}$. Formally, we construct a problem
$$P' := (\mathbb{R}^{n+1}, \mathbb{R}, i_\sharp \eta, \ell, H'),$$
where $i : \mathbb{R}^n \to \mathbb{R}^{n+1}$ is given by $i(x_1, \ldots, x_n) := (1, x_1, \ldots, x_n)$, and $H'$ is the set of all (non-affine) linear functionals $\mathbb{R}^{n+1} \to \mathbb{R}$. The problems $P$ and $P'$ represent two common formulations of linear regression, the latter stated in terms of linear functionals, the former in terms of affine linear functionals. While $P$ and $P'$ are not strongly isomorphic, the formulations they represent are considered equivalent. Whatever our notion of distance between problems, it is desirable that $P$ and $P'$ be distance zero apart in order to reflect this equivalence.

Example 23 Consider the problem $P := (\mathbb{R}, \{a, b\}, \eta, \ell, H)$, where
- $\eta$ is any probability measure on $\mathbb{R} \times \{a, b\}$,
- $\ell$ is the 0-1 loss (also called the discrete metric) on $\{a, b\}$; namely, $\ell(a, a) = \ell(b, b) = 0$ and $\ell(a, b) = \ell(b, a) = 1$,
- $H$ consists of functions $\mathbb{R} \to \{a, b\}$ with finitely many discontinuities.
This $P$ represents a classification problem; a predictor $h \in H$ assigns to each input either the label $a$ or the label $b$. Now construct a second problem $P'$ by adding to $P$ a third label, say, $c$. We design the remaining components of $P'$ so as to make the labels $b$ and $c$ functionally identical. More precisely, we define $P' := (\mathbb{R}, \{a, b, c\}, \eta', \ell', H')$, where
- $\eta'$ is any measure such that $\eta$ is the pushforward of $\eta'$ under the map $\{a, b, c\} \to \{a, b\}$ sending $a \mapsto a$ and $b, c \mapsto b$,
- $\ell'$ extends $\ell$ by setting $\ell'(b, c) = \ell'(c, b) = \ell'(c, c) = 0$ and $\ell'(a, c) = \ell'(c, a) = 1$; that is, $\ell'$ considers $b$ and $c$ to be indistinguishable,
- $H'$ consists of all functions $\mathbb{R} \to \{a, b, c\}$ with finitely many discontinuities.
The problems $P$ and $P'$ are both classification problems. By design, $P'$ is $P$ with an extraneous label. If we solve $P$ by choosing a function $h \in H$, the same $h$ will provide a solution to $P'$ with identical performance.
Conversely, if we solve $P'$ by selecting some $h' \in H'$, we can produce an identically-performing solution for $P$ by composing $h'$ with the map sending $a \mapsto a$ and $b, c \mapsto b$. While $P$ and $P'$ are not strongly isomorphic, they differ only superficially, and a meaningful distance should set $P$ and $P'$ to be distance zero apart.

Examples 22 and 23 demonstrate that any of the five components of a problem can change without modifying the problem's essential nature. This motivates the following definition, which arises as a relaxation of Definition 21.

Definition 24 Given two problems $P$ and $P'$, we say that $P'$ is a simulation of $P$ (or that $P'$ simulates $P$), and denote it by $P' \rightsquigarrow P$, if there exist measurable functions
$$f_1 : X' \to X, \qquad f_2 : Y' \to Y,$$
such that the following are satisfied:
1. $(f_1 \times f_2)_\sharp \eta' = \eta$,
2. For every $h \in H$, there exists an $h' \in H'$ such that $\ell'_{h'} = \ell_h \circ (f_1 \times f_2)$ holds $\eta'$-almost everywhere. Similarly, for every $h' \in H'$, there exists an $h \in H$ such that the same identity holds $\eta'$-almost everywhere.

Note that if $P$ and $P'$ are strongly isomorphic, then $P$ and $P'$ are simulations of each other. It is, however, clear that the notion of simulation is weaker than strong isomorphism. Indeed, the former does not require the maps $f_1$ and $f_2$ to be bijections, nor does it require Condition 2 to be satisfied through a bijection between $H$ and $H'$. Furthermore, Condition 2 encodes an additional, more subtle, relaxation of the second condition of strong isomorphism: whereas the latter requires the identity to hold everywhere on $X' \times Y'$, the former only requires this to take place $\eta'$-almost everywhere within $X' \times Y'$.

In Example 22, the problem $P'$ is a simulation of $P$ via the projection map $\mathbb{R}^{n+1} \to \mathbb{R}^n$ that drops the first coordinate, and the identity map on the output space $\mathbb{R}$.
Likewise, in Example 23, $P'$ is a simulation of $P$ via the identity map on the input space $\mathbb{R}$ and the map on the output space sending $a \mapsto a$ and $b, c \mapsto b$.

The term "simulation" is meant to suggest that when a problem $P'$ is a simulation of $P$, then $P'$ is, in a sense, richer than $P$. The terminology is inspired by similar concepts appearing in the study of automata and Markov chains (Larsen and Skou, 1989; Milner, 1980, 1989). Motivated by Examples 22 and 23, we posit that if $P'$ is a simulation of $P$, then it is reasonable to consider $P$ and $P'$ to be functionally identical. As these examples demonstrate, if $P'$ simulates $P$, then $P'$ can be solved in terms of $P$ using the maps $f_1$ and $f_2$ which witness the simulation. That is, one could take an i.i.d. sample $\{(x'_i, y'_i)\}_{i=1}^n$ drawn from $\eta'$ and apply the maps $f_1$ and $f_2$ to the $x'_i$s and $y'_i$s respectively to get a sample $\{(x_i, y_i)\}_{i=1}^n$ in $X \times Y$. The pushforward condition in the definition of a simulation guarantees that the transformed sample will be an i.i.d. sample drawn from $\eta$. One could then solve $P$ by using this data to select an appropriate predictor $h \in H$, then solve $P'$ with a corresponding $h' \in H'$. Point 2 in the simulation definition implies that $\ell'(h'(x'_i), y'_i) = \ell(h(x_i), y_i)$. In other words, we are guaranteed that $h$ and $h'$ have identical performance on corresponding observations. For instance, both problems have the same constrained Bayes risk. Indeed, if $P'$ is a simulation of $P$ via the maps $f_1 : X' \to X$ and $f_2 : Y' \to Y$, then
$$B(P') = \inf_{h' \in H'} \int \ell'_{h'}(x', y')\, \eta'(dx' \times dy') = \inf_{h \in H} \int \ell_h(f_1(x'), f_2(y'))\, \eta'(dx' \times dy') = \inf_{h \in H} \int \ell_h(x, y)\, \big((f_1 \times f_2)_\sharp \eta'\big)(dx \times dy) = \inf_{h \in H} \int \ell_h(x, y)\, \eta(dx \times dy) = B(P).$$
This sort of relationship between problems, which allows one to solve one problem in terms of another, is reminiscent of John Langford's Machine Learning Reductions research program (Langford, 2009). If $P'$ is a simulation of $P$, we would like our distance $d_R$ to consider them identical.
Condition 1: If $P'$ is a simulation of $P$, then $d_R(P, P') = 0$.

Symmetrizing the simulation relationship produces a more general relationship which we call weak isomorphism, modeled after weak isomorphism of networks, defined by Chowdhury and Mémoli (2019, 2023).

Definition 25 Define two problems $P$ and $P'$ to be weakly isomorphic if there exists a third problem $\widetilde{P}$ that is a simulation of both $P$ and $P'$.

In other words, weak isomorphism between problems $P$ and $P'$ arises from the existence of a simultaneous simulation of both problems.⁴ It is clear that strong isomorphism implies weak isomorphism. As we will see below (Theorem 27 and Proposition 30), under fairly general conditions, the Risk distance $d_R(P, P')$ will vanish if and only if $P$ and $P'$ are weakly isomorphic.

4. Simulations, weak isomorphisms and their connections to the Risk distance suggest the existence of a category of problems in which simulations are morphisms, weak isomorphisms are spans, and from which the Risk distance can be derived. We leave the exploration of such a categorical perspective to future work.

The second desirable condition for $d_R$ is simpler to state. Let $P$ be a problem, and suppose that $P'$ is a problem with all the same components, except possibly for the loss function $\ell'$. If there exists $\alpha \geq 0$ such that, for all $h \in H$, we have
$$\int \big| \ell(h(x), y) - \ell'(h(x), y) \big|\, \eta(dx \times dy) \leq \alpha,$$
then we demand $d_R(P, P') \leq \alpha$.

Condition 2: If $P = (X, Y, \eta, \ell, H)$ and $P' = (X, Y, \eta, \ell', H)$, then
$$d_R(P, P') \leq \sup_{h \in H} \int \big| \ell(h(x), y) - \ell'(h(x), y) \big|\, \eta(dx \times dy).$$

3.4 The Risk Distance

The following is a pseudometric that satisfies Conditions 1 and 2.

Definition 26 Let $P$ and $P'$ be problems. For any coupling $\gamma \in \Pi(\eta, \eta')$ and correspondence $R \in \mathcal{C}(H, H')$, define the risk distortion of $\gamma$ and $R$ to be
$$\mathrm{dis}_{P,P'}(R, \gamma) := \sup_{(h,h') \in R} \int \big| \ell(h(x), y) - \ell'(h'(x'), y') \big|\, \gamma(dx \times dy \times dx' \times dy').$$
The Risk distance between $P$ and $P'$ is then given by
$$d_R(P, P') := \inf_{\substack{\gamma \in \Pi(\eta,\eta') \\ R \in \mathcal{C}(H,H')}} \mathrm{dis}_{P,P'}(R, \gamma).$$

The risk distortion (and subsequently the Risk distance) is so named because of its similarity to the risk of a predictor. If one believes that the most salient numerical descriptor of a predictor is the loss that it incurs, then it is natural to describe a predictor by its loss on an average observation. It is similarly natural to compare predictors by comparing their respective losses on an average observation. The idea behind the Risk distance is that we should consider two problems $P$ and $P'$ to be similar if we can find a correspondence between the predictor sets and a coupling between the underlying spaces such that pairs of corresponding predictors incur similar losses on corresponding observations.

Notably, the risk distortion and Risk distance are not invariant to scaling and shifting of the loss functions. While we often think of loss substitutions such as $\ell(y, y') \mapsto a\ell(y, y') + b$ for $a, b \in \mathbb{R}$, $a > 0$ as immaterial modifications in supervised learning, such changes do, in some sense, affect the problem. For instance, such a substitution will change the Bayes risk of the problem, so any notion of distance which controls the Bayes risk (see Theorem 37 below) cannot be invariant to such modifications. In order to use the Risk distance on specific problems, one must select loss functions which are meaningful in their specific context. For instance, one may elect to normalize their loss functions by subtracting the Bayes risk $B(P)$ or the minimal possible loss $\min_{y,y' \in Y} \ell(y, y')$, or by scaling the loss to have meaningful units such as bits or nats. Our formalism supports such decisions.

By design, the Risk distance is not a proper metric, since Condition 1 implies that distinct problems could be distance zero apart. This is desirable behavior.
Reformulating a problem in an equivalent way as in Example 22, or adding extraneous labels as in Example 23, should not move a problem at all under the Risk distance, since such modifications do not meaningfully change the problem.

Theorem 27 The function $d_R$ is a pseudometric on the collection $\mathcal{P}$ of all problems. Furthermore, $d_R(P, P') = 0$ whenever $P$ and $P'$ are weakly isomorphic.

That $d_R$ is a pseudometric entails that $d_R$ vanishes on the diagonal, that is, $d_R(P, P) = 0$ for all problems $P$; satisfies symmetry, that is, $d_R(P, P') = d_R(P', P)$ for all problems $P$ and $P'$; and satisfies the triangle inequality: $d_R(P, P'') \le d_R(P, P') + d_R(P', P'')$ for all problems $P$, $P'$, and $P''$. That $d_R$ satisfies the triangle inequality is nontrivial to prove, and the proof is relegated to Appendix A.

Not only does $d_R$ satisfy Conditions 1 and 2; it is characterized by them in the sense that it is uniquely determined by these conditions.

Theorem 28 The Risk distance satisfies Conditions 1 and 2. That is,
1. If $P'$ is a simulation of $P$, then $d_R(P, P') = 0$.
2. For any problems $P = (X, Y, \eta, \ell, H)$ and $P' = (X, Y, \eta, \ell', H)$ which differ only in their loss functions, we have
\[ d_R(P, P') \le \sup_{h \in H} \int \big| \ell(h(x), y) - \ell'(h(x), y) \big| \, \eta(dx \times dy). \]
Furthermore, the Risk distance is the largest pseudometric simultaneously satisfying Condition 1 and Condition 2.

Note that the claim from Theorem 27 that $d_R(P, P') = 0$ whenever $P$ and $P'$ are weakly isomorphic can now be obtained from Theorem 28 as follows. By Definition 25, there exists a problem $P''$ which is a joint simulation of both $P$ and $P'$. The claim then follows from the facts that $d_R$ satisfies both the triangle inequality and Condition 1, so that $d_R(P, P') \le d_R(P, P'') + d_R(P', P'') = 0$. The statement of Theorem 28 implicitly asserts uniqueness of the largest such pseudometric.
Indeed, if $d_1$ and $d_2$ were two different pseudometrics satisfying Conditions 1 and 2, their pointwise maximum $\hat d := \max(d_1, d_2)$ would also be a pseudometric satisfying Conditions 1 and 2, but would be larger than $d_1$ and $d_2$.\footnote{This would follow from the assumption that there exist $P, P' \in \mathcal{P}$ such that $d_1(P, P') \ne d_2(P, P')$.} The proof of this theorem can also be found in Appendix A.

While Theorem 27 states that weakly isomorphic problems are at distance zero, we can say more: weak isomorphism is equivalent to the existence of a coupling and correspondence between the problems with zero risk distortion.

Proposition 29 Let $P$ and $P'$ be problems. Then $P$ and $P'$ are weakly isomorphic if and only if there exist a correspondence $R \in C(H, H')$ and a coupling $\gamma \in \Pi(\eta, \eta')$ such that $\mathrm{dis}_{P,P'}(R, \gamma) = 0$.

The proof is in Appendix A. Note that if $\mathrm{dis}_{P,P'}(R, \gamma) = 0$ then $d_R(P, P') = 0$ by definition of the Risk distance, but the converse does not necessarily hold; it is possible that the infimal risk distortion is not achieved by any particular coupling and correspondence. Hence weak isomorphism between problems $P$ and $P'$ is generally stronger than saying $d_R(P, P') = 0$. We will discuss the existence of such optimal couplings and correspondences in Section 5.2, where we will prove that, under certain conditions, weak isomorphism is equivalent to identification under the Risk distance. While the statement (Corollary 61) is in terms of definitions we have not yet established, it is equivalent to the following:

Proposition 30 Let $P = (X, Y, \eta, \ell, H)$ and $P' = (X', Y', \eta', \ell', H')$ be problems. Suppose that for all $h \in H$, $h' \in H'$, the functions $\ell_h$ and $\ell'_{h'}$ are continuous almost everywhere (with respect to $\eta$ and $\eta'$ respectively), and the sets of functions $\{\ell_h \mid h \in H\} \subseteq L^1(\eta)$ and $\{\ell'_{h'} \mid h' \in H'\} \subseteq L^1(\eta')$ are compact under their respective $L^1$ norms. Then $d_R(P, P') = 0$ if and only if $P$ and $P'$ are weakly isomorphic.
The continuity condition is satisfied by a typical regression problem, such as Example 19, since the predictors and loss for such a problem tend to be continuous. A classification problem can satisfy the continuity condition as well under some assumptions on $\eta$. In particular, we need only assume that a random observation falls on a decision boundary with probability zero. We will discuss the continuity condition in more detail in Section 5.2.

The compactness condition is implied if one assumes that the problem's predictor set $H$ is parameterized by some compact parameter space $\Theta \to H$ and that the composition $\Theta \to H \to L^1(\eta)$ is continuous. Consider, for example, the problem $P = (\mathbb{R}, \{0, 1\}, \eta, \ell, H)$ where $\ell$ is the 0-1 loss and $H$ is the set of indicators $\mathbf{1}_{(a, \infty)}$ for $a$ in the compact interval $[-\infty, \infty]$. That is, $P$ is a binary classification problem for which the predictors are given by hard cutoff values. Under modest assumptions on $\eta$ (namely that its first marginal is absolutely continuous with respect to the Lebesgue measure), the composition $[-\infty, \infty] \to H \to L^1(\eta)$ is continuous, and hence $P$ satisfies the compactness condition. Under the same assumptions, each $\ell_h$ is also $\eta$-almost everywhere continuous. Hence $P$ satisfies both conditions of Proposition 30.

While classification problems are more likely to satisfy the compactness condition of Proposition 30, regression problems can be made to do so as well. Consider the linear regression problem in Example 19. The predictor set for that problem is parameterized by the non-compact space $\mathbb{R}^{n+1}$. However, it can be made compact with regularization. Indeed, regularized regression methods like ridge regression and lasso can be seen as restrictions of the parameter space to a compact region $A \subset \mathbb{R}^{n+1}$ (James et al., 2021, Sec 6.2). Linear regression problems with such restrictions will satisfy both the continuity and compactness conditions.
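The hard-cutoff example can be made concrete: for an empirical joint law $\eta$, the map $a \mapsto \ell_{h_a}$ into $L^1(\eta)$ is computable, and nearby cutoffs yield nearby loss maps, as the continuity claim suggests. The following Python sketch is our own illustration under the assumption of a uniform first marginal (all names hypothetical):

```python
import random

random.seed(0)
# Empirical joint law: x ~ Uniform[0, 1], with deterministic label y = 1 iff x > 0.5
sample = [(x, float(x > 0.5)) for x in (random.random() for _ in range(2000))]

def loss_map(a):
    # x -> 0-1 loss of the cutoff predictor h_a = 1_{(a, inf)} on each observation
    return [float(float(x > a) != y) for x, y in sample]

def l1_dist(a, b):
    # ||l_{h_a} - l_{h_b}||_{L1(eta)} under the empirical measure
    la, lb = loss_map(a), loss_map(b)
    return sum(abs(u - v) for u, v in zip(la, lb)) / len(sample)

# The two predictors disagree only for x in (0.4, 0.45], so the distance
# is roughly the eta-mass of that interval (about 0.05 here).
print(l1_dist(0.4, 0.45))
```

Shrinking the gap between the two cutoffs shrinks the $L^1(\eta)$ distance accordingly, which is the continuity of $[-\infty, \infty] \to H \to L^1(\eta)$ in empirical miniature.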
Adding a constant $\alpha > 0$ to the loss function of a problem moves it a distance of $\alpha$ under the Risk distance. That is, if $P$ is a problem with loss function $\ell$, and $P'$ is the same problem but with loss function $\ell + \alpha$, then $d_R(P, P') = \alpha$. This may appear to be undesirable behavior, since the losses $\ell$ and $\ell + \alpha$ are practically identical from an optimization point of view. However, adding the constant $\alpha$ to the loss function also increases the constrained Bayes risk of the problem by $\alpha$. From this perspective, it is reasonable to expect the distance between the two problems to be $\alpha$.

Example 31 Let $P$ be an arbitrary problem and $P_* = P_*(0)$ the one-point problem of Example 20. There is only one correspondence in $C(H, \{\mathrm{Id}_*\})$ and one coupling in $\Pi(\eta, \delta_{(*,*)})$, so one can compute
\[ d_R(P, P_*) = \mathrm{dis}_{P,P_*}\big( H \times \{\mathrm{Id}_*\}, \, \eta \otimes \delta_{(*,*)} \big) = \sup_{h \in H} \int \ell(h(x), y) \, \eta(dx \times dy). \]
Note the similarity with the constrained Bayes risk $B(P)$. While $B(P)$ is equal to the risk of an optimal predictor, the distance $d_R(P, P_*)$ is equal to the risk of a pessimal predictor. If we additionally assume that the loss $\ell$ is bounded above by a constant $\ell_{\max}$, one can similarly compute the distance from $P$ to the one-point problem $P_*(\ell_{\max})$ to be
\[ d_R(P, P_*(\ell_{\max})) = \sup_{h \in H} \int \big( \ell_{\max} - \ell(h(x), y) \big) \, \eta(dx \times dy) = \ell_{\max} - \inf_{h \in H} \int \ell(h(x), y) \, \eta(dx \times dy) = \ell_{\max} - B(P). \]
Hence $B(P) = \ell_{\max} - d_R(P, P_*(\ell_{\max}))$, showing that the constrained Bayes risk can be written in terms of the Risk distance if the loss is bounded.

Remark 32 The Risk distance is not designed to be explicitly computed for arbitrary pairs of problems. Much like the Gromov-Hausdorff distance, its usefulness lies in the stability results which it facilitates. However, for completeness, we briefly remark on computational considerations.
Given that the computation of the structurally similar but seemingly simpler Gromov-Hausdorff distance is known to be NP-hard (Agarwal et al., 2015; Schmiedl, 2017; Mémoli et al., 2023), we expect that exact computation of the Risk distance for problems with finite input and output spaces is also NP-hard in general. In this direction, we will soon prove, as Corollary 34, that deciding whether $d_R(P, P') = 0$ for arbitrary problems $P$ and $P'$ with finite input and output spaces is at least as hard as graph isomorphism.

In later sections, we will also explore two variants of the Risk distance that are constructed by tweaking the treatment of the correspondences $C(H, H')$ in Definition 26. First, by endowing the predictor sets $H$ and $H'$ with probability measures $\lambda$ and $\lambda'$, one can replace $C(H, H')$ with $\Pi(\lambda, \lambda')$ and the supremum over $R$ with an $L^p$-style integral over a coupling. This gives rise to the $L^p$-Risk distance introduced in Section 6. Alternatively, by restricting the set of possible correspondences $C(H, H')$ to the subset $C_{\mathrm{con}}(H, H')$ consisting only of those satisfying a certain topological condition, we arrive at the Connected Risk distance of Section 7.

3.5 Connections to the Gromov-Wasserstein Distance

The Hausdorff, Gromov-Hausdorff, Wasserstein, and Gromov-Wasserstein distances all follow a common paradigm. One first considers the space of all alignments between the objects in question, whether "alignment" means a correspondence between sets or a coupling between measures. One then settles on a way to numerically score the distortion of a correspondence or coupling, whether through a supremum or an integral. The distance between the objects is then the infimal possible score. The Risk distance follows the same paradigm. An alignment of two problems consists of a correspondence between their predictor sets and a coupling between their joint laws.
We use the loss function to numerically score the coupling and correspondence, then infimize over all choices. Indeed, if we think of a loss function as a notion of distance, then the definition of $d_R$ most closely resembles that of the Gromov-Wasserstein distance. The only extra component is a supremum over the predictor correspondence. To make the comparison more explicit, suppose $(X, d_X, \mu_X)$ and $(X', d_{X'}, \mu_{X'})$ are metric measure spaces. We can construct the following problems:
\[ P := (X, X, \mu_X \otimes \mu_X, d_X, \{\mathrm{Id}_X\}), \qquad P' := (X', X', \mu_{X'} \otimes \mu_{X'}, d_{X'}, \{\mathrm{Id}_{X'}\}). \]
We then write
\[ d_R(P, P') = \inf_{\gamma \in \Pi(\mu_X \otimes \mu_X, \, \mu_{X'} \otimes \mu_{X'})} \int \big| d_X(x, y) - d_{X'}(x', y') \big| \, \gamma(dx \times dy \times dx' \times dy'). \]
Here $\otimes$ represents the product of measures. This is exactly twice the Gromov-Wasserstein distance (Definition 8) except for one difference: the measure $\gamma$. The infimum runs over all of $\Pi(\mu_X \otimes \mu_X, \mu_{X'} \otimes \mu_{X'})$, while in the true Gromov-Wasserstein distance, $\gamma$ must take the specific form $\gamma = \sigma \otimes \sigma$ for some $\sigma \in \Pi(\mu_X, \mu_{X'})$. Indeed, the expression above (divided by 2) is identified as the second lower bound of the Gromov-Wasserstein distance by Mémoli (2007).

Another connection between the Gromov-Wasserstein and Risk distances can be drawn as well. Specifically, one can encode compact metric measure spaces $X$ as supervised learning problems $i(X)$ in a natural way which preserves the identifications induced by the two pseudometrics.

Proposition 33 Given a compact metric measure space $(X, d_X, \mu_X)$, define a problem $i(X) := (X, X, (\Delta_X)_\sharp \mu_X, d_X, H_X)$, where $\Delta_X : X \to X \times X$ is the diagonal map and $H_X$ is the set of all constant functions $X \to X$. If $X$ and $X'$ are metric measure spaces, then
\[ d_{GW,1}(X, X') = 0 \iff d_R(i(X), i(X')) = 0. \]

The proof can be found in Appendix A.1.1, where we also discuss a connection with a variant of the Gromov-Wasserstein distance by Hang et al. (2019, Section 4.1) that is recovered as $d_R(i(X), i(Y))$.
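The objective $\int |d_X(x, y) - d_{X'}(x', y')|\, d\gamma$ above can be summed directly on finite metric measure spaces, and a measure-preserving isometry induces a coupling with zero cost. The Python sketch below (our own illustration, with hypothetical names) evaluates the cost of the product coupling induced by a bijection, for a 3-point space and a relabeled copy:

```python
import itertools

def relaxed_gw_cost(dX, dX2, muX, perm):
    """Cost of the coupling gamma induced by the bijection i -> perm[i],
    i.e. gamma = sigma (x) sigma with sigma supported on the graph of perm."""
    n = len(muX)
    return sum(
        muX[i] * muX[j] * abs(dX[i][j] - dX2[perm[i]][perm[j]])
        for i, j in itertools.product(range(n), repeat=2)
    )

# A 3-point metric space and a relabeled copy (relabeling 0->2, 1->0, 2->1).
dX = [[0, 1, 2],
      [1, 0, 1],
      [2, 1, 0]]
perm = [2, 0, 1]
dX2 = [[0] * 3 for _ in range(3)]
for i, j in itertools.product(range(3), repeat=2):
    dX2[perm[i]][perm[j]] = dX[i][j]

mu = [1 / 3, 1 / 3, 1 / 3]
print(relaxed_gw_cost(dX, dX2, mu, perm))       # 0.0 for the isometry
print(relaxed_gw_cost(dX, dX2, mu, [0, 1, 2]))  # > 0 for a mismatched map
```

The zero cost for the isometric relabeling is the "if" direction of the folklore argument discussed next; the relaxed infimum over all of $\Pi(\mu_X \otimes \mu_X, \mu_{X'} \otimes \mu_{X'})$ would of course require solving a linear program.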
It is a folklore result in optimal transport that deciding whether $d_{GW,1}(X, X') = 0$ for finite metric measure spaces $X$ and $X'$ is at least as hard as the graph isomorphism problem. This is not difficult to prove; let $G = (V, E)$ and $G' = (V', E')$ be finite simple graphs. We can encode $G$ as a metric measure space $\mathcal{V} = (V, d_V, \mu_V)$, where $d_V$ is the shortest path metric and $\mu_V$ is the uniform probability measure. Similarly construct $\mathcal{V}' = (V', d_{V'}, \mu_{V'})$. Then $d_{GW,1}(\mathcal{V}, \mathcal{V}') = 0$ if and only if $G$ and $G'$ are isomorphic. Indeed, if $G$ and $G'$ are isomorphic, then the isomorphism induces the requisite coupling between $\mu_V$ and $\mu_{V'}$. Conversely, if $d_{GW,1}(\mathcal{V}, \mathcal{V}') = 0$, then $\mathcal{V}$ and $\mathcal{V}'$ are isomorphic as metric measure spaces, and in particular are isometric as metric spaces. An isometry between $\mathcal{V}$ and $\mathcal{V}'$ with respect to the shortest path metrics is in fact a graph isomorphism. Hence Proposition 33 yields the following corollary.

Corollary 34 The problem of deciding whether $d_R(P, P') = 0$ for two problems with finite input and response spaces is at least as hard as graph isomorphism.

It would be desirable to draw a stronger connection between the Risk distance and the Gromov-Wasserstein distance by finding a function $i$ such that $d_{GW,1}(X, X') = d_R(i(X), i(X'))$. We will nearly accomplish this goal in Section 6.2 by linking the $L^p$-Risk distance with a slightly modified Gromov-Wasserstein distance which is known in the literature.

4. Stability

One major use of the Risk distance is to facilitate stability results. By "stability result," we mean a result of one of the two following forms. Suppose we modify a problem $P$ via some process to produce another problem $f(P)$. If every problem $P$ satisfies an inequality of the form $d_R(P, f(P)) \le J(f)$ for some fixed functional $J$ which measures how dissimilar $f$ is from the identity function $\mathrm{id} : \mathcal{P} \to \mathcal{P}$ and satisfies $J(\mathrm{id}) = 0$, we say that the Risk distance is stable under $f$.
Stability results of this form assure us that small modifications to a problem do not move that problem far under $d_R$. For instance, Condition 2 from Section 3.3 is a stability result. In particular, Condition 2 implies that $d_R$ is stable under changes to the loss function.

Alternatively, a stability result could mean the following. Suppose that, from any problem $P$, we can compute a descriptor $g(P)$ which lies in a metric space $(Z, d_Z)$. We say that $g$ is stable under $d_R$ if there is some non-negative constant $C$ satisfying $d_Z(g(P), g(P')) \le C \, d_R(P, P')$ for all problems $P$ and $P'$. Stability results of this form act as quantitative guarantees that problems close under $d_R$ have similar descriptors.

4.1 Stability of the Constrained Bayes Risk and Loss Profile

From a machine learning perspective, the most important feature of a predictor is the loss that it incurs. This information can be captured by a predictor's loss profile, a probability measure on the real line describing the loss that a predictor incurs on a random observation. If $P$ is a problem and $h \in H$ is a predictor, the loss profile of $h$ is the pushforward $(\ell_h)_\sharp \eta \in \mathrm{Prob}(\mathbb{R})$. Note that the loss profile of $h$ is the distribution of the random variable $\ell(h(X), Y)$, where $X$ and $Y$ are distributed according to $\eta$. This is closely related to the random variable $\ell(h(X), Y) - B(P)$, called the excess loss random variable in the literature, differing only by an additive constant. The distribution of the excess loss is related to convergence rates of empirical risk minimization (Van Erven et al., 2015; Grünwald and Mehta, 2020). Letting $h$ range over all of $H$, we get an informative description of a problem.

Definition 35 Given a problem $P$, the loss profile set of $P$ is the set $L(P) := \{(\ell_h)_\sharp \eta \mid h \in H\}$.

Note that the loss profile set $L(P)$ of $P$ contains enough information to recover the constrained Bayes risk.
This follows immediately from the definition (we omit the proof).

Proposition 36 Given a problem $P$, one has $B(P) = \inf_{\alpha \in L(P)} \mathrm{mean}(\alpha)$.

Since probability measures on the real line, such as loss profiles, can be compared using the Wasserstein metric $d_{W,1}$, collections of probability measures on $\mathbb{R}$ form subsets of the metric space $(\mathrm{Prob}(\mathbb{R}), d_{W,1})$, and therefore can be compared using the Hausdorff distance $d_H$ with underlying metric $d_{W,1}$. That is, if $A, B \subseteq \mathrm{Prob}(\mathbb{R})$, the Hausdorff distance between $A$ and $B$ is given by
\[ d_H^{d_{W,1}}(A, B) = \inf_{R \in C(A, B)} \sup_{(\mu, \nu) \in R} d_{W,1}(\mu, \nu). \]
Now that we have a way to compare loss profile sets, we can prove their stability under $d_R$. The theorem below establishes this stability property and, in addition, also proves that the constrained Bayes risk is stable in the sense of the Risk distance.

Theorem 37 (Stability of the Constrained Bayes Risk and Loss Profile) If $P$ and $P'$ are problems, then
\[ |B(P) - B(P')| \le d_H^{d_{W,1}}(L(P), L(P')) \le d_R(P, P'). \]

The proof, which is relegated to Appendix A, proceeds in two steps. We establish the first inequality using the formulation of the constrained Bayes risk presented in Proposition 36. To prove the second inequality, we notice that the quantity in the middle defines a pseudometric on the collection of all problems, then apply the maximality claim in Theorem 28 to show that the Risk distance is larger. By ignoring the middle quantity in Theorem 37, we obtain stability of the constrained Bayes risk.

Example 38 (Constrained Bayes Risk and Bounded Loss) Here we remark that if the problems $P$ and $P'$ have bounded loss functions, then the stability of the constrained Bayes risk, namely that $|B(P) - B(P')| \le d_R(P, P')$, can be easily and directly obtained through arguments involving the metric properties of the Risk distance.
Indeed, if $c > \max(\sup \ell, \sup \ell')$, then, by Example 31 and the triangle inequality for the Risk distance, we have that
\[ c - B(P) = d_R(P, P_*(c)) \le d_R(P, P') + d_R(P', P_*(c)) = d_R(P, P') + c - B(P'), \]
so that $B(P') - B(P) \le d_R(P, P')$. By swapping the roles of $P$ and $P'$ we obtain the inequality $B(P) - B(P') \le d_R(P', P)$, which, together with the fact that $d_R(P', P) = d_R(P, P')$, implies the stability claim. Note that Theorem 37 implies the same claim without assuming the boundedness condition on the loss functions.

4.2 Stability of Rademacher Complexity

In statistical learning theory, the Rademacher complexity quantifies the flexibility of a predictor set and controls generalization error (Mohri et al., 2018). We establish the standard definition of the Rademacher complexity of a function class, along with a related version stated in the language of supervised learning problems. In what follows, if $\mu$ is a measure on a measurable space $X$, let $\mu^{\otimes m}$ represent the $m$-fold product measure $\mu \otimes \cdots \otimes \mu \in \mathrm{Prob}(X^m)$.

Definition 39 Let $m \ge 1$. Given a probability space $(X, \mu)$ and a family $F$ of measurable real-valued functions on $X$, the classical $m$th Rademacher complexity of $F$ is given by
\[ R_m(F) := \int \int \sup_{f \in F} \frac{1}{m} \sum_{i=1}^m \sigma_i f(x_i) \, \mathrm{Rad}^{\otimes m}(d\sigma) \, \mu^{\otimes m}(dx), \]
where $\mathrm{Rad}$ is the Rademacher distribution (that is, the uniform distribution on $\{-1, 1\}$), $\sigma = (\sigma_1, \ldots, \sigma_m)$, and $x = (x_1, \ldots, x_m)$. Similarly, the $m$th Rademacher complexity of a problem $P$ is given by
\[ R_m(P) := \int \int \sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i \ell_h(x_i, y_i) \, \mathrm{Rad}^{\otimes m}(d\sigma) \, \eta^{\otimes m}(dx \times dy). \]
That is, $R_m(P)$ is the $m$th Rademacher complexity of the function class $\{\ell_h \mid h \in H\}$.

The Rademacher complexity of a problem $P$ is high if, given a random sample $((x_i, y_i))_{i=1}^m$ of $m$ observations and a vector $\sigma$ of random $\pm 1$ noise, we can typically find some predictor $h$ whose vector of losses $(\ell_h(x_i, y_i))_{i=1}^m$ is highly correlated with the noise $\sigma$.
This would suggest that $H$ is quite flexible, and possibly that empirical risk minimization over $H$ runs the risk of overfitting on a random sample of size $m$. For a fixed $m$, the Rademacher complexity $R_m$ need not be stable under the Risk distance.

Example 40 For $n \ge 2$, define the problem $P_n = ([n], \{0, 1\}, \mathrm{unif}([n]) \otimes \delta_0, \ell, H_n)$, where $[n] := \{1, \ldots, n\}$, $\ell$ is the 0-1 loss, and $H_n$ is the set of indicator functions of singletons $\mathbf{1}_k : [n] \to \{0, 1\}$ for $k \in [n]$. Recall also the one-point problem $P_* = P_*(0)$ from Example 20. We claim that $R_1(P_*) = 0$ and $R_1(P_n) = 1/2$ for all $n$, but $d_R(P_n, P_*) \to 0$ as $n \to \infty$. This shows that the Rademacher complexity $R_1$ is not continuous with respect to the Risk distance. One can easily see that $R_1(P_*) = 0$, since the loss function of $P_*$ is identically 0. We can also easily compute
\[ R_1(P_n) = \int \Big( \tfrac{1}{2} \sup_{k \in [n]} \ell(\mathbf{1}_k(x), y) - \tfrac{1}{2} \inf_{k \in [n]} \ell(\mathbf{1}_k(x), y) \Big) \, (\mathrm{unif}([n]) \otimes \delta_0)(dx \times dy) = \int \big( \tfrac{1}{2}(1) - \tfrac{1}{2}(0) \big) \, (\mathrm{unif}([n]) \otimes \delta_0)(dx \times dy) = \tfrac{1}{2}. \]
Meanwhile, since there is only one coupling with the point mass $\delta_{(*,*)}$ and only one correspondence with the singleton $\{\mathrm{Id}_*\}$, we can compute
\[ d_R(P_n, P_*) = \mathrm{dis}_{P_n, P_*}\big( H_n \times \{\mathrm{Id}_*\}, \, \mathrm{unif}([n]) \otimes \delta_0 \otimes \delta_{(*,*)} \big) = \sup_{k \in [n]} \int |\ell(\mathbf{1}_k(x), 0) - 0| \, \mathrm{unif}([n])(dx) = \sup_{k \in [n]} \int \mathbf{1}_k(x) \, \mathrm{unif}([n])(dx) = \frac{1}{n}. \]
Hence, even though $P_n \to P_*$ in the Risk distance, the Rademacher complexities $R_1(P_n)$ do not converge to $R_1(P_*)$.

Example 40 shows that $R_1$ is not even continuous with respect to the Risk distance. This suggests more generally that we should not hope for straightforward stability of $R_m$ under the Risk distance when $m$ is small. If we instead take $m$ to be arbitrarily large, we can still obtain a kind of stability.

Theorem 41 (Stability of Rademacher Complexity) Let $P$ and $P'$ be problems and consider the family
\[ F := \big\{ \, |\ell_h(\cdot, \cdot) - \ell'_{h'}(\cdot, \cdot)| \;\big|\; h \in H, \, h' \in H' \, \big\} \]
of functions $X \times Y \times X' \times Y' \to \mathbb{R}$. Then
\[ \big| R_m(P) - R_m(P') \big| \le d_R(P, P') + 2 R_m(F). \]
If $P$ and $P'$ have bounded loss functions and $F$ forms a universal Glivenko-Cantelli class, then
\[ \limsup_{m \to \infty} \big| R_m(P) - R_m(P') \big| \le d_R(P, P'). \]

The proof can be found in Appendix A. The conditions under which $R_m(F) \to 0$ as $m \to \infty$, and the rate of said convergence, are classical subjects of research in statistical learning theory. Results in this direction typically provide convergence rates of order $O(\sqrt{1/m})$ (Bartlett and Mendelson, 2002) or $O(\sqrt{\ln(m)/m})$ (Mohri et al., 2018, Chapter 3) under specific assumptions.

One can apply Proposition 13 to produce a weaker form of Theorem 41 which does not reference the Glivenko-Cantelli property. Specifically, we can replace the universal Glivenko-Cantelli condition in Theorem 41 with the following pair of sufficient conditions:

1. The sets $H$ and $H'$ can be endowed with Polish topologies, and subsequently Borel $\sigma$-algebras, such that the maps $(h, x, y) \mapsto \ell_h(x, y)$ and $(h', x', y') \mapsto \ell'_{h'}(x', y')$ are measurable.

2. The family of subgraphs $\{\mathrm{sg}(g_{h,h'}) \mid h \in H, h' \in H'\}$ (see Definition 12) has finite VC dimension, where $g_{h,h'}(x, y, x', y') := |\ell_h(x, y) - \ell'_{h'}(x', y')|$.

In fact, we can state an even simpler sufficient condition involving the two problems $P$ and $P'$ separately.

Corollary 42 Let $P$ and $P'$ be problems with bounded loss functions. Assume that $H$ and $H'$ are equipped with Polish topologies such that the maps $(h, x, y) \mapsto \ell_h(x, y)$ and $(h', x', y') \mapsto \ell'_{h'}(x', y')$ are measurable. If both $\{\ell_h \mid h \in H\}$ and $\{\ell'_{h'} \mid h' \in H'\}$ lie in finite-dimensional subspaces of $L^1(X \times Y)$ and $L^1(X' \times Y')$ respectively, then
\[ \limsup_{m \to \infty} \big| R_m(P) - R_m(P') \big| \le d_R(P, P'). \]

The subspace condition in Corollary 42, while simple, may seem strange. We claim the condition is not as odd as it may at first appear.

Example 43 Consider the generic linear regression problem $P$ from Example 19. For that problem, the functions $\ell_h$ are of the form $\ell_h(x, y) = (a^T x + b - y)^2$ for $a \in \mathbb{R}^n$, $b \in \mathbb{R}$.
These functions are polynomials in $x$ and $y$ of degree at most 2. Hence the collection $\{\ell_h \mid h \in H\} \subseteq L^1(\mathbb{R}^n \times \mathbb{R})$ lies within the finite-dimensional subspace of polynomials of degree at most 2, and $P$ satisfies the subspace condition of Corollary 42. More generally, any problem $P = (\mathbb{R}^n, \mathbb{R}^m, \eta, \ell, H)$ will satisfy the subspace condition if $\ell$ is a polynomial loss function and $H$ is of the form $H = \{a_1 f_1(x) + \cdots + a_k f_k(x) \mid a_1, \ldots, a_k \in \mathbb{R}\}$ for some fixed $f_1, \ldots, f_k : \mathbb{R}^n \to \mathbb{R}^m$. This is because the set $\{\ell_h \mid h \in H\}$ will fall within a finite-dimensional space of polynomials in the variables $f_1, \ldots, f_k$ and $y$.

Proof [Corollary 42] Let $S := \mathrm{span}\big( \{\ell_h(\cdot, \cdot) - \ell'_{h'}(\cdot, \cdot) \mid h \in H, h' \in H'\} \big)$, where the span is taken in the space of all measurable functions $X \times Y \times X' \times Y' \to \mathbb{R}$. Consider $\max(S, S) := \{\max(f, g) \mid f, g \in S\}$, where the maximum $\max(f, g)$ is taken pointwise. Since $S$ is finite-dimensional, it follows (Van der Vaart and Wellner, 1996, Lemma 2.6.15 and Lemma 2.6.18(ii)) that the collection of subgraphs of functions in $\max(S, S)$ has finite VC dimension. The set $\{g_{h,h'} \mid h \in H, h' \in H'\}$ lies in $\max(S, S)$, so Proposition 13 finishes the proof.

4.3 Stability under Sampling Bias

A fundamental source of error in learning problems of any type is bias in the collection of training data. Suppose $\eta$ is the true joint law that we are trying to approximate with a predictor. While we would like to sample from $\eta$ directly, an imperfect collection process may over-represent certain regions of the data space. One would hope that minor bias would have only a minor effect on the problem. More formally, we might ask whether the Risk distance is stable under small modifications to a problem's joint law. This section provides conditions for such stability results. The Risk distance is stable under changes to the joint law with respect to the Wasserstein distance with a certain underlying metric.
Under certain conditions, it is also stable with respect to the total variation distance (Definition 9). Define the pseudometric $s_{\ell,H}$ on $X \times Y$ by
\[ s_{\ell,H}((x, y), (x', y')) := \sup_{h \in H} \big| \ell(h(x), y) - \ell(h(x'), y') \big|. \]

Proposition 44 Let $P$ be a problem, and let $P'$ be a problem that is identical to $P$ except possibly for its joint law $\eta'$. Then
\[ d_R(P, P') \le d_{W,1}^{s_{\ell,H}}(\eta, \eta'). \]
Suppose furthermore that the loss function $\ell$ is bounded above by some constant $\ell_{\max}$. Then
\[ d_R(P, P') \le \ell_{\max} \, d_{TV}(\eta, \eta'). \]

Proof One can bound $d_R(P, P')$ by choosing the diagonal correspondence between $H$ and itself to get
\[ d_R(P, P') \le \inf_{\gamma \in \Pi(\eta, \eta')} \sup_{h \in H} \int \big| \ell(h(x), y) - \ell(h(x'), y') \big| \, \gamma(dx \times dy \times dx' \times dy') \le \inf_{\gamma \in \Pi(\eta, \eta')} \int \sup_{h \in H} \big| \ell(h(x), y) - \ell(h(x'), y') \big| \, \gamma(dx \times dy \times dx' \times dy') = d_{W,1}^{s_{\ell,H}}(\eta, \eta'), \]
proving the first part of the proposition. Now, if $\ell$ is bounded by $\ell_{\max}$, then $s_{\ell,H}$ is bounded by $\ell_{\max} d$, where $d$ is the discrete metric. Hence $d_{W,1}^{s_{\ell,H}}(\eta, \eta') \le d_{W,1}^{\ell_{\max} d}(\eta, \eta') = \ell_{\max} \, d_{TV}(\eta, \eta')$, as desired.

In the presence of sampling bias, the data collection process gives inordinate weight to some regions of the data space $X \times Y$ at the cost of other regions. We can think of this biased data collection as sampling not from the true probability measure $\eta$, but instead from a reweighted probability measure $f\eta$ given by
\[ (f\eta)(A) := \int_A f(x, y) \, \eta(dx \times dy) \]
for all measurable $A$, where $f : X \times Y \to \mathbb{R}_{\ge 0}$ is a density function with respect to $\eta$, meaning $f$ is a measurable function with $\int f(x, y) \, \eta(dx \times dy) = 1$. Here $f$ is larger than 1 in regions of $X \times Y$ that are over-represented by the data collection process, and less than 1 in regions that are under-represented. In extreme cases of sampling bias, some regions of the data space $X \times Y$ may be completely missing from the training data. In this case, instead of being sampled from the true joint law $\eta$, the data is sampled from some subset $A \subseteq X \times Y$ with $\eta(A) > 0$ according to the probability measure $\eta|_A(B) := \frac{1}{\eta(A)} \eta(A \cap B)$.
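Both bias models can be written down explicitly on a finite data space. The sketch below (our own illustration; the space, density, and subset are all hypothetical) builds the reweighted law $f\eta$ and the restricted law $\eta|_A$, and computes the total variation distance $d_{TV}(p, q) = \sup_B |p(B) - q(B)|$, which on a finite space equals half the $\ell^1$ distance between the mass functions:

```python
# Finite data space Z = {0, 1, 2, 3} with uniform true law eta.
Z = [0, 1, 2, 3]
eta = {z: 0.25 for z in Z}

# Over-represent {0, 1}, under-represent {2, 3}: a density f w.r.t. eta.
f = {0: 1.5, 1: 1.5, 2: 0.5, 3: 0.5}
assert abs(sum(f[z] * eta[z] for z in Z) - 1.0) < 1e-12  # f integrates to 1

f_eta = {z: f[z] * eta[z] for z in Z}  # biased law f.eta

# Extreme bias: only the region A = {0, 1} is ever observed.
A = {0, 1}
eta_A = {z: (eta[z] / sum(eta[a] for a in A) if z in A else 0.0) for z in Z}

def tv(p, q):
    # d_TV = sup_B |p(B) - q(B)| = half the l1 distance on a finite space
    return 0.5 * sum(abs(p[z] - q[z]) for z in Z)

print(tv(eta, f_eta))  # 0.25 = (1/2) * integral of |1 - f| d(eta)
print(tv(eta, eta_A))  # 0.5  = 1 - eta(A)
```

The two printed values are exactly the quantities $\tfrac12 \int |1 - f|\, d\eta$ and $1 - \eta(A)$ that appear in the stability bounds that follow.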
Using Proposition 44, we can show that $d_R$ is stable under the above formulations of sampling bias.

Corollary 45 (Stability Under Sampling Bias) Let $P$ be a problem whose loss function $\ell$ is bounded above by $\ell_{\max}$.
(1) Let $f : X \times Y \to \mathbb{R}_{\ge 0}$ be a density function with respect to $\eta$, and define $P_f := (X, Y, f\eta, \ell, H)$. Then
\[ d_R(P, P_f) \le \frac{\ell_{\max}}{2} \int \big| 1 - f(x, y) \big| \, \eta(dx \times dy). \]
(2) Let $A \subseteq X \times Y$ be a measurable subset with $\eta(A) > 0$, and define $P|_A := (X, Y, \eta|_A, \ell, H)$. Then
\[ d_R(P, P|_A) \le \ell_{\max} \big( 1 - \eta(A) \big). \]

Proof First recall that
\[ d_{TV}(\eta, f\eta) = \sup_{B \subseteq X \times Y} \big( \eta(B) - f\eta(B) \big) = \sup_{B \subseteq X \times Y} \big( f\eta(B) - \eta(B) \big), \]
where the suprema range over all measurable subsets of $X \times Y$. Then
\[ d_{TV}(\eta, f\eta) = \frac{1}{2} \Big( \sup_{B} \big( \eta(B) - f\eta(B) \big) + \sup_{B} \big( f\eta(B) - \eta(B) \big) \Big) = \frac{1}{2} \Big( \sup_{B} \int_B \big( 1 - f(x, y) \big) \, \eta(dx \times dy) + \sup_{B} \int_B \big( f(x, y) - 1 \big) \, \eta(dx \times dy) \Big) \]
\[ = \frac{1}{2} \Big( \int_{\{f < 1\}} \big( 1 - f(x, y) \big) \, \eta(dx \times dy) + \int_{\{f > 1\}} \big( f(x, y) - 1 \big) \, \eta(dx \times dy) \Big) = \frac{1}{2} \int \big| 1 - f(x, y) \big| \, \eta(dx \times dy), \]
and part (1) follows from Proposition 44. To finish the proof, apply part (1) with $f = \frac{1}{\eta(A)} \mathbf{1}_A$.

4.4 Stability under Sampling Noise

Another source of error in sampling is the presence of noise in the process of collecting data. While we would like to sample from $\eta$ directly, our data is rarely collected with perfect fidelity. One would hope that a small amount of noise would not change the problem much. In this section, we model noise with a noise kernel and show that we can often guarantee that the Risk distance is stable under the application of such noise. First we consider the case where the loss function $\ell$ is bounded by a metric on $Y$. Our bounds are in terms of a generalization of the Wasserstein metric for Markov kernels.

Definition 46 Let $X$ be a measure space and let $M : X \to \mathrm{Prob}(Y)$, $N : X \to \mathrm{Prob}(Z)$ be Markov kernels. A coupling between $M$ and $N$ is a Markov kernel $\tau : X \to \mathrm{Prob}(Y \times Z)$ such that for a.e. $x \in X$,
\[ (\pi_1)_\sharp(\tau(x)) = M(x), \qquad (\pi_2)_\sharp(\tau(x)) = N(x). \]
Just as we use $\Pi(\mu, \nu)$ to denote the set of all couplings of probability measures $\mu$ and $\nu$, we use $\Pi(M, N)$ to denote the set of all couplings of Markov kernels $M$ and $N$.
Note that if $X$ is a single point, we recover the classical definition of a coupling of probability measures.

Definition 47 (Wasserstein Distance for Markov Kernels) Let $(X, \mu)$ be a measure space and let $(Y, d_Y)$ be a metric space. Let $M, N : X \to \mathrm{Prob}(Y)$ be a pair of Markov kernels. For $1 \le p < \infty$, the $p$-Wasserstein distance between $M$ and $N$ is given by
\[ d_{W,p}(M, N) := \Big( \inf_{\tau \in \Pi(M, N)} \int_X \int_{Y \times Y} d_Y^p(y, y') \, \tau(x)(dy \times dy') \, \mu(dx) \Big)^{1/p}. \]

Definitions 46 and 47 are due to Patterson (2020, Defs 5.1 and 5.2). Patterson notes that terminology conflicts between products and couplings of Markov kernels are present in the literature. As with the Wasserstein distance, we will have little use for $d_{W,p}$ for $p \ne 1$, and so establish the convention that "the Wasserstein distance for Markov kernels" refers to $d_{W,1}$. An alternative, equivalent definition of the metric $d_{W,1}$ may be more intuitive:
\[ d_{W,1}(M, N) = \int_X d_{W,1}^{d_Y}(M(x), N(x)) \, \mu(dx). \]
The equivalence of these definitions was noted by Patterson (2020), and follows from results in the theory of optimal transport (Villani, 2008, Corollary 5.22). A distance based on a generalization of the above alternative definition was independently discovered and studied by Kitagawa and Takatsu (2023) under the name "disintegrated Monge-Kantorovich distance."

We are abusing notation by using the same symbol for the Wasserstein distances between measures and between Markov kernels. We justify this by noting that the former is a special case of the latter. Indeed, recall that if $\mu, \nu \in \mathrm{Prob}(X)$ are probability measures on a metric space $X$, we can think of $\mu$ and $\nu$ as Markov kernels from the one-point measure space $(\{*\}, \delta_*)$. The Wasserstein distance between these Markov kernels then agrees with the Wasserstein distance between the measures $\mu$ and $\nu$.

The Wasserstein distance for Markov kernels also generalizes the $L^1$ distance on measurable functions.
Indeed, if we have measurable functions $f, g : X \to Y$, then the distance $d_{W,1}(\delta_f, \delta_g)$ between the induced deterministic Markov kernels is just the $L^1$ distance between $f$ and $g$.

A simple kind of noise can be modeled with a Markov kernel $M : Y \to \mathrm{Prob}(Y)$. We interpret this noise as follows: when we make an observation that would have label $y \in Y$, we instead observe a random corrupted label with law $M(y)$. More specifically, let $P$ be a problem whose joint law has $X$-marginal $\alpha$ and $X$-disintegration $\beta : X \to \mathrm{Prob}(Y)$. Without the presence of noise, an observation has a random input component $x$ chosen with law $\alpha$, and a random label chosen with law $\beta(x)$. The presence of noise represented by $M$ creates a new problem where the labels are instead chosen according to the law $(\beta M)(x)$, where $\beta M$ denotes the composition of the kernels $\beta$ and $M$. In other words, we get a problem with the same $X$-marginal $\alpha$, but with disintegration $\beta' : X \to \mathrm{Prob}(Y)$ given by $\beta'(x) := (\beta M)(x)$. We might call this type of noise "$X$-independent $Y$-noise" (otherwise known as simple label noise), since the noise $M$ is applied in the $Y$ direction, and the same noise kernel $M$ is applied for all $x \in X$.

A more general kind of noise would allow for dependence on $X$. Instead of a Markov kernel $M : Y \to \mathrm{Prob}(Y)$, we model our noise as a Markov kernel $N : X \times Y \to \mathrm{Prob}(Y)$. We write $N_x$ to mean the Markov kernel $N(x, \cdot) : Y \to \mathrm{Prob}(Y)$. In the presence of this kind of noise, the problem $P$ becomes a new problem $P'$ with the same $X$-marginal $\alpha$, but with $X$-disintegration $\beta'$ now given by $\beta'(x) := (\beta N_x)(x)$. We might call this kind of noise "$X$-dependent $Y$-noise."

Consider the noise kernel $N = \delta_{\pi_2}$, which is the Markov kernel sending $(x, y)$ to the point mass $\delta_y$. Then for all $x \in X$ we have $N_x(y) = \delta_y$ and hence $\beta N_x = \beta$. In other words, $N$ represents no noise at all, and applying $N$ to a problem would leave the problem unchanged.
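On finite $X$ and $Y$, a disintegration and a noise kernel are row-stochastic matrices, and the composition $\beta N_x$ is a matrix-vector product. The small sketch below (our own notation, not code from the paper) applies a noise kernel to a disintegration and confirms that $N = \delta_{\pi_2}$ leaves $\beta$ unchanged:

```python
# Y = {0, 1}; disintegration beta: X -> Prob(Y), one probability row per x.
beta = {"x0": [0.9, 0.1], "x1": [0.2, 0.8]}

def apply_noise(beta, N):
    """beta'(x) = (beta N_x)(x): draw y ~ beta(x), then corrupt y via N_x(y)."""
    out = {}
    for x, py in beta.items():
        Nx = N(x)  # row-stochastic 2x2 matrix over Y: Nx[y][y2] = P(y -> y2)
        out[x] = [sum(py[y] * Nx[y][y2] for y in range(2)) for y2 in range(2)]
    return out

identity_noise = lambda x: [[1.0, 0.0], [0.0, 1.0]]  # N = delta_{pi_2}: no noise
flip = lambda x: [[0.7, 0.3], [0.3, 0.7]]            # flip each label w.p. 0.3

print(apply_noise(beta, identity_noise))  # identical to beta
print(apply_noise(beta, flip))
```

Making the matrix returned by `N` depend on `x` gives exactly the $X$-dependent $Y$-noise model; the constant choice above is the $X$-independent special case.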
Our next stability result states that, if a kernel $N$ is close to this no-noise case under $d_{W,1}$, then applying $N$ will not move the problem much in the Risk distance, provided that $\ell$ is well-controlled by some metric on $Y$.

Lemma 48 Suppose that $P$ and $P'$ are problems given by $P = (X, Y, \eta, \ell, H)$ and $P' = (X, Y, \eta', \ell, H)$, where $\eta$ and $\eta'$ have the same $X$-marginal $\alpha \in \mathrm{Prob}(X)$, but different disintegrations along $X$, called $\beta$ and $\beta'$ respectively. Suppose also that $Y$ is equipped with a metric $d_Y$ with respect to which $\ell$ is Lipschitz in the second argument, meaning there is a constant $C > 0$ such that $|\ell(z, y) - \ell(z, y')| \le C\, d_Y(y, y')$ for all $y, y', z \in Y$. Then
$$d_R(P, P') \le C\, d^{d_Y}_{W,1}(\beta, \beta').$$

Theorem 49 (Stability under $X$-dependent $Y$-noise) Let $P$ be a problem whose output space $Y$ is equipped with a metric $d_Y$, with respect to which the loss function $\ell$ is $C$-Lipschitz in the second argument. Let $P'$ be the problem obtained by applying the noise kernel $N \colon X \times Y \to \mathrm{Prob}(Y)$ to $P$. Then
$$d_R(P, P') \le C\, d^{d_Y}_{W,1}(N, \delta_{\pi_2}).$$

The proofs are relegated to Appendix A. We illustrate Theorem 49 with an example regarding label noise in binary classification.

Example 50 Let $P = (X, \{0,1\}, \eta, \ell, H)$ be a binary classification problem where $\ell$ is the 0-1 loss $\ell(y, y') = 1_{y \neq y'}$. Suppose we apply noise to $P$ by taking each observation $(x, y)$ and doing the following: with probability $\varepsilon$, re-assign its label $y$ uniformly at random; with probability $1 - \varepsilon$, leave the label $y$ as it is. Through this process we are effectively applying the noise kernel $N_\varepsilon := \varepsilon u + (1 - \varepsilon)\delta_{\pi_2}$, where $u$ is the uniform measure on $\{0,1\}$. Let $P'_\varepsilon$ be the problem $P$ with the noise kernel $N_\varepsilon$ applied. Since the 0-1 loss is a metric, we can apply Theorem 49 with $d_Y = \ell$ and $C = 1$ to get
$$d_R(P, P'_\varepsilon) \le d_{W,1}(\varepsilon u + (1 - \varepsilon)\delta_{\pi_2},\ \delta_{\pi_2}) = \int d^{\ell}_{W,1}(\varepsilon u + (1 - \varepsilon)\delta_y,\ \delta_y)\, \eta(dx \times dy).$$
Recall (Section 2.4) that when the underlying metric is the discrete metric, the Wasserstein distance is the same as the total variation distance. Furthermore, one can show that $d_{TV}(\varepsilon u + (1 - \varepsilon)\delta_y, \delta_y) = \varepsilon\, d_{TV}(u, \delta_y) = \varepsilon/2$. Hence we get $d_R(P, P'_\varepsilon) \le \varepsilon/2$.

One could generalize further: suppose that the probability $\varepsilon$ of reassignment depends on $x$. That is, instead of selecting a constant $\varepsilon$, we select a measurable function $\varepsilon \colon X \to [0,1]$. Then, for an observation $(x, y)$, with probability $\varepsilon(x)$ we re-assign $y$ uniformly at random. The same argument as above instead gives us the bound
$$d_R(P, P'_\varepsilon) \le \frac{1}{2} \int \varepsilon(x)\, \alpha(dx),$$
where $\alpha$ is the $X$-marginal of $\eta$. That is, the Risk distance between $P$ and the noised problem $P'_\varepsilon$ is bounded by half the average value of $\varepsilon(x)$.

In Section 6.5, we will consider an even more general noise model, allowing for noise in the $X$ and $Y$ directions simultaneously. We will show that a modification of the Risk distance, called the $L^p$-Risk distance, is stable under such noise in Theorem 91, which is analogous to Theorem 49.

Figure 3: Distance between $H$ and $H_{X,Y}$; see Example 52.

4.5 Stability under Modification of Predictor Set

We have shown that $d_R$ is stable under changes to the loss function and certain changes to $\eta$. We round out this section by demonstrating stability of $d_R$ under changes to the predictor set of a problem. To state such a stability result, we first need a way to quantify changes to the predictor set. The most salient aspect of a predictor $h$ is its performance with respect to the chosen loss function. In other words, more crucial than the predictor $h$ itself is the function $\ell_h(x, y) := \ell(h(x), y)$. With this perspective in mind, we can map $H$ into $L^1(\eta)$ via the map $h \mapsto \ell_h$, and compare predictors using the $L^1$ norm there. That is, we compare two predictors $h$ and $h'$ via the quantity
$$d_{\ell,\eta}(h, h') := \|\ell_h - \ell_{h'}\|_{L^1(\eta)} = \int |\ell_h(x, y) - \ell_{h'}(x, y)|\, \eta(dx \times dy).$$
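For the 0-1 loss with binary labels, $|\ell_h(x,y) - \ell_{h'}(x,y)|$ equals $1$ exactly when $h$ and $h'$ disagree at $x$, so $d_{\ell,\eta}$ reduces to the $\eta$-mass of the disagreement region. A minimal sketch with hypothetical data (not from the paper):

```python
# A sketch (hypothetical sample, not from the paper) of d_{l,eta} for the 0-1 loss:
# with binary labels, |l_h - l_h'| is the indicator of the disagreement region, so the
# pseudometric is the empirical probability that the two classifiers disagree.

def d_loss(h, h2, sample):
    """Empirical d_{l,eta}(h, h') for the 0-1 loss over a finite sample of (x, y) pairs."""
    return sum(abs((h(x) != y) - (h2(x) != y)) for x, y in sample) / len(sample)

h  = lambda x: 1 if x[0] + x[1] >= 0 else 0       # linear classifier h_{1,1,0}
h2 = lambda x: 1 if x[0] + x[1] >= 0.5 else 0     # same direction, shifted threshold

sample = [((0.0, 0.0), 1), ((0.2, 0.2), 0), ((1.0, 1.0), 1), ((-1.0, -1.0), 0)]
print(d_loss(h, h2, sample))  # 0.5: the classifiers disagree on the first two points
```

Note that the value does not depend on the labels $y$ at all in this case, consistent with the "identifies predictors that incur the same loss" interpretation below.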
This defines a pseudometric on the set of all measurable functions $h \colon X \to Y$ for which $\|\ell_h\|_{L^1(\eta)} < \infty$, a condition which any predictor satisfies since the definition of a problem mandates that predictors have finite risk. The map $h \mapsto \ell_h$ identifies two predictors $h$ and $h'$ if they $\eta$-almost always incur the same loss on a random observation in $X \times Y$. We can now state a stability theorem in terms of the Hausdorff distance on predictor sets with underlying metric $d_{\ell,\eta}$, i.e. the metric
$$d^{d_{\ell,\eta}}_H(H, H') = \inf_{R \in C(H, H')} \sup_{(h, h') \in R} \|\ell_h - \ell_{h'}\|_{L^1(\eta)}.$$

Theorem 51 Let $P$ be a problem, and let $P'$ be a problem that is identical to $P$ except possibly for the predictor set $H'$. Then $d_R(P, P') \le d^{d_{\ell,\eta}}_H(H, H')$.

Example 52 Note that if $P = (X, Y, \eta, \ell, H)$ and $P' = (X, Y, \eta, \ell, H_{X,Y})$, where $H_{X,Y}$ stands for the set of all measurable functions from $X$ to $Y$, Theorem 37 implies that
$$B(P') \le B(P) \le B(P') + \sup_{h' \in H_{X,Y}} \inf_{h \in H} d_{\ell,\eta}(h, h').$$
The right-hand side contains the worst-case $d_{\ell,\eta}$ distance between an arbitrary measurable function $h'$ and the (constrained) predictor set $H$. In other words, this quantity expresses how far $H$ is from the superclass $H_{X,Y}$ and therefore controls the approximation error: the difference between the constrained and unconstrained Bayes risks; see Figure 3.

Proof [Theorem 51] We bound $d_R(P, P')$ by choosing the diagonal coupling between $\eta$ and itself. That is,
$$d_R(P, P') \le \inf_{R \in C(H, H')} \sup_{(h, h') \in R} \int |\ell(h(x), y) - \ell(h'(x), y)|\, \eta(dx \times dy) = \inf_{R \in C(H, H')} \sup_{(h, h') \in R} \|\ell_h - \ell_{h'}\|_{L^1(\eta)} = d^{d_{\ell,\eta}}_H(H, H'),$$
which is the desired bound.

We illustrate the metric $d^{d_{\ell,\eta}}_H$ and Theorem 51 with a toy example.

Example 53 Let $P = (\mathbb{R}^2, \{0,1\}, \eta, \ell, H)$ be the classification problem from Example 18. Recall that the predictors in $H$ were the functions of the form
$$h_{a,b,c}(x_1, x_2) := \begin{cases} 1 & ax_1 + bx_2 \ge c \\ 0 & ax_1 + bx_2 < c \end{cases}.$$
We modify $H$ by adding to each predictor a small area on which it always predicts class 1.
That is, we define a new predictor set $H'_t := \{ h'_{a,b,c,t} \mid a, b, c \in \mathbb{R} \}$, where $h'_{a,b,c,t} \colon \mathbb{R}^2 \to \{0,1\}$ is given by
$$h'_{a,b,c,t}(x_1, x_2) := \begin{cases} 1 & x \in B_t(0) \\ h_{a,b,c}(x) & \text{otherwise} \end{cases}.$$
Here $B_t(0)$ represents the ball of radius $t$ centered at $0$. We can construct a correspondence between $H$ and $H'_t$ by coupling each $h_{a,b,c}$ with $h'_{a,b,c,t}$. That is, we define $R := \{ (h_{a,b,c}, h'_{a,b,c,t}) \mid a, b, c \in \mathbb{R} \}$. Then $R \in C(H, H'_t)$, so we can write the bound
$$d^{d_{\ell,\eta}}_H(H, H'_t) \le \sup_{(h_{a,b,c},\, h'_{a,b,c,t}) \in R} \|\ell_{h_{a,b,c}} - \ell_{h'_{a,b,c,t}}\|_{L^1(\eta)} = \sup_{a,b,c \in \mathbb{R}} \int |\ell(h_{a,b,c}(x), y) - \ell(h'_{a,b,c,t}(x), y)|\, \eta(dx \times dy) = \sup_{a,b,c \in \mathbb{R}} \int_{B_t(0) \times \{0,1\}} |\ell(h_{a,b,c}(x), y) - \ell(h'_{a,b,c,t}(x), y)|\, \eta(dx \times dy) \le \eta(B_t(0) \times \{0,1\}).$$
Furthermore, assuming $\eta(\{0\} \times \{0,1\}) = 0$, we can take the limit as $t \to 0$ to get $\lim_{t \to 0} d^{d_{\ell,\eta}}_H(H, H'_t) = 0$. If we define $P_t := (\mathbb{R}^2, \{0,1\}, \eta, \ell, H'_t)$ for all $t > 0$, then by Theorem 51, $P_t$ converges to $P$ under the Risk distance as $t \to 0$.

We now provide a more serious application of Theorem 51. Different classes of predictors are often compared in the context of universal approximation theorems, which generally state that a certain kind of function can be approximated arbitrarily well using a certain class of model. Using Theorem 51, one can translate universal approximation theorems into statements about the Risk distance. While there are many universal approximation theorems (Pinkus, 1999), we will demonstrate with a classical example (see Leshno et al., 1993).

Proposition 54 Let $\varphi \colon \mathbb{R} \to \mathbb{R}$ be a continuous non-polynomial function. Let $P = (X, \mathbb{R}, \eta, \ell, H)$ be a problem where $X$ is a compact subset of $\mathbb{R}^m$, $\eta$ is any probability distribution on $X$, $\ell$ is a uniformly continuous loss function, and $H$ is the set of all functions $X \to \mathbb{R}$ which can come from a feedforward network with a single hidden layer of arbitrary width and with activation function $\varphi$. That is, $H$ is the set of all functions of the form
$$(x_1, \ldots, x_m) \mapsto \sum_{i=1}^{M} a_i\, \varphi\!\left( \sum_{j=1}^{m} w_{ij} x_j + b_i \right),$$
where the parameters $a_i$, $w_{ij}$, and $b_i$ range over $\mathbb{R}$, and $M$ ranges over the positive integers. Let $P'$ be the same problem, but with predictor set $H'$ the set of all continuous functions $X \to \mathbb{R}$. Then $d_R(P, P') = 0$.

Proof Let $h' \colon X \to \mathbb{R}$ be continuous and let $\varepsilon > 0$ be arbitrary. By uniform continuity of $\ell$, there exists a $\delta > 0$ such that, if $|y - y'| < \delta$, then $|\ell(y, y'') - \ell(y', y'')| < \varepsilon$ for any $y'' \in \mathbb{R}$. By (Leshno et al., 1993, Thm 1), there exists an $h \in H$ with $|h(x) - h'(x)| < \delta$ for all $x \in X$. Then
$$d_{\ell,\eta}(h, h') = \int |\ell(h(x), y) - \ell(h'(x), y)|\, \eta(dx \times dy) \le \int \varepsilon\, \eta(dx \times dy) = \varepsilon.$$
Since $H \subseteq H'$, this implies the Hausdorff distance $d^{d_{\ell,\eta}}_H(H, H')$ is at most $\varepsilon$. By Theorem 51, $d_R(P, P') \le \varepsilon$ as well. In other words, under certain conditions, the Risk distance does not distinguish between a model class and a class of functions in which it is dense.

Figure 4: A simple depiction of two problems $P$ and $P'$ in the space of problems, joined by a geodesic.

5. Geometry of the Space of Problems

We now shift our attention from stability properties of $d_R$ to its geometric properties. We will explore geodesics and their connection to optimal couplings and correspondences. We also show that classification problems (that is, those with finite output spaces) are dense in a larger class of problems. We will use this density to study the effect of limited sampling by studying the convergence of empirical problems, which are problems whose joint laws are given by empirical measures gleaned from finite random samples.

5.1 Geodesics

We begin our exploration of the geometry of the space of problems by searching for geodesics. A geodesic in a metric space $(X, d_X)$ is a continuous function from the unit interval $f \colon [0,1] \to X$ such that, for all $s, t \in [0,1]$,
$$d_X(f(t), f(s)) = d_X(f(0), f(1))\, |s - t|.$$
Some would call such an $f$ a shortest path (Burago et al., 2001, Chapter 2.5) and reserve the term geodesic for a path that locally satisfies the above property. However, the above definition is standard in the field of optimal transport (Ambrosio et al., 2021, Section 9.2, Eq. 9.4). While valuable in illuminating the geometry of a space, geodesics are not solely of geometric interest. A geodesic between two points gives us a way to interpolate or average between them. If a family $\{P_t\}_{t \in [0,1]}$ of problems defines a geodesic in the space of problems, then $P_{1/2}$ could reasonably be considered an average of $P_0$ and $P_1$. If we would rather take a weighted average, we could choose smaller or larger values of $t$ to give more weight to $P_0$ or to $P_1$ respectively. We will show that geodesics can be supplied by certain couplings and correspondences.

Definition 55 Let $P$ and $P'$ be problems. If there exists a coupling $\gamma \in \Pi(\eta, \eta')$ and a correspondence $R \in C(H, H')$ such that $d_R(P, P') = \mathrm{dis}_{P,P'}(R, \gamma)$, then we call $\gamma$ an optimal coupling and $R$ an optimal correspondence for $P$ and $P'$. That is, $\gamma$ and $R$ are called optimal if they achieve the infima in the definition of $d_R(P, P')$.

In Section 5.2, we will explore sufficient conditions for the existence of optimal couplings and correspondences. Presently, we motivate our interest in optimal couplings and correspondences by showing that they are closely related to the geometry of the space of problems: optimal couplings and correspondences provide us with explicit geodesics in the space of problems.

Theorem 56 Suppose $P_0$ and $P_1$ admit an optimal coupling $\gamma$ and an optimal correspondence $R$. Then the following family of problems is a geodesic between $P_0$ and $P_1$:
$$P_t := (X_0 \times X_1,\ Y_0 \times Y_1,\ \gamma,\ \ell_t,\ R^{\times}) \quad \text{for all } t \in [0,1],$$
where
$$\ell_t((y_0, y_1), (y'_0, y'_1)) := (1 - t)\,\ell_0(y_0, y'_0) + t\,\ell_1(y_1, y'_1), \qquad R^{\times} := \{\, h_0 \times h_1 \mid (h_0, h_1) \in R \,\}.$$
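The interpolated loss $\ell_t$ above can be checked numerically. The following sketch (toy losses, not the paper's code) constructs $\ell_t$ on a product of output spaces and verifies the algebraic identity $\ell_s - \ell_t = (t - s)(\ell_0 - \ell_1)$ that drives the geodesic argument:

```python
# A small numeric check (assumed toy setup, not from the paper) of the interpolated
# loss l_t = (1-t) l_0 + t l_1 on Y0 x Y1, and of the identity
# l_s - l_t = (t - s)(l_0 - l_1) used when bounding d_R(P_s, P_t).

def make_lt(l0, l1, t):
    """Build l_t((y0,y1),(z0,z1)) = (1-t) l0(y0,z0) + t l1(y1,z1)."""
    return lambda yy, zz: (1 - t) * l0(yy[0], zz[0]) + t * l1(yy[1], zz[1])

l0 = lambda y, z: abs(y - z)       # absolute loss on Y0
l1 = lambda y, z: (y - z) ** 2     # squared loss on Y1

s, t = 0.25, 0.75
ls, lt = make_lt(l0, l1, s), make_lt(l0, l1, t)
yy, zz = (1.0, 2.0), (3.0, 1.0)
lhs = ls(yy, zz) - lt(yy, zz)
rhs = (t - s) * (l0(yy[0], zz[0]) - l1(yy[1], zz[1]))
print(abs(lhs - rhs) < 1e-12)  # True
```

Integrating this pointwise identity against the fixed coupling $\gamma$ is exactly what yields the factor $(t - s)$ in the proof of Theorem 56.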
It may at first appear that we have given $P_0$ two conflicting definitions in the above statement: once as the given $P_0 = (X_0, Y_0, \eta_0, \ell_0, H_0)$, and again by setting $t = 0$ in our parameterized family, yielding $P_0 = (X_0 \times X_1, Y_0 \times Y_1, \gamma, \ell_0, R^{\times})$. However, the latter problem is a simulation of the former via the projection maps $X_0 \times X_1 \to X_0$ and $Y_0 \times Y_1 \to Y_0$. Hence the two problems are distance zero apart under the Risk distance and can be safely identified. A similar line of thought applies to $P_1$. The proof that this family is a geodesic is very similar to the proofs of analogous results for the Gromov-Hausdorff and Gromov-Wasserstein distances (Chowdhury and Mémoli, 2018; Sturm, 2023).

Proof Define $C := d_R(P_0, P_1)$. Let $0 \le s \le t \le 1$. If we can show $d_R(P_s, P_t) \le C(t - s)$, then equality will follow. Indeed, if the inequality were strict for some $0 \le s_0 \le t_0 \le 1$, then we could use the triangle inequality to write
$$C = d_R(P_0, P_1) \le d_R(P_0, P_{s_0}) + d_R(P_{s_0}, P_{t_0}) + d_R(P_{t_0}, P_1) < s_0 C + (t_0 - s_0) C + (1 - t_0) C = C,$$
which is a contradiction.

Now, we can bound $d_R(P_s, P_t)$ by selecting the diagonal coupling from $\Pi(\gamma, \gamma)$ and the diagonal correspondence from $C(R^{\times}, R^{\times})$ to get
$$d_R(P_s, P_t) \le \sup_{h_0 \times h_1 \in R^{\times}} \int |\ell_s((h_0(x_0), h_1(x_1)), (y_0, y_1)) - \ell_t((h_0(x_0), h_1(x_1)), (y_0, y_1))|\, \gamma(dx_0 \times dy_0 \times dx_1 \times dy_1)$$
$$= \sup_{h_0 \times h_1 \in R^{\times}} \int (t - s)\, |\ell_0(h_0(x_0), y_0) - \ell_1(h_1(x_1), y_1)|\, \gamma(dx_0 \times dy_0 \times dx_1 \times dy_1)$$
$$= (t - s) \sup_{(h_0, h_1) \in R} \int |\ell_0(h_0(x_0), y_0) - \ell_1(h_1(x_1), y_1)|\, \gamma(dx_0 \times dy_0 \times dx_1 \times dy_1) = (t - s)\, C.$$
The last equality holds because $\gamma$ and $R$ are an optimal coupling and correspondence of $P_0$ and $P_1$.

5.2 Existence of Optimal Couplings and Correspondences

Motivated by Theorem 56, we search for conditions under which optimal couplings and correspondences exist. We will achieve this by imposing some continuity and compactness conditions on our problems.

Definition 57 A problem $P$ is called continuous if, for all $h \in H$, $\ell_h$ is continuous $\eta$-almost everywhere.
Note that the above condition applies to many relevant problems. A typical linear regression problem will be continuous. For instance, the linear regression problem in Example 19 is continuous, since the loss is continuous, as is every predictor. Under reasonable assumptions, a classification problem can be continuous as well; one only needs the possible decision boundaries to each have measure zero in the $X$-marginal. Take the classification problem $P$ in Example 18. In that problem, $\ell_h$ is discontinuous for every predictor $h$. However, if we assume that any line in $\mathbb{R}^2$ has measure zero with respect to the $X$-marginal of $\eta$ (as would be the case if the $X$-marginal of $\eta$ is given by a continuous probability distribution), then any linear decision boundary has measure zero. Consequently, $\ell_h$ will be continuous (indeed, constant) $\eta$-almost everywhere.

Let $P, P'$ be continuous problems. Recall that the risk distortion is the function
$$\mathrm{dis}_{P,P'} \colon C(H, H') \times \Pi(\eta, \eta') \to \mathbb{R}, \qquad \mathrm{dis}_{P,P'}(R, \gamma) := \sup_{(h, h') \in R} \int |\ell_h(x, y) - \ell_{h'}(x', y')|\, \gamma(dx \times dy \times dx' \times dy').$$
We would like to discuss the potential continuity of this function, and hence require a suitable topology on its domain. We endow $\Pi(\eta, \eta')$ with the weak topology. We give the set $\{\ell_h \mid h \in H\}$ the $L^1(\eta)$ topology, and endow $H$ with the initial topology induced by the map $h \mapsto \ell_h$. Equivalently, we endow $H$ with the pseudometric $d_{\ell,\eta}$ defined in Section 4.5, and similarly equip $H'$ with $d_{\ell',\eta'}$. We then endow $H \times H'$ with the product metric
$$\delta((g, g'), (h, h')) := \max\{ d_{\ell,\eta}(g, h),\ d_{\ell',\eta'}(g', h') \},$$
and place the resulting Hausdorff distance $d^{\delta}_H$ on $C(H, H')$. Define
$$\widetilde{\mathrm{dis}}_{P,P'} \colon H \times H' \times \Pi(\eta, \eta') \to \mathbb{R}, \qquad \widetilde{\mathrm{dis}}_{P,P'}(h, h', \gamma) := \int |\ell_h(x, y) - \ell_{h'}(x', y')|\, \gamma(dx \times dy \times dx' \times dy'),$$
so that $\mathrm{dis}_{P,P'}(R, \gamma) = \sup_{(h, h') \in R} \widetilde{\mathrm{dis}}_{P,P'}(h, h', \gamma)$. We proceed with a sequence of lemmas. We begin by showing that, under certain conditions, $\widetilde{\mathrm{dis}}_{P,P'}$ is lower semi-continuous, then leverage this result to obtain lower semi-continuity of $\mathrm{dis}_{P,P'}$.
Recall that a function $f \colon S \to \mathbb{R}$ on a topological space $S$ is lower semi-continuous if, for all $x_0 \in S$, we have $\liminf_{x \to x_0} f(x) \ge f(x_0)$. Lower semi-continuity is a useful notion because such functions satisfy a limited version of the extreme value theorem: a lower semi-continuous function on a compact space achieves its minimum.

Lemma 58 Let $P$ and $P'$ be continuous problems. Then $\widetilde{\mathrm{dis}}_{P,P'}$ is lower semi-continuous.

Lemma 59 Let $P$ and $P'$ be continuous problems. Then $\mathrm{dis}_{P,P'}$ is lower semi-continuous.

Using the fact that a lower semi-continuous function on a compact set achieves its minimum, we can prove the following theorem.

Theorem 60 If $P$ and $P'$ are continuous problems with compact predictor sets, then there exist an optimal coupling and correspondence for $P$ and $P'$. That is, there exist $R \in C(H, H')$ and $\gamma \in \Pi(\eta, \eta')$ such that $d_R(P, P') = \mathrm{dis}_{P,P'}(R, \gamma)$.

The proofs can be found in Appendix A. In turn, Theorem 60 and the characterization of weak isomorphism provided by Theorem 29 give us conditions under which weak isomorphism is equivalent to being Risk distance zero apart.

Corollary 61 If $P$ and $P'$ are continuous problems with compact predictor sets, then $P$ and $P'$ are weakly isomorphic if and only if $d_R(P, P') = 0$.

5.3 Density of Classification Problems

One benefit of equipping a set with a pseudometric is that, in return, we receive a notion of convergence. One can ask whether a sequence of points converges, how quickly it does so, and what the limit might be. Density results pave the way for limiting arguments: it is often possible, by a limiting argument, to transfer knowledge about a simple dense subspace to the space as a whole. In this section, we show that classification problems (that is, problems whose output space is a finite collection of labels) are dense in a broader class of problems under $d_R$.
We will leverage this information in Section 5.4 to prove the main result of the section: the convergence of empirical problems (Theorem 69).

Figure 5: A depiction of the coarsening process in Example 64. On the left is a heat map depicting the joint law $\eta$ of a problem $P$ whose input and output spaces are both the interval $X = Y = [0,3]$. We consider the partition $Q = \{[0,1), [1,2), [2,3]\}$ of $Y = [0,3]$. The coarsened problem $P_Q = (X, Q, \eta_Q, \ell_Q, H_Q)$ has three points in the output space $Q$, represented by the vertical axis. The joint law $\eta_Q$ is then a probability measure on three copies of $[0,3]$. We depict $\eta_Q$ on the right with density plots of the three class-conditional probability measures.

Definition 62 (Classification Problem) A problem is called a classification problem if its response space is finite. A problem is called compact if its response space is compact and its loss function is continuous.

A classification problem is so called because we can think of its response space as a finite collection of labels. In Appendix B, we characterize the problems that are weakly isomorphic to classification problems. Given a compact problem $P$, we can produce a related classification problem $P_Q$ by clustering the response space into a finite partition $Q$ and applying a procedure we call coarsening. The definition of a coarsening makes use of a certain map associated to a partition. Given a set $A$ and a partition $Q$ of $A$, consider the map $\pi_Q \colon A \to Q$ which sends each $a \in A$ to the unique partition block $q \in Q$ which contains $a$. That is, $\pi_Q$ is defined such that $a \in \pi_Q(a)$ for all $a \in A$. This map $\pi_Q$ is often called the quotient map associated to $Q$.

Definition 63 (Coarsening) Let $P$ be a problem, let $Q$ be a finite partition of $Y$ into measurable sets, and let $\pi_Q \colon Y \to Q$ be the quotient map associated to $Q$.
The coarsening of $P$ with respect to $Q$ is the classification problem $P_Q = (X, Q, \eta_Q, \ell_Q, H_Q)$, where
$$\eta_Q := (\mathrm{Id}_X \times \pi_Q)_{\#}\, \eta, \qquad \ell_Q(q, q') := \sup_{y \in q,\, y' \in q'} \ell(y, y'), \qquad H_Q := \{ \pi_Q \circ h \mid h \in H \}.$$

Note that if $Q$ is a finite partition of $Y$, then $Q$ is itself a finite set of partition blocks. The response space $Q$ of the coarsened problem $P_Q$ is a finite set with one element for each block of the partition $Q$, and hence $P_Q$ is indeed a classification problem.

Example 64 An example of the coarsening procedure is depicted in Figure 5. We imagine a problem $P = ([0,3], [0,3], \eta, \ell, H)$ where $\eta$ is as pictured, $\ell(y_1, y_2) := |y_1 - y_2|$ is the Euclidean distance, and $H$ is the set of increasing continuous bijections $[0,3] \to [0,3]$. We choose the partition $Q = \{[0,1), [1,2), [2,3]\}$ of $Y$ and consider the coarsened problem $P_Q = ([0,3], Q, \eta_Q, \ell_Q, H_Q)$. Then $P_Q$ is a classification problem with three points in its output space $Q = \{q_1, q_2, q_3\}$, where $q_1 := [0,1)$, $q_2 := [1,2)$, and $q_3 := [2,3]$. Then $\eta_Q$ is a probability measure on $[0,3] \times Q$, a space which consists of three line segments: $[0,3] \times \{q_1\}$, $[0,3] \times \{q_2\}$, and $[0,3] \times \{q_3\}$. The coarsened loss $\ell_Q(q_i, q_j) := \sup_{y \in q_i,\, y' \in q_j} \ell(y, y')$ for $i, j = 1, 2, 3$ can be written in the form of a table:

  ℓ_Q | q1  q2  q3
  ----+-----------
  q1  |  1   2   3
  q2  |  2   1   2
  q3  |  3   2   1

The predictor set $H_Q$ is $\{ \pi_Q \circ h \mid h \in H \}$, where $\pi_Q \colon Y \to Q$ is the quotient map. Then $H_Q$ can also be written as the set of all functions $g \colon [0,3] \to Q$ of the form
$$g(t) = \begin{cases} q_1 & t < a \\ q_2 & a \le t < b \\ q_3 & b \le t \end{cases}$$
for $0 < a \le b < 3$.

Coarsening is reminiscent of the probing reduction (Langford and Zadrozny, 2005), which takes a probability estimation problem and solves it with an ensemble of classifier learners, hence reducing the problem of probability estimation to the problem of binary classification.
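The table of coarsened losses in Example 64 can be reproduced mechanically. A minimal sketch (not the paper's code) using the closed form of the supremum over interval blocks:

```python
# A sketch reproducing the coarsened loss table of Example 64: l(y, y') = |y - y'| on
# [0, 3] with partition Q = {[0,1), [1,2), [2,3]}. For interval blocks (a, b), (c, d),
# sup |y - y'| equals max(d - a, b - c) (the sup over the closures of the blocks).

def coarsened_loss(block, block2):
    """sup_{y in q, y' in q'} |y - y'| for interval blocks q = (a, b), q' = (c, d)."""
    (a, b), (c, d) = block, block2
    return max(d - a, b - c)

Q = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
table = [[coarsened_loss(q, q2) for q2 in Q] for q in Q]
for row in table:
    print(row)
# [1.0, 2.0, 3.0]
# [2.0, 1.0, 2.0]
# [3.0, 2.0, 1.0]
```

The output matches the table above; note the diagonal entries are 1, since even within a single block two labels can differ by up to the block's width.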
Indeed, one can view the probing reduction as a practical method of approximating a problem whose output space is $Y = [0,1]$ by iteratively constructing finer and finer partitions of $Y$ and solving the resulting coarsened problems with binary classifiers.

Theorem 65 Let $P$ be a compact problem. For any $\varepsilon > 0$, there is a partition $Q$ of the response space such that $d_R(P, P_Q) < \varepsilon$.

The proof is relegated to Appendix A.

Remark 66 One notable feature of the proof of Theorem 65 is that the partition $Q$ provided by the theorem does not depend on the joint law $\eta$ of $P$. More precisely, suppose $P$ is a problem. For any probability measure $\eta' \in \mathrm{Prob}(X \times Y)$, we can create a new problem $P_{\eta'} := (X, Y, \eta', \ell, H)$. The proof of Theorem 65 shows that not only can we find a partition $Q$ of $Y$ such that $d_R(P, P_Q) < \varepsilon$, but we can choose $Q$ such that for all $\eta' \in \mathrm{Prob}(X \times Y)$, we have $d_R((P_{\eta'})_Q, P_{\eta'}) < \varepsilon$. We will use this stronger fact to prove Theorem 69.

Theorem 65 immediately gives us a density result.

Corollary 67 (Classification Density Theorem) Under $d_R$, the set of classification problems is dense in the space of compact problems.

5.4 Convergence of Empirical Problems

With the density of classification problems in hand, we are ready to examine the effect that limited sampling can have on a problem. One often does not have direct knowledge of the underlying joint law from which data is sampled; one only has access to a finite data set. In terms of our supervised learning problems, instead of accessing the joint law $\eta$ of a problem $P$ directly, we can only access the empirical measure $\eta_n$ of $\eta$. Namely, given an i.i.d. sample $\{(x_i, y_i)\}_{i=1}^{n}$ from the probability measure $\eta$, we set
$$\eta_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}.$$
The resulting problem with joint law $\eta_n$ is called the empirical problem.

Definition 68 If $P$ is a problem, the $n$th empirical problem induced by $P$ is the random problem defined by $P_n := (X, Y, \eta_n, \ell, H)$.
Being dependent on a random sample, $\eta_n$ is a random measure. Consequently, $P_n$ is a random problem. One would hope that as the size of the sample grows, the empirical problem $P_n$ converges to the true problem $P$ with probability 1. Under certain conditions, this is indeed true.

Theorem 69 (Convergence of the Empirical Problem) Let $P$ be a problem satisfying the following:
1. $H$ is compact with respect to $d_{\ell,\eta}$.
2. $Y$ is compact.
3. $\ell$ is continuous.
4. The collection $\{ |\ell_h - \ell_{h'}| \}_{h, h' \in H}$ forms a Glivenko-Cantelli class with respect to $\eta$.
Let $P_n$ be the $n$th empirical problem induced by $P$. Then $d_R(P, P_n) \to 0$ almost surely.

The proof, found in Appendix A, uses Theorem 65 to reduce to the case where $P$ is a classification problem. We briefly illuminate the Glivenko-Cantelli condition in Theorem 69 by providing simpler sufficient conditions.

Proposition 70 Define $g_{h,h'}(x, y) := |\ell_h(x, y) - \ell_{h'}(x, y)|$. Theorem 69 still holds if Condition 4 is replaced with either of the following conditions.
4a. The collection of subgraphs $\{ \mathrm{sg}(g_{h,h'}) \mid h, h' \in H \}$ (see Definition 12) has finite VC dimension.
4b. For any $h_0, h'_0 \in H$ and $\varepsilon > 0$, there exists an open neighborhood $(h_0, h'_0) \in U \subseteq H \times H$ such that
$$\int \sup_{(h, h') \in U} g_{h,h'}(x, y)\, \eta(dx \times dy) - \varepsilon < \int g_{h_0, h'_0}(x, y)\, \eta(dx \times dy) < \int \inf_{(h, h') \in U} g_{h,h'}(x, y)\, \eta(dx \times dy) + \varepsilon.$$

Conditions 2 and 4a together imply Condition 4, since the compactness of $Y$ implies that the $g_{h,h'}$ are uniformly bounded, and hence that Proposition 13 applies. Note that Condition 4a does not depend on the joint law $\eta$. Conditions 1 and 4b together imply Condition 4 as well, by a theorem of Gaenssler (1975).

Theorem 69 is a convergence result; a quantitative result about the rate of convergence would be more useful. Certainly the convergence rate will depend on properties of $P$ and $P'$; the proof of Theorem 69 suggests that the proximity of $P$ and $P'$ to less complex problems is crucial.
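The empirical problem is easy to simulate on finite spaces. The following sketch (hypothetical joint law and predictors, not from the paper) draws $n$ i.i.d. observations, forms $\eta_n$, and checks that the empirical risks of a small finite predictor set track the true risks, in the spirit of the Glivenko-Cantelli condition:

```python
# A simulation sketch (not from the paper) of Definition 68: draw n i.i.d. pairs from
# a joint law eta on finite X x Y, form the empirical measure eta_n, and compare the
# empirical risk of each predictor in a small finite H with its true risk.

import random
from collections import Counter

random.seed(0)
# Joint law on X = {0, 1}, Y = {0, 1}, stored as {(x, y): mass}.
eta = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
H = [lambda x: 0, lambda x: 1, lambda x: x]      # three predictors
loss = lambda z, y: float(z != y)                # 0-1 loss

def risk(h, measure):
    return sum(m * loss(h(x), y) for (x, y), m in measure.items())

points, weights = zip(*eta.items())
n = 20000
draws = random.choices(points, weights=weights, k=n)
eta_n = {p: c / n for p, c in Counter(draws).items()}   # the empirical measure eta_n

gap = max(abs(risk(h, eta_n) - risk(h, eta)) for h in H)
print(f"sup_h |R_n(h) - R(h)| = {gap:.4f}")   # small for large n
```

Here the uniform gap over $H$ plays the role of the Glivenko-Cantelli supremum; for this finite $H$ it shrinks at the usual $O(1/\sqrt{n})$ rate.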
Question 71 How does the rate of the convergence given by Theorem 69 depend on the properties of $P$ and $P'$? Can one bound the convergence rate for specific classes of problems?

6. The $L^p$-Risk Distance: Weighting Problems

In this section we introduce a variant of the Risk distance, called the $L^p$-Risk distance, which incorporates probability measures on the predictor sets. This additional information lets us prioritize certain predictors over others by weighting them more heavily, subsequently softening the Risk distance and improving its stability. The weighting can be interpreted as a Bayesian prior, so the perspective adopted in this section is broadly aligned with the principles of Bayesian statistics. While approaches such as Bayesian decision theory (Berger, 1985) incorporate elements like loss functions, prior work does not operate on a space of problems, nor does it attempt to endow such a space with a metric structure as we do in this paper.

6.1 Weighting Predictor Sets and the $L^p$-Risk Distance

Many of our results thus far have required assumptions of compact predictor sets or bounded loss functions. Without these assumptions, the Risk distance tends to be poorly behaved. For instance, problems such as the linear regression in Example 19 tend to be infinitely far away from problems of bounded loss under the Risk distance. One reason for this phenomenon is the supremum in the definition of the Risk distance: the supremum treats unreasonable predictors, such as those whose graphs are far from the support of the joint law, with the same consideration as more reasonable options. One might reasonably complain about this design decision. One solution to this problem is to weight the predictors, de-prioritizing extreme options by giving them less weight.
These weights could come from prior knowledge or belief about which predictors might perform well, or from a measure of complexity of each predictor (so that more complex predictors receive lower weights). Indeed, considering these weights as part of the specification of a problem can be seen as a proxy for adding a regularization term to the constrained Bayes risk. More formally, we provide weights on the predictor set in the form of a probability measure. This lets us replace the supremum in the definition of $d_R$ with an $L^p$-type integral, effectively softening the Risk distance.

Definition 72 A weighted problem is a pair $(P, \lambda)$ where $P$ is a problem whose predictor set $H$ comes equipped with a Polish topology, and $\lambda$ is a probability measure on $H$. When there is no chance of confusion, we may suppress the measure $\lambda$ and denote the weighted problem by $P$.

Assumption 73 We make the following measurability assumption. Given a weighted problem $(P, \lambda)$, consider the map $H \times X \times Y \to \mathbb{R}_{\ge 0}$ given by $(h, x, y) \mapsto \ell_h(x, y)$. Since $H$, $X$, and $Y$ are topological spaces, we can equip each space with the corresponding Borel $\sigma$-algebra. We can then equip the product $H \times X \times Y$ with the product $\sigma$-algebra. For any weighted problem $(P, \lambda)$, we henceforth assume that the above map $(h, x, y) \mapsto \ell_h(x, y)$ is measurable with respect to this product $\sigma$-algebra.

Definition 74 Let $(P, \lambda)$ and $(P', \lambda')$ be weighted problems. For $p \in [1, \infty]$ and for any couplings $\gamma \in \Pi(\eta, \eta')$ and $\rho \in \Pi(\lambda, \lambda')$, define the $L^p$-risk distortion of $\gamma$ and $\rho$ to be
$$\mathrm{dis}_{P,P',p}(\rho, \gamma) := \left( \int \left( \int |\ell(h(x), y) - \ell'(h'(x'), y')|\, d\gamma(x, y, x', y') \right)^{p} d\rho(h, h') \right)^{1/p}, \qquad p \in [1, \infty),$$
$$\mathrm{dis}_{P,P',\infty}(\rho, \gamma) := \sup_{(h, h') \in \mathrm{supp}[\rho]} \int |\ell(h(x), y) - \ell'(h'(x'), y')|\, d\gamma(x, y, x', y'), \qquad p = \infty.$$
The $L^p$-Risk distance between weighted problems $(P, \lambda)$ and $(P', \lambda')$ is then defined to be
$$d_{R,p}((P, \lambda), (P', \lambda')) := \inf_{\rho, \gamma}\ \mathrm{dis}_{P,P',p}(\rho, \gamma),$$
where $\gamma$ ranges over all couplings $\Pi(\eta, \eta')$ and $\rho$ ranges over all couplings $\Pi(\lambda, \lambda')$.
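On finite spaces the $L^p$-risk distortion of Definition 74 is a weighted power mean. A minimal sketch (hypothetical finite problems and precomputed inner integrals, not the paper's code):

```python
# A discrete sketch (hypothetical values, not from the paper) of the L^p-risk
# distortion: an inner integral of the loss gap against a coupling gamma of the joint
# laws, followed by an outer L^p average against a coupling rho of the predictor weights.

def lp_distortion(inner, rho, p):
    """dis_{P,P',p}(rho, gamma), given inner[(h, h')] = loss-gap integral against gamma."""
    return sum(w * inner[pair] ** p for pair, w in rho.items()) ** (1 / p)

# Two predictors per problem; inner integrals precomputed against some fixed gamma.
inner = {("h0", "g0"): 0.1, ("h0", "g1"): 0.4, ("h1", "g0"): 0.3, ("h1", "g1"): 0.2}
rho = {("h0", "g0"): 0.5, ("h1", "g1"): 0.5}   # a coupling of uniform predictor weights

print(lp_distortion(inner, rho, 1))   # = 0.5*0.1 + 0.5*0.2, approximately 0.15
print(lp_distortion(inner, rho, 2))   # approximately 0.158: larger p emphasizes worse pairs
```

As $p$ grows the distortion approaches the supremum over the support of $\rho$, recovering the $p = \infty$ case.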
Theorem 75 The $L^p$-Risk distance is a pseudometric on the collection of all weighted problems. Furthermore, we have (i) $d_{R,p} \le d_{R,q} \le d_{R,\infty}$ for all $1 \le p \le q \le \infty$; (ii) $d_R \le d_{R,\infty}$.

Remark 76 Property (i) is analogous to a similar property enjoyed by the Gromov-Wasserstein distance (Mémoli, 2011, Theorem 5.1(h)), and it follows from (Mémoli, 2011, Lemma 2.2). Property (ii) states that the $L^\infty$-Risk distance is never smaller than the standard Risk distance from Definition 26. This is analogous to the fact that the Gromov-Hausdorff distance is bounded above by the $L^\infty$-Gromov-Wasserstein distance (Mémoli, 2011, Theorem 5.1(b)). The proof of the triangle inequality for the $L^p$-Risk distance is identical to the proof of the same for the Risk distance, save for an application of Minkowski's inequality for the outer integral. As indicated above, the proof of the other claimed properties follows similar steps as those for the Gromov-Wasserstein distance, and we therefore omit them.

Example 77 Just as we defined the one-point problems $P^*(c) := (\{*\}, \{*\}, \delta_{(*,*)}, c, \{\mathrm{Id}_{\{*\}}\})$ for each $c \ge 0$ in Example 20, we can define a corresponding one-point weighted problem by endowing $\{\mathrm{Id}_{\{*\}}\}$ with the point-mass probability measure. We will still use $P^*$ to denote $P^*(0)$. Then, in a manner similar to Example 31, if $(P, \lambda)$ is a weighted problem, one can show
$$d_{R,p}(P, P^*) = \left( \int_H \left( \int_{X \times Y} \ell(h(x), y)\, \eta(dx \times dy) \right)^{p} \lambda(dh) \right)^{1/p} = \left( \int_H R_P(h)^{p}\, \lambda(dh) \right)^{1/p}.$$
This can be seen as a kind of weighted constrained Bayes risk of $(P, \lambda)$. While the constrained Bayes risk of an unweighted problem $P$ is the infimal possible risk, the above expression $d_{R,p}(P, P^*)$ represents the $p$th moment of possible risks when a predictor is chosen at random according to $\lambda$. In particular, $d_{R,1}(P, P^*)$ is simply the average risk of a predictor chosen randomly according to $\lambda$. One can establish a connection with the usual constrained Bayes risk as follows.
Assume that the problem $P$ is such that its loss function $\ell$ is bounded above by a constant $\ell_{\max} > 0$. Then,
$$d_{R,p}(P, P^*(\ell_{\max})) = \left( \int (\ell_{\max} - R_P(h))^{p}\, \lambda(dh) \right)^{1/p}.$$
If the support of $\lambda$ is the whole of $H$, we obtain, as $p \to \infty$, that $d_{R,p}(P, P^*(\ell_{\max})) \to \ell_{\max} - B(P)$.

The distance $d_{R,p}(P, P^*)$ from Example 77 will appear again in Theorem 82, so we establish a definition.

Definition 78 Given a weighted problem $P = (P, \lambda)$ and $1 \le p < \infty$, the $p$-diameter of $P$ is given by
$$d_{R,p}(P, P^*) = \left( \int_H \left( \int_{X \times Y} \ell(h(x), y)\, \eta(dx \times dy) \right)^{p} \lambda(dh) \right)^{1/p} = \left( \int_H R_P(h)^{p}\, \lambda(dh) \right)^{1/p}.$$
The word diameter is chosen in analogy with metric geometry, where the diameter of a metric space can be found from its Gromov-Hausdorff distance from the one-point metric space.

6.2 Connection to the Gromov-Wasserstein Distance

In Section 3.5, we saw two ways to turn metric measure spaces into problems, each of which relates the Gromov-Wasserstein distance to the Risk distance. In this section, we will draw a similar connection between the Gromov-Wasserstein distance and the $L^p$-Risk distance. Specifically, we will provide a way of turning metric measure spaces into weighted problems such that the $L^1$-Risk distance between the resulting problems aligns with a well-known modification of the Gromov-Wasserstein distance between the original spaces. Given any metric measure space $(X, d_X, \mu_X)$, define a weighted problem $(P_X, \mu_X)$ by
$$P_X := (X, X, (\Delta_X)_{\#}\mu_X, d_X, H_X),$$
where $\Delta_X \colon X \to X \times X$ is the diagonal map $x \mapsto (x, x)$, and $H_X$ is the set of all constant functions $X \to X$. Here we think of $\mu_X$ as a measure on both $X$ and $H_X$, since the two are in natural bijection.

Proposition 79 Let $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ be two metric measure spaces. Then
$$d_{R,1}(P_X, P_Y) = \inf_{\substack{\gamma \in \Pi(\mu_X, \mu_Y) \\ \rho \in \Pi(\mu_X, \mu_Y)}} \int \int |d_X(x_2, x_1) - d_Y(y_2, y_1)|\, \gamma(dx_1 \times dy_1)\, \rho(dx_2 \times dy_2). \qquad (*)$$
The proof of this proposition can be found in Appendix A. Use $F(\gamma, \rho)$ to denote the double integral in the expression $(*)$.
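On finite spaces the decoupled objective $F(\gamma, \rho)$ is a quadruple sum and can be evaluated directly. A brute-force sketch on toy two-point spaces (assumed example, not the paper's code):

```python
# A brute-force sketch (toy two-point spaces, not from the paper) of the decoupled
# objective F(gamma, rho) in (*): the GW integrand |d_X(x2, x1) - d_Y(y2, y1)|
# integrated against two independent transport plans gamma and rho.

def F(dX, dY, gamma, rho):
    return sum(
        g * r * abs(dX[x2][x1] - dY[y2][y1])
        for (x1, y1), g in gamma.items()
        for (x2, y2), r in rho.items()
    )

dX = [[0.0, 1.0], [1.0, 0.0]]          # two points at distance 1
dY = [[0.0, 3.0], [3.0, 0.0]]          # two points at distance 3
diag = {(0, 0): 0.5, (1, 1): 0.5}      # deterministic plans between uniform measures
anti = {(0, 1): 0.5, (1, 0): 0.5}

coupled = F(dX, dY, diag, diag)        # the quadratic GW objective F1 at gamma = diag
decoupled = min(F(dX, dY, g, r) for g in (diag, anti) for r in (diag, anti))
print(coupled, decoupled)              # prints: 1.0 1.0
assert decoupled <= coupled            # decoupling can only lower the objective
```

Minimizing over $\gamma$ and $\rho$ independently can only do at least as well as the constrained choice $\gamma = \rho$, which is why $(*)$ lower-bounds $2\, d_{GW,1}(X, Y)$.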
If we add the additional constraint that γ = ρ, then the expression (*) becomes 2 d_{GW,1}(X, Y). As written, (*) is a lower bound on 2 d_{GW,1}(X, Y), which could be arrived at via a bilinear relaxation of the optimization problem posed by d_{GW,1}. That is, one could begin with the objective function for the Gromov-Wasserstein optimization problem

  F₁(γ) := ∬ |d_X(x₂, x₁) − d_Y(y₂, y₁)| γ(dx₁×dy₁) γ(dx₂×dy₂),

which is quadratic in the transport plan γ, and decouple the two instances of γ to get the objective F(γ, ρ), which is bilinear in the two transport plans. This relaxation was considered in (Mémoli, 2011, Section 7) for estimating the Gromov-Wasserstein distance via an alternating minimization procedure.

Similar optimal transport problems, which simultaneously optimize two transport plans in a Gromov-Wasserstein-like objective function, appear in the literature. Indeed, when X and Y are finite, the expression (*) is a special case of CO-Optimal Transport (Vayer et al., 2020b). Séjourné et al. (2021) apply this relaxation technique, which they call the bi-convex relaxation, to an unbalanced variant of the Gromov-Wasserstein distance. This inspired Beier et al. (2023) to explore a similar relaxation for a multi-marginal variant of the same problem. A theoretical study of the metric properties of the right-hand side of (*) appears in (Chen et al., 2022, Appendix A.4).

6.3 Stability of the Loss Profile Distribution

Because the Lp-Risk distance is softer than the Risk distance, the loss profile set is less stable under it. To obtain a stability result, one cannot compare loss profile sets directly, but must instead compare certain probability distributions over loss profiles.

Definition 80 Let (P, λ) be a weighted problem. Define F_P : H → Prob(R) by F_P(h) := (ℓ_h)_#η. The loss profile distribution of P is then defined to be L(P) := (F_P)_#λ. Note that L(P) ∈ Prob(Prob(R)) is a probability measure on the space of real probability measures.
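The objects in Definition 80 are straightforward to approximate empirically: F_P(h) by the losses of h on a sample from η, and L(P) by drawing predictors from λ. A small illustrative sketch follows (the squared loss, the linear predictor class, and all names are our own choices, not prescribed by the text); for equal-size empirical measures on R, the 1-Wasserstein distance between loss profiles reduces to a sorted-sample comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample observations from eta: y = 2x + small noise
x = rng.normal(size=1000)
y = 2 * x + 0.1 * rng.normal(size=1000)

def loss_profile(a):
    """Empirical loss profile F_P(h_a): the losses of h_a(x) = a*x on the sample."""
    return (a * x - y) ** 2

def w1(u, v):
    """1-Wasserstein distance between two equal-size empirical measures on R."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

# The loss profile distribution L(P): profiles of predictors drawn from lambda
lam_sample = rng.normal(size=50)             # predictors a ~ lambda = N(0, 1)
profiles = [loss_profile(a) for a in lam_sample]

# Profiles of nearby predictors are close in W1; profiles of far ones are not
print(w1(loss_profile(2.0), loss_profile(2.01)))   # small
print(w1(loss_profile(2.0), loss_profile(-2.0)))   # large
```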
More specifically, F_P(h) is the loss profile of h, so L(P) describes what the loss profile of a random predictor chosen according to λ is likely to be. While Theorem 37 shows that the loss profile set is stable under the Risk distance with respect to the Hausdorff distance, the following analogous theorem shows that the measure L(P) is stable under the Lp-Risk distance with respect to the Wasserstein distance.

Theorem 81 Let P and P′ be weighted problems. Then

  d_{W,p}^{d_{W,1}}( L(P), L(P′) ) ≤ d_{R,p}(P, P′),

where d_{W,p}^{d_{W,1}} is the p-Wasserstein metric on Prob(Prob(R)) with underlying metric d_{W,1} on Prob(R). The proof is in Appendix A.

6.4 Improved Density

A useful theorem in measure theory states that any probability measure μ on a Polish space X is tight, meaning that for any ε > 0 there is a compact subset K ⊆ X with μ(K) > 1 − ε. In a slogan, Polish probability spaces are almost compact. Applying this principle to our setting suggests that any weighted problem should be close to one with a compact predictor set. This can be made rigorous with a density theorem.

Theorem 82 Let p ≥ 1 and let (P, λ) be a weighted problem of finite p-diameter. Then for any ε > 0 there exists a compact weighted problem (P′, λ′) with d_{R,p}(P, P′) < ε.

Additionally, a result analogous to Theorem 65 still holds for the Lp-Risk distance.

Proposition 83 Let p ≥ 1 and let (P, λ) be a compact weighted problem. Then for any ε > 0 there exists a weighted classification problem (P′, λ′) with d_{R,p}(P, P′) < ε.

The proofs of both Theorem 82 and Proposition 83 can be found in Appendix A. Chaining these two density results together proves the following corollary.

Corollary 84 Under d_{R,p} for p ≥ 1, weighted classification problems are dense in the space of weighted problems of finite p-diameter.

Example 85 Consider the linear regression problem P from Example 19.
If we endow the predictor set of P with a probability measure satisfying some finite moment assumptions, the resulting weighted problem has finite p-diameter. Corollary 84 then tells us that P can be approximated arbitrarily well by weighted classification problems under d_{R,p}. More precisely, let P be a linear regression problem, and suppose the predictor set

  H = { x ↦ aᵀx + b | (a, b) ∈ Rⁿ × R }

is equipped with a probability measure λ. We can think of both λ and η as probability measures on Rⁿ × R. If

1. all n+1 marginals of η have finite second moment, and
2. all n+1 marginals of λ have finite 2p-th moment,

then diam_p(P) < ∞. To prove this, we can write

  diam_p(P)^{1/2} = ( ∫ ( ∫ ℓ(aᵀx + b, y) η(dx×dy) )^p λ(da×db) )^{1/2p} = ( ∫ [ ( ∫ (aᵀx + b − y)² η(dx×dy) )^{1/2} ]^{2p} λ(da×db) )^{1/2p}.

By liberally applying Minkowski's inequality and simplifying, we can bound this expression by the sum

  Σ_{i=1}^{n} ( ∫ a_i^{2p} λ(da×db) )^{1/2p} ( ∫ x_i² η(dx×dy) )^{1/2} + ( ∫ b^{2p} λ(da×db) )^{1/2p} + ( ∫ y² η(dx×dy) )^{1/2}.

These integrals are the moments of η and λ that we have assumed to be finite.

6.5 Improved Stability

The Lp-Risk distance enjoys many of the same stability and density results as the Risk distance, and we will soon see that many of these results can in fact be strengthened for it. First we establish that the Lp-Risk distance inherits some stability results from the Risk distance, saving the improvements for later. Many of our bounds on the Risk distance are proven using the bound

  d_R(P, P′) ≤ inf_{γ∈Π(η,η′)} dis_{P,P′}(R_diag, γ)   (2)

where R_diag is the diagonal correspondence, which is valid whenever P and P′ share a predictor set. Hence we can extend the results of those sections using the following lemma.

Lemma 86 Let (P, λ) and (P′, λ) be weighted problems that share a predictor set H and predictor weights λ.
Then

  d_{R,p}((P, λ), (P′, λ)) ≤ inf_{γ∈Π(η,η′)} dis_{P,P′}(R_diag, γ),

where R_diag is the diagonal correspondence of H with itself.

Proof First write d_{R,p}((P, λ), (P′, λ)) ≤ dis_{P,P′,p}(ρ_diag, γ), where ρ_diag is the diagonal coupling of λ with itself. Next we bound the Lp-risk distortion by the L∞-risk distortion as follows:

  dis_{P,P′,p}(ρ_diag, γ) ≤ dis_{P,P′,∞}(ρ_diag, γ)
  = sup_{(h,h′)∈supp[ρ_diag]} ∫ |ℓ(h(x), y) − ℓ′(h′(x′), y′)| γ(dx×dy×dx′×dy′)
  ≤ sup_{(h,h′)∈R_diag} ∫ |ℓ(h(x), y) − ℓ′(h′(x′), y′)| γ(dx×dy×dx′×dy′)
  = dis_{P,P′}(R_diag, γ),

where the second inequality follows from the fact that the support of ρ_diag lies within the diagonal R_diag ⊆ H × H.

Condition 2 from Theorem 28, as well as the Risk distance bounds of Sections 4.3 (stability under bias), 4.4 (stability under noise), and 5.4 (convergence of empirical problems), are all proven via the diagonal bound (2).⁶ By Lemma 86, these results still hold if we replace the Risk distance with the Lp-Risk distance and replace both problems with weighted problems that share a common weight measure λ. That is, in the above results, we can replace each instance of d_R(P, P′) with d_{R,p}((P, λ), (P′, λ)) and they will still hold true.

6. In addition to the diagonal bound, the proof of Theorem 69 also uses the density of classification problems (Theorem 67). However, the proof of the analogous result for weighted problems is still valid if we replace this step with an application of the density of weighted classification problems (Proposition 83).

Moreover, some of our results for the Risk distance can be strengthened for the Lp-Risk distance. Indeed, we have already seen that the density results of Section 5.3 can be strengthened into Theorem 82. We can strengthen some stability results in the same manner. For instance, the Lp-Risk distance satisfies the following version of Condition 2, in which the supremum is replaced by an integral.
Proposition 87 Let (P, λ) and (P′, λ) be weighted problems with P = (X, Y, η, ℓ, H) and P′ = (X, Y, η, ℓ′, H). Then

  d_{R,p}(P, P′) ≤ ( ∫ ( ∫ |ℓ(h(x), y) − ℓ′(h(x), y)| η(dx×dy) )^p λ(dh) )^{1/p}.

Proof Bound the Lp-Risk distance by selecting the diagonal couplings.

We can also modify Theorem 51 by replacing the Hausdorff distance with a Wasserstein distance.

Proposition 88 Let (P, λ) and (P′, λ′) be weighted problems with P := (X, Y, η, ℓ, H) and P′ := (X, Y, η, ℓ, H′). Let i : H → H∪H′ and j : H′ → H∪H′ be the inclusion maps, and endow H∪H′ with the pseudometric d_{ℓ,η}(h₁, h₂) := ‖ℓ_{h₁} − ℓ_{h₂}‖_{L¹(η)} introduced in Section 5.2. Then

  d_{R,p}(P, P′) ≤ d_{W,p}^{d_{ℓ,η}}( i_#λ, j_#λ′ ).

Proof Selecting the diagonal coupling between η and itself gives us the bound

  d_{R,p}(P, P′) ≤ inf_{ρ∈Π(λ,λ′)} ( ∫ ( ∫ |ℓ_h(x, y) − ℓ_{h′}(x, y)| η(dx×dy) )^p ρ(dh×dh′) )^{1/p}
  = inf_{ρ∈Π(i_#λ, j_#λ′)} ( ∫ ( ∫ |ℓ_h(x, y) − ℓ_{h′}(x, y)| η(dx×dy) )^p ρ(dh×dh′) )^{1/p}
  = inf_{ρ∈Π(i_#λ, j_#λ′)} ( ∫ d_{ℓ,η}(h, h′)^p ρ(dh×dh′) )^{1/p}
  = d_{W,p}^{d_{ℓ,η}}( i_#λ, j_#λ′ ),

as desired.

Proposition 44 can be modified for weighted problems as well. We first construct an analogue of the metric s_{ℓ,H} from Section 4.3. Replacing the supremum over H in the definition of s_{ℓ,H} with an Lp-style integral against λ, we arrive at the definition

  s_{ℓ,λ,p}((x, y), (x′, y′)) := ( ∫ |ℓ(h(x), y) − ℓ(h(x′), y′)|^p λ(dh) )^{1/p}.

In other words, to determine the distance between (x, y) and (x′, y′), we ask how different the loss on the observation (x, y) will be from the loss on (x′, y′) for an average predictor. We can now state our stability result.

Proposition 89 Let (P, λ) be a weighted problem, and let (P′, λ) be a weighted problem that is identical to P except possibly for its joint law η′. Then

  d_{R,p}(P, P′) ≤ d_{W,1}^{s_{ℓ,λ,p}}(η, η′).
Proof By selecting the diagonal coupling between λ and itself and applying Minkowski's inequality for integrals, we can bound d_{R,p} by

  d_{R,p}(P, P′) ≤ inf_{γ∈Π(η,η′)} ( ∫ ( ∫ |ℓ(h(x), y) − ℓ(h(x′), y′)| γ(dx×dy×dx′×dy′) )^p λ(dh) )^{1/p}
  ≤ inf_{γ∈Π(η,η′)} ∫ ( ∫ |ℓ(h(x), y) − ℓ(h(x′), y′)|^p λ(dh) )^{1/p} γ(dx×dy×dx′×dy′)
  = d_{W,1}^{s_{ℓ,λ,p}}(η, η′),

giving us the desired bound.

Lastly, we turn our attention to stability under noise. We have shown in Section 4.4 that the Risk distance is stable with respect to noise in the joint law when ℓ is a metric. By contrast, for a general ℓ, the Risk distance d_R can be highly sensitive to changes in the joint law. The following example demonstrates this sensitivity.

Example 90 Define a problem P = ({•}, R, δ₀, ℓ, H), where ℓ(y, y′) = (y − y′)² and H is the collection of all functions {•} → R, making H itself homeomorphic to R.⁷ Now, for any ε > 0, define

  P_ε = ( {•}, R, ½(δ₀ + δ_ε), ℓ, H ).

We can think of P_ε as P with some noise; when making an observation, there is a 50% chance of observing the correct y-value of 0, and a 50% chance of making a small error and reading a y-value of ε. We would hope that P_ε would converge to P under d_R as ε → 0, but this is not true. There is only one coupling between δ₀ and ½(δ₀ + δ_ε), so the Risk distance is

  d_R(P, P_ε) = inf_{R∈C(H,H)} sup_{(a,b)∈R} ∬ |(a − y)² − (b − y′)²| δ₀(dy) × ½(δ₀ + δ_ε)(dy′)
  = inf_{R∈C(H,H)} sup_{(a,b)∈R} ½ ( |a² − b²| + |a² − (b − ε)²| )
  ≥ ½ inf_{R∈C(H,H)} sup_{(a,b)∈R} |b² − (b − ε)²|
  = ½ sup_{b∈R} |2bε − ε²| = ∞.

7. In other words, we can identify H with R.

The metric d_{R,p} enjoys a stronger stability than d_R with respect to noise in a problem's joint law. We use a more general noise model than the one outlined in Section 4.4, allowing for noise in the X direction as well as the Y direction. More specifically, we let our noise be given by any Markov kernel N : X × Y → Prob(X × Y).
That is, given a problem P, define P_N to be the same problem but with joint law on X × Y given by

  (ηN)(A) := ∫_{X×Y} N(x, y)(A) η(dx×dy)

for all measurable A ⊆ X × Y. We justify the use of Markov kernel composition notation by thinking of the measure η as a Markov kernel η : {•} → Prob(X × Y). We can now state our improved result about stability under noise.

Theorem 91 Let (P, λ) be a weighted problem, and N : X × Y → Prob(X × Y) a Markov kernel. Then

  d_{R,p}(P, P_N) ≤ d_{W,1}^{s_{ℓ,λ,p}}( δ_{Id_{X×Y}}, N ).

Here δ_{Id_{X×Y}} is the Markov kernel sending (x, y) to δ_{(x,y)}, and d_{W,1}^{s_{ℓ,λ,p}} is the Wasserstein metric on Markov kernels X × Y → Prob(X × Y) when X × Y is equipped with the joint law η and the metric s_{ℓ,λ,p}. The proof can be found in Appendix A.

Example 92 Let (P, λ) be a weighted problem with P = (R, R, η, ℓ, H), where ℓ(y, y′) := (y − y′)², H := { f_a | a ∈ R } with f_a(x) := ax, and λ is the standard normal distribution. Let N : R × R → Prob(R × R) be a noise kernel that applies x-dependent noise in the vertical direction. That is, if i_x^y : R → R² is the function i_x^y(y′) = (x, y + y′), set N(x, y) = (i_x^y)_# M(x) for some Markov kernel M : R → Prob(R). Assume that for all x, M(x) is a symmetric measure with finite 2nd moment v(x) and 4th moment k(x). Theorem 91 gives us

  d_{R,2}(P, P_N) ≤ d_{W,1}^{s_{ℓ,λ,2}}( δ_{Id_{R²}}, N ).

The right-hand side is difficult to compute exactly, so we will compute the larger quantity d_{W,2}^{s_{ℓ,λ,2}}( δ_{Id_{R²}}, N ). We first produce an explicit formula for s_{ℓ,λ,2}((x, y), (x′, y′)):

  s²_{ℓ,λ,2}((x, y), (x′, y′)) = ∫ ( ℓ_h(x, y) − ℓ_h(x′, y′) )² λ(dh) = ∫_R ( (ax − y)² − (ax′ − y′)² )² e^{−a²/2} / √(2π) da.

After expanding the polynomial term, the integral can be split into five terms that represent various moments of the standard normal distribution. We arrive at the polynomial

  s²_{ℓ,λ,2}((x, y), (x′, y′)) = 3(x² − x′²)² + 4(xy − x′y′)² + (y² − y′²)² + 2(x² − x′²)(y² − y′²).

Next we can compute d_{W,2}^{s_{ℓ,λ,2}}( δ_{(x,y)}, N(x, y) ) for any fixed (x, y) ∈ R².
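(As an aside, the closed form for s²_{ℓ,λ,2} just derived can be sanity-checked against a direct Monte Carlo evaluation of the integral over a ~ N(0,1); the sketch below, with names of our own choosing, agrees to within sampling error.)

```python
import numpy as np

def s2_closed(x, y, xp, yp):
    """Closed form s^2_{l,lambda,2}((x,y),(x',y')) derived above."""
    u, v, w = x**2 - xp**2, x*y - xp*yp, y**2 - yp**2
    return 3*u**2 + 4*v**2 + w**2 + 2*u*w

def s2_monte_carlo(x, y, xp, yp, n=2_000_000, seed=0):
    """Direct evaluation of the integral over predictors h_a with a ~ N(0,1)."""
    a = np.random.default_rng(seed).normal(size=n)
    return np.mean(((a*x - y)**2 - (a*xp - yp)**2) ** 2)

print(s2_closed(1.0, 0.5, -0.3, 2.0))
print(s2_monte_carlo(1.0, 0.5, -0.3, 2.0))  # agrees up to Monte Carlo error
```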
Indeed, there is only one coupling between δ_{(x,y)} and N(x, y), so we can write

  ( d_{W,2}^{s_{ℓ,λ,2}} )²( δ_{(x,y)}, N(x, y) ) = ∫ s²_{ℓ,λ,2}((x, y), (x′, y′)) N(x, y)(dx′×dy′)
  = ∫ s²_{ℓ,λ,2}((x, y), (x′, y′)) (i_x^y)_# M(x)(dx′×dy′)
  = ∫ s²_{ℓ,λ,2}((x, y), (x, y′ + y)) M(x)(dy′)
  = ∫ [ 4x²y′² + y′²(y′ + 2y)² ] M(x)(dy′).

Again, expanding the polynomial terms and splitting the integral reduces the problem to moment computations. Recalling that M(x) is symmetric and hence has vanishing odd moments, we arrive at the expression

  ( d_{W,2}^{s_{ℓ,λ,2}} )²( δ_{(x,y)}, N(x, y) ) = 4v(x)x² + 4v(x)y² + k(x).

Integrating against η gives us an expression for ( d_{W,2}^{s_{ℓ,λ,2}} )²( δ_{Id_{R²}}, N ), and hence an upper bound for d²_{R,2}(P, P_N):

  d²_{R,2}(P, P_N) ≤ ∫ [ 4v(x) ∫ y² β(x)(dy) + 4x²v(x) + k(x) ] α(dx),

where α is the first marginal of η, and β is the corresponding disintegration. We take a moment to interpret the integrand of the outer integral. The integral ∫ y² β(x)(dy) is the second moment of β(x). Hence the term v(x) ∫ y² β(x)(dy) measures the correlation between the spread of β(x) and the noise variance v(x) applied at x. The term x²v(x) measures the application of noise far from the origin. Hence this bound suggests that noise has an effect on the squared Risk distance proportional to its 4th moment, with an additional effect proportional to its 2nd moment if the noise is applied at x-values far from zero, or where β(x) is already spread out.

7. The Connected Risk Distance: Risk Landscapes

Having discussed the weighted variant of the Risk distance, we now introduce and study one more variant, called the Connected Risk distance. This variant is motivated by the observation that the Risk distance is insensitive to the contours of a problem's risk landscape. We will show that by strengthening the Risk distance into the Connected Risk distance, a topological descriptor of the risk landscape called the Reeb graph can be made to exhibit stability under the Connected Risk distance, without breaking many of the stability results that the Risk distance enjoys.
The Reeb graph of the risk landscape of a problem P encodes information about the inherent complexity associated with solving P.

7.1 Insensitivities of the Risk Distance

We saw in Section 4.1 that the constrained Bayes risk of a problem is stable under the Risk distance. The constrained Bayes risk B(P) of a problem P is a sufficiently descriptive invariant if we plan to solve P by exhaustively searching the predictor set H for a predictor with minimal risk. If we instead wish to use a more practical search method like gradient descent, then B(P) is a woefully incomplete descriptor. The specific structure of the risk function R_P : H → R_{≥0} becomes crucially important, rather than just its infimum. Is R_P rife with local minima? What are the risks of the local minima? What are their basins like? The Risk distance is not sensitive to these kinds of concerns.

Example 93 Define a problem P := ([0,1], {0,1}, η, ℓ, H) and a family of problems P_t := ([0,1], {0,1}, η, ℓ, H_t), t ∈ (0,1], where

1. ℓ is the 0-1 loss ℓ(y, y′) := 1_{y≠y′};
2. η is the uniform distribution on [0,1] × {0};
3. H := { 1_{[0,a)} | a ∈ [0,1] } ∪ { 1_{(a,1]} | a ∈ [0,1] };
4. H_t := { 1_{[0,a)} | a ∈ [t,1] } ∪ { 1_{(a,1]} | a ∈ [0,1] }.

Here we take the convention that [0,0) = (1,1] = ∅. We claim that d_R(P, P_t) ≤ t. Indeed, by Theorem 51, and since H_t ⊆ H (so that one of the two suprema in the Hausdorff distance vanishes),

  d_R(P, P_t) ≤ d_H^{L¹(η)}( { ℓ_h | h ∈ H }, { ℓ_{h′} | h′ ∈ H_t } )
  = sup_{h∈H} inf_{h′∈H_t} ‖ℓ_h − ℓ_{h′}‖_{L¹(η)}
  = sup_{h∈H\H_t} inf_{h′∈H_t} ∫ |ℓ(h(x), 0) − ℓ(h′(x), 0)| dx
  = sup_{a∈[0,t)} inf_{h′∈H_t} ∫ | 1_{1_{[0,a)}(x)≠0} − 1_{h′(x)≠0} | dx
  = sup_{a∈[0,t)} inf_{h′∈H_t} ∫ | 1_{[0,a)}(x) − h′(x) | dx.

Choosing h′ = 0 gives us the upper bound

  ≤ sup_{a∈[0,t)} ∫ | 0 − 1_{[0,a)}(x) | dx = t.

Hence d_R(P_t, P) → 0 as t → 0. At the same time, the risk landscapes of P and P_t look very different for any t > 0. These risk landscapes are depicted in Figure 6.

Figure 6: The risk landscapes for P and P_t in Example 93.

Under the metric d_{ℓ,η}, H is topologically a circle.
The risk function R_P : H → R_{≥0} has one local minimum, which is achieved at the function h(x) = 0. For t > 0, H_t is formed from H by removing a small segment of the circle, making H_t topologically a closed interval under d_{ℓ,η}. Then R_{P_t} has a local minimum at each of the endpoints of the interval, one at h(x) = 0 and the other at h(x) = 1_{[0,t)}.

In this section, we strengthen the Risk distance to exert more control over the risk landscape. Inspired by Bauer et al. (2021), we restrict the correspondences in the definition of d_R to only those satisfying a certain connectivity property.

7.2 The Connected Risk Distance

If H, H′ are topological spaces, define C_con(H, H′) to be the set of correspondences R ⊆ H × H′ such that the projections R → H and R → H′ are inverse connected, meaning that the preimage of any connected set is connected.

Definition 94 (Connected Risk distance)

  d_{RC}(P, P′) := inf_{γ∈Π(η,η′)} inf_{R∈C_con(H,H′)} dis_{P,P′}(R, γ) = inf_{γ∈Π(η,η′)} inf_{R∈C_con(H,H′)} sup_{(h,h′)∈R} ∫ |ℓ_h(x, y) − ℓ′_{h′}(x′, y′)| γ(dx×dy×dx′×dy′).

Note that the definition of d_{RC} is the same as that of d_R (Definition 26), except that the correspondence R is chosen from C_con(H, H′). We claim d_{RC} is a pseudometric. The only nontrivial property is the triangle inequality, which requires a modified version of the gluing lemma.

Lemma 95 Let H₁, H₂, H₃ be topological spaces, and let R_{1,2} ∈ C_con(H₁, H₂) and R_{2,3} ∈ C_con(H₂, H₃). Then there exists an R ⊆ H₁ × H₂ × H₃ such that the projection maps onto each of the three coordinates are all inverse connected, and such that proj_{1,2}(R) = R_{1,2} and proj_{2,3}(R) = R_{2,3}.

With this lemma in hand, we can prove the triangle inequality in exactly the same way that we did for d_R. The proof of Lemma 95 can be found in Appendix A.

Remark 96 Note that d_{RC} is a larger metric than d_R, since R is infimized over C_con(H₁, H₂) ⊆ C(H₁, H₂). Therefore we would expect d_{RC} to be less stable than d_R.
Some of the stability results for d_R were proven by selecting R to be the diagonal correspondence between a predictor set H and itself. Since C_con(H, H) still contains the diagonal correspondence, these proofs apply just as well to d_{RC}, just as we saw with the Lp-Risk distance in Section 6.5. Specifically, Condition 2 from Section 3.3, as well as the results of Sections 4.3 (stability under bias) and 4.4 (stability under noise), still apply to the Connected Risk distance. Additionally, since d_{RC} is larger than d_R, any problem descriptor that is stable under d_R is also stable under d_{RC}. In particular, the results of Sections 4.1 and 4.2 still hold for d_{RC}; still stable under the Connected Risk distance are the loss profile set, and consequently the Bayes risk, as well as the Rademacher complexity for an arbitrarily large sample size.

7.3 Risk Landscapes and Reeb Graphs

We will show that a certain topological summary of the risk landscape of a problem, known as the Reeb graph, is stable under the Connected Risk distance.

Definition 97 (Induced Reeb graph) Let X be a topological space and f : X → R a continuous function. Define an equivalence relation ∼ on X by declaring that x ∼ y if and only if x and y belong to the same connected component of some level set of f. Define F := X/∼, and let q : F → R be the unique continuous function such that f = q ∘ π, where π : X → F is the quotient map. The pair (F, q) is called the Reeb graph induced by f.

Typically, we picture a Reeb graph (F, q) by thinking of the function q as a height function; we imagine that F is situated in space so that q(x) is the altitude of the point x. See Figure 7. In the computer science literature, many authors define the Reeb graph by declaring that x ∼ y if and only if x and y lie in the same path-connected component instead of the same connected component. While these definitions are not equivalent in general, they are equivalent under reasonable topological assumptions, such as when the level sets of f are locally path-connected.
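Definition 97 has a simple combinatorial shadow: for a function taking finitely many values on a finite graph, the equivalence classes (connected components of level sets) can be computed with a union-find pass, and the surviving edges give a discrete stand-in for the Reeb graph. The sketch below is purely illustrative and uses our own conventions; the paper itself works with general topological spaces.

```python
def reeb_graph(values, edges):
    """Discrete Reeb graph of f on a finite graph.
    values: f-value per vertex; edges: list of (u, v) pairs.
    Vertices u, v are merged when they are adjacent and f(u) == f(v)
    (same connected component of a level set). Returns the quotient:
    one node per class, its height q, and the induced edge set."""
    parent = list(range(len(values)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for u, v in edges:
        if values[u] == values[v]:
            parent[find(u)] = find(v)

    classes = sorted({find(i) for i in range(len(values))})
    height = {c: values[c] for c in classes}
    quotient_edges = {(min(find(u), find(v)), max(find(u), find(v)))
                      for u, v in edges if find(u) != find(v)}
    return classes, height, quotient_edges

# A 6-cycle with one min (height 0) and one max (height 2): each intermediate
# level set has two components, so the Reeb graph is itself a cycle
# (two arcs from the minimum class to the maximum class).
values = [0, 1, 2, 2, 1, 0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
classes, height, qedges = reeb_graph(values, edges)
print(len(classes), len(qedges))  # 4 4
```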
Reeb graphs are used as summaries of landscapes induced by real-valued functions. Originally devised by Georges Reeb to study functions on smooth manifolds (Reeb, 1946) and independently rediscovered by Kronrod (1950), Reeb graphs were first popularized as a computational tool by Shinagawa et al. (1991). Since then, Reeb graphs have grown in popularity, finding applications in the comparison (Hilaga et al., 2001; Escolano et al., 2013), parameterization (Patanè et al., 2004; Zhang et al., 2005), and denoising (Wood et al., 2004) of shapes, in symmetry detection (Thomas and Natarajan, 2011), and in the sketching of static (Chazal and Sun, 2013; Ge et al., 2011) and time-varying (Edelsbrunner et al., 2004) data, among many others. See Biasotti et al. (2008) and Yan et al. (2021) for surveys.

Figure 7: An illustration of a function f on the unit square and its induced Reeb graph. The two local minima of the function produce downward-pointing leaves of equal height on the Reeb graph. Similarly, the three local maxima produce upward-pointing leaves, with the lower maximum producing a lower leaf.

While it is easiest to imagine Reeb graphs as topological graphs⁸ with attached height functions, one should be wary that, in general, the Reeb graph of a function need not admit the structure of a topological graph. Thankfully, many classes of spaces and functions are known to induce Reeb graphs which are also topological graphs. One list of sufficient conditions is provided by de Silva et al. (2016, Example 2.2).

Definition 98 Let P be a problem. The Reeb graph of P, denoted Reeb(P), is the Reeb graph induced by the risk function R_P : H → R_{≥0}.

The Reeb graph of a problem P is a summary of the risk landscape (see Figure 8). For instance, the height of the lowest point of Reeb(P) is the constrained Bayes risk. If R_P has a unique minimizer, then Reeb(P) will have a unique lowest point.
Conversely, if Reeb(P) has a unique lowest point, then the minimizers of R_P form a connected set. Local minima appear as downward-pointing leaves, with the heights of the edges corresponding to the depths of the basins around those minima. Similarly, local maxima are represented by upward-pointing leaves. The Reeb graph of P therefore captures information that can be useful for quantifying the complexity of the problem P.

Reeb graphs can be compared via any of several related metrics. The largest such metric in popular use is called the universal distance (Bauer et al., 2021).

Definition 99 (Universal distance on Reeb graphs) The universal distance between two Reeb graphs (F, f) and (G, g) is given by

  d_U((F, f), (G, g)) := inf_{Z, f̂, ĝ : Z→R} ‖ f̂ − ĝ ‖_∞,

where the infimum ranges over all spaces Z and functions f̂, ĝ : Z → R such that f̂ induces the Reeb graph (F, f) and ĝ induces the Reeb graph (G, g).

8. The following definition will suffice: a topological graph is a topological space produced from a graph by realizing the vertices as isolated points and the edges as arcs connecting them; see (Hatcher, 2005, Chapter 0) for details.

Figure 8: The Reeb graph of some problem P. Based on the downward-pointing leaves of the Reeb graph, this problem's risk landscape seems to have three basins. The height of the lowest point in the Reeb graph corresponds to the constrained Bayes risk B(P).

Even with respect to the universal distance, the Reeb graph of a problem is stable under the Connected Risk distance.

Theorem 100 Let P and P′ be problems. Then

  |B(P) − B(P′)| ≤ d_U( Reeb(P), Reeb(P′) ) ≤ d_{RC}(P, P′).

Hence the Reeb graph of a problem is a stronger descriptor than the constrained Bayes risk while remaining stable under the Connected Risk distance. The proof can be found in Appendix A.
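For risk functions on an interval, the leaf structure that Theorem 100 controls can be computed directly: sweeping a sampled risk curve from below (0-dimensional sublevel-set persistence) recovers each downward-pointing leaf together with the depth of its basin. The sketch below assumes R_P has been sampled on a grid; the code and its conventions are ours, not the paper's.

```python
def basin_depths(f):
    """Depths of the basins (downward-pointing Reeb leaves) of a sampled
    1-D function f: for each local minimum, the height climbed before its
    basin merges into a deeper one; the deepest basin gets depth max-min.
    Standard 0-dimensional sublevel-set persistence on a path graph."""
    n = len(f)
    order = sorted(range(n), key=lambda i: f[i])
    parent = [None] * n              # None = vertex not yet in the sweep

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    birth = {}                       # root -> height of its local minimum
    depths = []
    for i in order:                  # grow sublevel sets from below
        parent[i] = i
        birth[i] = f[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # elder rule: the shallower basin dies at height f[i]
                    young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                    d = f[i] - birth[young]
                    if d > 0:        # d == 0 means no leaf, just a slope point
                        depths.append(d)
                    parent[young] = old
    depths.append(max(f) - min(f))   # the surviving (deepest) basin
    return sorted(depths, reverse=True)

# A risk curve with a deep basin at value 0.1 and a shallow one at 0.5:
risk = [0.9, 0.3, 0.1, 0.4, 0.8, 0.5, 0.7]
print(basin_depths(risk))  # two basins, depths ~0.8 and ~0.3
```

In the language of Example 101 below the text, only basins whose depth exceeds twice the perturbation size are guaranteed to survive a corruption of the problem.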
Example 101 Let P be a problem whose loss ℓ is a metric, such as a 0-1 loss in a classification problem, mean absolute error in a regression problem, or a Wasserstein distance in a probability estimation problem. Suppose that we expect noise in the labels of our data such that the average distance under ℓ between the true label and the observed label is ε₁. Write η′ for the noisy distribution from which the observed data is drawn. Suppose also that the loss function ℓ is difficult to compute exactly or has poor optimization properties, but that there is an approximation ℓ′ that does not share these issues and is ε₂-close to ℓ in the supremum norm. Replacing η with the noisy η′ and ℓ with the approximation ℓ′ could introduce new local minima to the risk function. We would like to understand how bad these local minima can be. Let P′ = (X, Y, η′, ℓ, H) and P″ = (X, Y, η′, ℓ′, H). We aim to bound the universal distance between the Reeb graphs of the idealized problem P and the corrupted problem P″. Since Theorem 49 applies to the Connected Risk distance (Remark 96), we have d_{RC}(P, P′) ≤ ε₁. Since d_{RC} also satisfies Condition 2 from Section 3.3, we also have

  d_{RC}(P′, P″) ≤ max_{h∈H} ∫ |ℓ(h(x), y) − ℓ′(h(x), y)| η′(dx×dy) ≤ ε₂.

Then, using the triangle inequality for d_{RC} and Theorem 100, we can write the bound

  d_U( Reeb(P), Reeb(P″) ) ≤ d_{RC}(P, P″) ≤ d_{RC}(P, P′) + d_{RC}(P′, P″) ≤ ε₁ + ε₂.

In other words, if we replace the idealized problem P with the corrupted problem P″, the Reeb graph can change by at most ε₁ + ε₂ in the universal distance. Hence any downward-pointing leaf of Reeb(P) of depth more than 2(ε₁ + ε₂) from top to bottom has a corresponding downward-pointing leaf in Reeb(P″), and the difference in the heights of the tips of two corresponding leaves is at most ε₁ + ε₂.
We conclude that a basin of depth greater than 2(ε₁ + ε₂) in the risk landscape of the original problem corresponds to some basin in the corrupted problem, and that the risks at the bottoms of two corresponding basins differ by at most ε₁ + ε₂. Similarly, any new downward-pointing leaf in the Reeb graph created by corrupting P to get P″ can have length at most 2(ε₁ + ε₂) from top to bottom, so any new basin introduced by the corruption can have depth at most 2(ε₁ + ε₂) from the deepest to the highest point of the basin.

8. Discussion

Machine learning is currently engaged in a rapid and wide-ranging exploration of its problem space. This paper proposes a guiding and organizing principle, focusing on supervised learning. As the field grows explosively, such structure is increasingly needed to make sense of emerging approaches. In this work, we argue that, given a notion of distance between supervised learning problems, various corruptions and modifications of problems can be seen from a geometric perspective. We propose the Risk distance and two variants thereof as reasonable choices for this distance notion, bolstering our claims with a garniture of stability results.

The development in this paper can be seen as a direct descendant of earlier work in the machine learning community on reductions between learning problems, surrogate losses, and different forms of regularization. The present work has opened the way for further understanding of machine learning analogous to several productive, related research programs. The idea of the function space lies at the center of the abstract theory of functional analysis (Birkhoff and Kreyszig, 1984, page 259), the development of which was highly productive for 20th-century mathematics (Dieudonné, 1981). Similarly productive for metric geometry was the work of Gromov (1981), which created the metric space of all metric spaces.
Other similar examples include category theory, which is concerned with putting relational structures on collections of objects (Mac Lane, 2013), and the theories of optimal transport (Villani, 2008) and information geometry (Amari, 2016), which both propose geometric spaces of probability measures.

Common among all of these examples is the viewpoint that to understand an object, it is productive to place it in a larger space of similar objects. Following this viewpoint, the present work opens the door to similar techniques by establishing candidates for spaces of supervised learning problems. In particular, the definition of the Risk distance directly echoes the motifs of an established lineage of distances from metric geometry and optimal transport (Section 2.3). We have shown that the geometry is meaningful: the geometric notion of a geodesic is related to optimal couplings and correspondences (Section 5.1), and the topological density of classification problems (Section 5.3) is useful in proving the convergence of empirical problems (Section 5.4).

Throughout Section 6, the weighting λ placed on the set of predictors can be interpreted as a Bayesian prior, an approach aligned with the principles of Bayesian statistics. However, classical Bayesian frameworks do not consider a space of problems or attempt to endow it with a metric structure, as we do here. While Bayesian decision theory (Berger, 1985) incorporates elements such as loss functions, it does not address the geometry of the problem space. It is conceivable that the present work may offer new perspectives in that direction. More generally, we are optimistic that further geometric, topological, and even categorical studies of the space of problems can shed light on questions of interest in supervised learning.
For example, the following concrete research directions appear to be natural candidates for exploration through the lens of the ideas developed in this paper:

- Convergence rates for the empirical problem. We posed Question 71, aimed at making Theorem 69, on the convergence of the empirical problem, more explicit and quantitative.

- Leverage the framework to design broader, practical reductions between learning problems, allowing more components of ML tasks to vary. As shown in Section 5.3, classification problems are dense, in the sense of the Risk distance, within the space of all compact problems. This suggests a structural connection between classification and regression problems, despite their apparent differences. Conversely, it is natural to investigate reductions in the opposite direction: for example, reducing a classification problem with severe label noise to a regression problem with biased sampling, where both problems are close in Risk distance. Such a reduction would enable robust regression techniques to be repurposed for noisy classification tasks, facilitating the transfer of insights across distinct problem classes.

- Analyze the robustness of other theoretical results, e.g., fast rates of convergence, through the lens of the Risk distance (or suitable variants thereof). In particular, is fast convergence a stable property under appropriate structural assumptions on the problems?

- Assess the empirical value of the optimal couplings and correspondences produced by our theory. An optimal coupling and correspondence between two problems (as studied in Section 5.2) provide a kind of dictionary relating the two problems. What new insights can be drawn from such a dictionary?

- Investigate concrete models of sampling bias within our framework, such as distribution shift (Quiñonero-Candela et al., 2009) and selection bias (Meng, 2018), which arise in many practical learning scenarios.

Acknowledgments

The authors thank Dr.
Mark Reid for interesting discussions on topics related to this work that took place during 2010-2011. The authors also thank Dr. Nan Lu for comments on a draft. F.M. and B.V. were partially supported by the NSF under grants CCF-1740761, DMS-1547357, CCF-2310412 and RI-1901360. R.W. was supported by NICTA and the Deutsche Forschungsgemeinschaft under Germany's Excellence Strategy, EXC number 2064/1, Project number 390727645.

Appendix A. Relegated Proofs

A.1 Proofs and Details from Section 3

Proof [Theorem 27] The only nontrivial pseudometric axiom to prove is the triangle inequality. Furthermore, the claim that d_R vanishes on weakly isomorphic problems will follow from the triangle inequality and the fact that d_R satisfies Condition 1 (Theorem 28). That is, if P'' is a common simulation of P and P', then d_R(P, P') ≤ d_R(P, P'') + d_R(P'', P') = 0 + 0 = 0. Hence we need only prove that d_R satisfies the triangle inequality. Let P_i = (X_i, Y_i, η_i, ℓ_i, H_i) for i = 1, 2, 3. Let γ_{1,2} ∈ Π(η_1, η_2), γ_{2,3} ∈ Π(η_2, η_3), R_{1,2} ∈ C(H_1, H_2), and R_{2,3} ∈ C(H_2, H_3) be arbitrary. We apply the gluing lemma (Lemma 1) and its counterpart for correspondences (Lemma 2). Let γ be a gluing of γ_{1,2} and γ_{2,3}, and let γ_{1,3} be its marginal on the first and third coordinates. Similarly, let R be a gluing of R_{1,2} and R_{2,3}, and let R_{1,3} be its image on its first and third components. We can then write

dis_{P_1,P_3}(R_{1,3}, γ_{1,3}) = sup_{(h_1,h_3) ∈ R_{1,3}} ∫ |ℓ_1(h_1(x_1), y_1) − ℓ_3(h_3(x_3), y_3)| γ_{1,3}(dx_1 × dy_1 × dx_3 × dy_3)
= sup_{(h_1,h_2,h_3) ∈ R} ∫ |ℓ_1(h_1(x_1), y_1) − ℓ_3(h_3(x_3), y_3)| γ(dx_1 × dy_1 × dx_2 × dy_2 × dx_3 × dy_3)
≤ sup_{(h_1,h_2,h_3) ∈ R} ∫ |ℓ_1(h_1(x_1), y_1) − ℓ_2(h_2(x_2), y_2)| γ(dx_1 × dy_1 × dx_2 × dy_2 × dx_3 × dy_3) + sup_{(h_1,h_2,h_3) ∈ R} ∫ |ℓ_2(h_2(x_2), y_2) − ℓ_3(h_3(x_3), y_3)| γ(dx_1 × dy_1 × dx_2 × dy_2 × dx_3 × dy_3)
= sup_{(h_1,h_2) ∈ R_{1,2}} ∫ |ℓ_1(h_1(x_1), y_1) − ℓ_2(h_2(x_2), y_2)| γ_{1,2}(dx_1 × dy_1 × dx_2 × dy_2) + sup_{(h_2,h_3) ∈ R_{2,3}} ∫ |ℓ_2(h_2(x_2), y_2) − ℓ_3(h_3(x_3), y_3)| γ_{2,3}(dx_2 × dy_2 × dx_3 × dy_3)
= dis_{P_1,P_2}(R_{1,2}, γ_{1,2}) + dis_{P_2,P_3}(R_{2,3}, γ_{2,3}).
Since we can construct such a γ1,3 and R1,3 from any given γ1,2, γ2,3, R1,2, R2,3, we can take the infimum of each side to get the desired inequality. Proof [Theorem 28] First we demonstrate that the Risk distance does indeed satisfy Conditions 1 and 2. Let P 1 be a simulation of P via the maps f1 : X1 Ñ X and f2 : Y 1 Ñ Y . Geometry and Stability of Supervised Learning Problems Define f : f1 ˆ f2. Since f7η1 η, f induces a coupling γf P Πpη1, ηq. Define R Ď H1 ˆ H to be all the pairs ph1, hq such that h f1 f2 h1. Since f induces a simulation, R is a correspondence. By definition of the Risk distance, we can now write d Rp P, P 1q ď sup ph1,hq PR ż ˇˇℓ1ph1px1q, y1q ℓphpxq, yq ˇˇ γfpdx1ˆdy1ˆdxˆdyq sup ph1,hq PR ż ˇˇℓ1ph1px1q, y1q ℓph f1px1q, f2py1qq ˇˇ η1pdx1ˆdy1q. We claim the integrand is identically zero. Indeed, ℓph f1px1q, f2py1qq ℓpf2 h1px1q, f2py1qq ℓ1ph1px1q, y1q. Hence d Rp P, P 1q 0. Let P p X, Y, η, ℓ, Hq and P 1 p X, Y, η, ℓ1, Hq be any pair of problems that differ only in their loss functions. We can bound d Rp P, P 1q by selecting the diagonal coupling between η and itself, and the diagonal correspondence between H and itself. This directly yields Condition 2. Now suppose d is a pseudometric satisfying Conditions 1 and 2, and let P1 and P2 be problems. Let γ P Πpη1, η2q and R P Cp H1, H2q be arbitrary. Define two new problems P 1 1 : p X1 ˆ X2, Y1 ˆ Y2, γ, ℓ1 1, Rˆq P 1 2 : p X1 ˆ X2, Y1 ˆ Y2, γ, ℓ1 2, Rˆq, ℓ1 1ppy1, y2q, py1 1, y1 2qq : ℓ1py1, y1 1q, ℓ1 2ppy1, y2q, py1 1, y1 2qq : ℓ2py2, y1 2q, Rˆ : th1 ˆ h2|ph1, h2q P Ru. Then P 1 1 is a simulation of P1 via the projection maps X1 ˆ X2 Ñ X1 and Y1 ˆ Y2 Ñ Y1, so by Condition 1, dp P1, P 1 1q 0. Similarly, dp P2, P 1 2q 0. Combining this with Condition 2, we get dp P1, P2q dp P 1 1, P 1 2q ď max h1ˆh2PRˆ ż ˇˇℓ1 1pph1 ˆ h2qpx1, x2q, py1, y2qq ℓ1 2pph1 ˆ h2qpx1, x2q, py1, y2qq ˇˇ γpdx1ˆdx2ˆdy1ˆdy2q max ph1,h2q PR ż ˇˇℓ1ph1px1q, y1q ℓ2ph2px2q, y2q ˇˇ γpdx1ˆdx2ˆdy1ˆdy2q dis P1,P2p R, γq. 
Taking an infimum over all choices of γ and R finishes the proof. M emoli, Vose, Williamson Proof [Proposition 29] Let P and P 1 be weakly isomorphic, and let P 2 be a common simulation of both P and P 1 via the maps Define the map m : X2 ˆ Y 2 Ñ X ˆ Y ˆ X1 ˆ Y 1 px2, y2q ÞÑ pf1px2q, f2py2q, g1px2q, g2py2qq, which, we note, makes the following diagram commute: X ˆ Y ˆ X1 ˆ Y 1 X ˆ Y X1 ˆ Y 1 f1ˆf2 g1ˆg2 m Define γ : m7η2. Note that, since the above diagram commutes, γ is a coupling between η and η1. Also, let φ : H2 Ñ H be such that ℓ2 h2 ℓφph2q pf1 ˆf2q holds η2-almost everywhere. Define ψ : H2 Ñ H1 similarly. Then R : pφph2q, ψph2qq ˇˇh2 P H2( is a correspondence between H and H1. Finally, compute dis P,P 1p R, γq sup ph,h1q PR ż ˇˇℓhpx, yq ℓ1 h1px1, y1q ˇˇ γpdxˆdyˆdx1ˆdy1q sup ph,h1q PR ż ˇˇℓh π12px, y, x1, y1q ℓ1 h1 π34px, y, x1, y1q ˇˇ γpdxˆdyˆdx1ˆdy1q sup ph,h1q PR ż ˇˇℓh π12 mpx2, y2q ℓ1 h1 π34 mpx2, y2q ˇˇ η2pdx2ˆdy2q sup ph,h1q PR ż ˇˇℓhpf1px2q, f2py2qq ℓ1 h1pg1px2q, g2py2qq ˇˇ η2pdx2ˆdy2q ż ˇˇℓφph2qpf1px2q, f2py2qq ℓ1 ψph2qpg1px2q, g2py2qq ˇˇ η2pdx2ˆdy2q ż ˇˇℓ2 h2px2, y2q ℓ2 h2px2, y2q ˇˇ η2pdx2ˆdy2q 0. Now suppose that there exist R P Cp H, H1q and γ P Πpη, η1q such that dis P,P 1p R, γq 0. Construct a new problem P 2 p X ˆ X1, Y ˆ Y 1, τ7γ, ℓ2, Rˆq Geometry and Stability of Supervised Learning Problems τ : X ˆ Y ˆ X1 ˆ Y 1 Ñ X ˆ X1 ˆ Y ˆ Y 1 swaps the second and third components, ℓ2ppy1, y1 1q, py2, y1 2qq : ℓpy1, y2q Rˆ : th ˆ h1|ph, h1q P Ru. Then the projection maps f1 : X ˆ X1 Ñ X and f2 : Y ˆ Y 1 Ñ Y show that P 2 is a simulation of P. Indeed, since γ is a coupling of η and η1, we have pf1 ˆ f2q7τ7γ ppf1 ˆ f2q τq7γ pπ12q7γ η. Furthermore, for any h P H, we can consider some function ph ˆ h1q P Rˆ and write ℓ2 phˆh1qppx, x1q, py, y1qq ℓ2pphpxq, h1px1qq, py, y1qq ℓphpxq, yq ℓhpf1px, x1q, f2py, y1qq. Similarly, P 2 is a simulation of P 1 via the projection maps g1 : X ˆ X1 Ñ X1 and g2 : Y ˆ Y 1 Ñ Y 1. 
Indeed, an almost identical calculation to the one above shows that (g_1 × g_2)♯τ♯γ = η'. Furthermore, if h' ∈ H' and (h × h') ∈ R̂, we have already calculated that ℓ''_{h×h'}((x, x'), (y, y')) = ℓ(h(x), y). The fact that R and γ have Risk distortion zero implies that the latter is equal, τ♯γ-almost everywhere, to ℓ'(h'(x'), y') = ℓ'_{h'}(g_1(x, x'), g_2(y, y')), completing the proof.

A.1.1 Connecting the Risk Distance with the Gromov-Wasserstein Distance

Here we discuss a connection with a variant of the Gromov-Wasserstein distance considered by Hang et al. (2019). Given two metric measure spaces X, X', a correspondence R ∈ C(X, X') and a coupling µ ∈ Π(µ_X, µ_{X'}), one considers the following notion of distortion:

dis(R, µ) := sup_{(x_2,x'_2) ∈ R} ∫ |d_X(x_1, x_2) − d_{X'}(x'_1, x'_2)| µ(dx_1 × dx'_1),

which leads to the definition d_GW(X, X') := inf_{R,µ} dis(R, µ). This definition, which blends an L∞ matching (expressed through the correspondence R) with an L1 one (expressed via the coupling µ), was considered by Hang et al. (2019) due to its beneficial interaction with stability questions surrounding Fréchet functions in the persistent homology setting.9 With the notation and definitions from Section 3.5, we have the following statement.

Proposition 102 For all compact metric measure spaces X and X': d_R(i(X), i(X')) = d_GW(X, X').

9. The precise variant considered in (Hang et al., 2019) uses continuous correspondences. The version we consider here is structurally identical but slightly softer due to our use of arbitrary correspondences.

Proof First note that any correspondence R between H_X and H_{X'} can be regarded as a correspondence between X and X'. This is so because all functions in H_X (resp. in H_{X'}) are constant.
We then have the following calculation: disip Xq,ip X1qp R, γq sup ph,h1q PR ż ˇˇd Xphpx0q, x1q d X1ph1px1 0q, x1 1q ˇˇ γpdx0ˆdx1ˆdx1 0ˆdx1 1q sup px2,x1 2q PR ż ˇˇd Xpx2, x1q d X1px1 2, x1 1q ˇˇ γpdx0ˆdx1ˆdx1 0ˆdx1 1q sup px2,x1 2q PR ż ˇˇd Xpx2, x1q d X1px1 2, x1 1q ˇˇ pπ2,4q7γpdx0ˆdx1ˆdx1 0ˆdx1 1q disp R, pπ2,4q7γq. The above equality immediately implies that d Rpip Xq, ip X1qq ě d GWp X, X1q. The reverse inequality can be obtained from the following fact. Let τ : X ˆ X1 Ñ X ˆ X ˆ X1 ˆ X1 be the map px, x1q ÞÑ px, x, x1, x1q. Then, if µ P ΠpµX, µX1q, γµ : τ7µ is an element of Πpp Xq7µX, p X1q7µX1q. Since π2,4 τ id : X ˆ X1 Ñ X ˆ X1, the identity map, pπ2,4q7γµ µ and the calculation above gives disip Xq,ip X1qp R, γµq dispµq. From this we obtain that d Rpip Xq, ip X1qq ď d GWp X, X1q. Proof [Proposition 33] If d GW,pp X, X1q 0, then X and X1 are isomorphic as metric measure spaces. Using the isomorphism X Ñ X1, it is not hard to construct a coupling and correspondence between ip Xq and ip X1q of risk distortion zero. Conversely, let d Rpip Xq, ip X1qq 0. It is not a priori obvious that there exists a correspondence R P Cp HX, HX1q and a coupling γ P Πpp Xq7µX, p X1q7µX1q that together achieve the infimal risk distortion of zero. In Section 5.2, we will explore the existence of such couplings and correspondences. In this case, since d X and d X1 are continuous functions and X and X1 are assumed to be compact, Theorem 60 does guarantee that there exists an R and a γ with disip Xq,ip X1qp R, γq 0. Furthermore, recall that a correspondence R between HX and HX1 is equivalent to a correspondence between X and X1. Hence, as in the proof of Proposition 102, we have 0 disip Xq,ip X1qp R, γq disp R, pπ2,4q7γq. Set R1 : suppppπ2,4q7γq. Since the integrand defining disp R, pπ2,4q7γq is continuous and non-negative, we can conclude that d Xpx2, x1q d X1px1 2, x1 1q (3) for all px2, x1 2q P R and px1, x1 1q P R1. 
Furthermore, since (π_{2,4})♯γ is a coupling between the fully supported measures µ_X and µ_{X'}, its support R' is a correspondence between X and X' (see Mémoli, 2011, Lemma 2.2). We claim that R = R'. Indeed, if (x_2, x'_2) ∈ R, selecting any pair (x_2, x'_1) ∈ R' yields d_X(x_2, x_2) = d_{X'}(x'_2, x'_1) and hence x'_2 = x'_1. This shows R ⊆ R', and a similar argument shows the opposite inclusion. We furthermore claim that R is the graph of a bijection. Let (x, x'), (x, x'') ∈ R. Then d_X(x, x) = d_{X'}(x', x''), showing x' = x''. Again, a similar argument holds when reversing the roles of X and X'. Let f : X → X' be the bijection whose graph is R. Then equation (3) becomes d_X(x_2, x_1) = d_{X'}(f(x_2), f(x_1)), showing that f is an isometry. Additionally, since (π_{2,4})♯γ is a coupling of µ_X and µ_{X'} which is supported on R, the projections R → X and R → X' are measure-preserving bijections. It follows that the composition of the natural maps X → R → X', which agrees with f, is a measure-preserving map. Hence f is an isomorphism of metric measure spaces, and d_{GW,1}(X, X') = 0.

A.2 Proofs from Section 4

Proof [Theorem 37] Let Prob_1(R_{≥0}) ⊆ Prob(R_{≥0}) be the subset of probability measures with finite mean. The mean function mean : Prob_1(R_{≥0}) → R_{≥0} is 1-Lipschitz when the domain is endowed with the 1-Wasserstein distance d_{W,1} and the codomain is given the Euclidean distance. We will now lift the mean function to a function on power sets by applying it elementwise. That is, we define

Pow(mean) : Pow(Prob_1(R_{≥0})) → Pow(R_{≥0}), {α_i | i ∈ I} ↦ {mean(α_i) | i ∈ I}.

We also endow the domain and codomain with the Hausdorff distance, where the underlying metrics are d_{W,1} and the Euclidean distance respectively. Since the mean function is 1-Lipschitz, so is Pow(mean). The infimum function inf : Pow(R_{≥0}) → R_{≥0} is 1-Lipschitz as well, so the composition inf ∘ Pow(mean) is 1-Lipschitz.
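The 1-Lipschitz property of the mean under d_{W,1} is easy to sanity-check numerically. The following minimal sketch (not part of the proof, using numpy and scipy, with arbitrary exponential samples standing in for measures on R_{≥0}) verifies that the gap between means never exceeds the 1-Wasserstein distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two empirical measures on R>=0 (stand-ins for elements of Prob_1(R>=0)).
rng = np.random.default_rng(0)
mu = rng.exponential(scale=1.0, size=500)
nu = rng.exponential(scale=2.0, size=500)

mean_gap = abs(mu.mean() - nu.mean())
w1 = wasserstein_distance(mu, nu)  # d_{W,1} between the two empirical measures

# Since x -> x is 1-Lipschitz, Kantorovich duality gives |mean(mu) - mean(nu)| <= d_{W,1}(mu, nu).
assert mean_gap <= w1 + 1e-12
```

The inequality holds for any pair of samples, not just this seed, because the identity function is 1-Lipschitz and hence feasible in the dual formulation of d_{W,1}.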
In other words, for any M, N Ď Prob1p Rě0q, we have ˇˇ inf tmeanpµq|µ P Mu inf tmeanpνq|ν P Nu ˇˇ ď dd W,1 H p M, Nq. By Proposition 36, applying this inequality when M and N are loss profile sets gives the first desired inequality. To prove the second inequality, observe that dp P, P 1q : dd W,1 H p Lp Pq, Lp P 1qq is a pseudometric on the collection of all problems. By Theorem 28, we need only show that d satisfies Conditions 1 and 2. Beginning with Condition 1, let P 1 be a simulation of P via the maps f1 : X1 Ñ X and f2 : Y 1 Ñ Y . Let f : f1 ˆ f2. If h P H and h1 P H1 are such that f2 h h1 f2, then ℓ1 h1 fpx, yq ℓ1ph1 f1pxq, f2pyqq ℓ1pf2 hpxq, f2pyqq ℓphpxq, yq ℓhpx, yq. Hence the loss profiles of h and h1 are equal: pℓ1 h1q7η1 pℓ1 h1 fq7η pℓhq7η. Such an h1 P H1 exists for every h P H and vice versa by definition of a simulation, so we conclude that Lp Pq Lp P 1q and dp P, P 1q 0. For Condition 2, let h P H. We can bound the Wasserstein distance between the loss profiles pℓhq7η and pℓ1 hq7η by selecting the coupling pℓh, ℓ1 hq7η, which yields d W,1ppℓhq7η, pℓ1 hq7ηq ď ż ˇˇr s ˇˇpℓh, ℓ1 hq7ηpdrˆdsq ż ˇˇℓhpx, yq ℓ1 hpx, yq ˇˇ ηpdx ˆ dyq. M emoli, Vose, Williamson Since this holds for all h, Condition 2 is satisfied as well. Proof [Theorem 41] Let R P Cp H, H1q and γ P Πpη, η1q be arbitrary. Then ˇˇRmp Pq Rmp P 1q ˇˇ ż ż sup h PH i 1 σiℓhpxi, yiq Radbmpdσq ηbmpdx ˆ dyq ż ż sup h1PH1 1 m i 1 σiℓ1 h1px1 i, y1 iq Radbmpdσq ηbmpdx1 ˆ dy1q ˇˇˇˇ ď ż ż sup ph,h1q PR i 1 σiℓhpxi, yiq 1 i 1 σiℓ1 h1px1 i, y1 iq ˇˇˇˇˇ Radbmpdσq γbmpdxˆdyˆdx1ˆdy1q ď ż sup ph,h1q PR ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇ γbmpdxˆdyˆdx1ˆdy1q. Let pγ be the unique probability measure on the countable product p X ˆ Y q N whose marginal on the first m components is γbm. Then the above integral can be written ż sup ph,h1q PR ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇ pγpdxˆdyˆdx1ˆdy1q. 
(4) We now aim to establish the bound ż sup ph,h1q PR ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇ pγpdxˆdyˆdx1ˆdy1q sup ph,h1q PR ż ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇγpdxˆdyˆdx1ˆdy1q since infimizing this inequality over γ and R will give us the result. It will suffice to establish ż sup ph,h1q PR ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇ ż ˇˇℓhpxi, yiq ℓ1 h1px1 i, y1 iq ˇˇ ff pγpdxˆdyˆdx1ˆdy1q ď 2Radmp Fq, which, in probabilistic notation, is written E sup ph,h1q PR ˇˇℓhp Xi, Yiq ℓ1 h1p X1 i, Y 1 i q ˇˇ E ˇˇℓhp X, Y q ℓ1 h1p X1, Y 1q ˇˇ ff ď 2Radmp Fq (5) where the first expectation is taken over the tuples p Xi, Yi, X1 i, Y 1 i q each sampled i.i.d. from γ, and the second expectation is over p X, Y, X1, Y 1q also sampled from γ independently of the other tuples. The expression on the left hand side of Equation 5 is well-studied in the Geometry and Stability of Supervised Learning Problems statistical learning theory literature, and the bound in equation 5 is easily obtained by a symmetrization argument like the one laid out in (Liao, 2020, Theorem 4.1). Now assume F is a universal Glivenko-Cantelli class. Return to the expression (4) and let m Ñ 8. By the Glivenko-Cantelli assumption, the integrand of (4) converges almost everywhere to sup ph,h1q PR ż ˇˇℓhpx, yq ℓ1 h1px1, y1q ˇˇ γpdxˆdyˆdx1ˆdy1q dis P,P 1p R, γq. Furthermore, since ℓand ℓ1 are bounded, by the dominated convergence theorem, the limit of (4) itself is dis P,P 1p R, γq as well. Infimizing over the choices of R and γ completes the proof. Proof [Lemma 48] We aim to bound d Rp P, P 1q. To this end, define : X ˆ Y ˆ Y Ñ X ˆ Y ˆ X ˆ Y px, y, y1q ÞÑ px, y, x, y1q. Let Πg be the collection of all gluings of η and η1 along X. That is, Πg : λ P Probp X ˆ Y ˆ Y q ˇˇpπ1,2q7λ η, pπ1,3q7λ η1( . By the gluing lemma (Lemma 1), Πg is nonempty. Note that if λ P Πg, then 7λ P Πpη, η1q. 
We can then write d Rp P, P 1q ď inf γPΠpη,η1q ż ˇˇℓphpxq, yq ℓphpx1q, y1q ˇˇ γpdx ˆ dy ˆ dx1 ˆ dy1q ż ˇˇℓphpxq, yq ℓphpx1q, y1q ˇˇ 7λpdx ˆ dy ˆ dx1 ˆ dy1q ż ˇˇℓphpxq, yq ℓphpxq, y1q ˇˇ λpdx ˆ dy ˆ dy1q ď C inf λPΠg ż d Y py, y1q λpdx ˆ dy ˆ dy1q. Now suppose τ P Πpβ, β1q. Then one can show that the induced coupling λ P Probp X ˆ Y ˆ Y q, which has X-marginal α and X-disintegration τ : X Ñ Probp Y ˆ Y 1q, lies in Πg. Hence we can continue our string of inequalities with ď C inf τPΠpβ,β1q ż ż d Y py, y1q τpxqpdy ˆ dy1qαpdxq Cdd Y W,1pβ, β1q, the desired bound. Proof [Theorem 49] By Lemma 48, we obtain d Rp P, P 1q ď Cdd Y W,1pβ, β1q. M emoli, Vose, Williamson We can write β and β1 as β X pδId X b βq δπ2 β1 X pδId X b βq N. Here X is the diagonal kernel X : X Ñ X ˆX given by Xpxq δpx,xq, and b represents the product of Markov kernels. Note that α X pδId X b βq η. Hence X pδId X b βq is an example of a measure non-increasing Markov kernel (Patterson, 2020, Proposition 6.1), so we can write dd Y W,1p X pδId X b βq δπ2, X pδId X b βq Nq ď dd Y W,1pδπ2, Nq, completing the proof. A.3 Proofs from Section 5 Proof [Lemma 58] The following proof uses ideas similar to those in the proof of Theorem 4.1 of Villani (2008), with the additional wrinkle of an integrand that depends on n and an additional trick of hot-swapping an integrand with a lower semi-continuous version. Since the topology on the domain is induced by a pseudometric, it is enough to consider an arbitrary convergent sequence phn, h1 n, γnq Ñ ph, h1, γq in H ˆ H1 ˆ Πpη, η1q and check that, Ă dis P,P 1ph, h1, γq ď lim inf n Ă dis P,P 1phn, h1 n, γnq. To that end, let ε ą 0 be arbitrary. We define a function f : X ˆ Y ˆ X1 ˆ Y 1 Ñ Rě0 px0, y0, x1 0, y1 0q ÞÑ lim inf px,y,x1,y1qÑpx0,y0,x1 0,y1 0q |ℓhpx, yq ℓ1 h1px1, y1q|. That is, f is a lower semi-continuous version of the function |ℓh ℓ1 h1|, constructed by modifying the latter on its points of discontinuity. 
Indeed, f and |ℓh ℓ1 h1| agree wherever the latter is continuous, which is γ-almost everywhere on X ˆY ˆX1 ˆY 1. To wit, if A Ď X ˆY and A1 Ď X1 ˆ Y 1 are the points at which ℓh and ℓ1 h1 are discontinuous respectively, then the points of discontinuity for |ℓh ℓ1 h1| lie within A ˆ X1 ˆ Y 1 Y X ˆ Y ˆ A1, and one can check γp A ˆ X1 ˆ Y 1 Y X ˆ Y ˆ A1q ď γp A ˆ X1 ˆ Y 1q γp X ˆ Y ˆ A1q ηp Aq η1p A1q 0. Hence f is a lower semi-continuous function which agrees γ-almost everywhere with |ℓh ℓ1 h1|. We can now proceed with our convergence argument; let ε ą 0 and pick N large enough that, for all n ě N, dℓ,ηphn, hq ă ε{3, dℓ1,η1ph1 n, h1q ă ε{3, Geometry and Stability of Supervised Learning Problems and10 ż fpx, y, x1, y1q dγ ă ż fpx, y, x1, y1q dγn ε{3. The latter is possible since f is a lower semi-continuous, non-negative function (Villani, 2008, Lemma 4.3). Then for large n we have Ă dis P,P 1ph, h1, γq ż ˇˇℓhpx, yq ℓ1 h1px1, y1q ˇˇ dγ ż fpx, y, x1, y1q dγ ď ż fpx, y, x1, y1q dγn ε{3 ż ˇˇℓhpx, yq ℓ1 h1px1, y1q ˇˇ dγn ε{3 ď ż ˇˇℓhpx, yq ℓhnpx, yq ˇˇ dγn ż ˇˇℓhnpx, yq ℓ1 h1npx1, y1q ˇˇ dγn ż ˇˇℓh1npx1, y1q ℓ1 h1px1, y1q ˇˇ dγn ε{3 dℓ,ηph, hnq Ă dis P,P 1phn, h1 n, γnq dℓ1,η1ph1, h1 nq ε{3 ă Ă dis P,P 1phn, h1 n, γnq ε as desired. In the last equality, we replace γn with η and η1 in the first and last integrals respectively since γn is a coupling between η and η1. In order to prove Lemma 59 we will require the following technical lemma. Lemma 103 Let X be a pseudometric space and f : X Ñ R a lower semi-continuous function. Then the map F : Powp Xq Ñ R A ÞÑ sup a PA fpaq is lower semi-continuous as well when the power set Powp Xq is endowed with the Hausdorff metric. Proof Let the sequence of subsets An Ď X converge to A Ď X in the Hausdorffdistance. Select a1 P A such that sup a PA fpaq ă fpa1q ε{2. Select also a sequence a1 n P X such that a1 n P An for all n, and a1 n converges to a1. Then for sufficiently large n, fpa1q ă fpa1 nq ε{2 10. 
In this proof, for conciseness, we will write ş fpx, y, x1, y1qdγ instead of ş fpx, y, x1, y1q γpdxˆdyˆdx1ˆdy1q, etc. M emoli, Vose, Williamson by lower semi-continuity of f. Hence for large n, Fp Aq sup a PA fpaq ă fpa1q ε{2 ă fpa1 nq ε ď sup a PAn fpaq ε Fp Anq ε. Hence F is lower semi-continuous. Proof [Lemma 59] Write dis P,P 1p R, γq sup pph,νq,ph1,ν1qq PRˆtγu Ă dis P,P 1ph, h1, νq. Using the above equation, we can think of dis P,P 1 as a composition of two maps: Cp H, H1q ˆ Πpη, η1q Ñ Powp H ˆ H1 ˆ Πpη, η1qq p R, γq ÞÑ R ˆ tγu, Powp H ˆ H1 ˆ Πpη, η1qq Ñ R A ÞÑ sup ph,h1,νq PA Ă dis P,P 1ph, h1, νq. The second map is lower semi-continuous by Lemma 103. The first map is an isometric embedding. Indeed, let dw be a metric on Πpη, η1q that induces the weak topology. The topology on the domain Cp H, H1q ˆ Πpη, η1q is induced by the metric d1pp R1, γ1q, p R2, γ2qq : maxtd Hp R1, R2q, dwpγ1, γ2qu. Meanwhile, unwrapping the metric on the codomain Powp H ˆ H1 ˆ Πpη, η1qq gives us d2p R1 ˆ tγ1u, R2 ˆ tγ2uq sup ph1,h1 1,ν1q PR1ˆtγ1u inf ph2,h1 2,ν2q PR2ˆtγ2u maxtd HˆH1pph1, h1 1qph2, h1 2qq, dwpν1, ν2qu, sup ph2,h1 2,ν2q PR2ˆtγ2u inf ph1,h1 1,ν1q PR1ˆtγ1u maxtd HˆH1pph1, h1 1qph2, h1 2qq, dwpν1, ν2qu sup ph1,h1 1q PR1 inf ph2,h1 2q PR2 d HˆH1pph1, h1 1qph2, h1 2qq, sup ph2,h1 2q PR2 inf ph1,h1 1q PR1 d HˆH1pph1, h1 1qph2, h1 2qq, dwpγ1, γ2q maxtd Hp R1, R2q, dwpγ1, γ2qu, which is the definition of d1pp R1, γ1q, p R2, γ2qq. Hence the composition is lower semicontinuous. Geometry and Stability of Supervised Learning Problems Proof [Theorem 60] Throughout this proof, let Ccp H, H1q Ď Cp H, H1q be the collection of topologically closed correspondences between H and H1. Note that Πpη, η1q is compact under the weak topology, since η and η1 are probability measures on Polish spaces. Compactness of such coupling spaces is proved, for instance, by Sturm (2023, Lemma 1.2), Chowdhury and M emoli (2019, Lemma 2.2) and in a proof by Villani (2003, Prop 2.1). 
Similarly, C_c(H, H') is compact under the Hausdorff distance when H and H' are compact. (Here we are equipping H and H' with the pseudometrics d_{ℓ,η} and d_{ℓ',η'} respectively, introduced in Section 4.5.) As noted by Chowdhury and Mémoli (2018, Section 2), compactness of such correspondence spaces follows from a theorem of Blaschke (Burago et al., 2001, Theorem 7.3.8). Indeed, Blaschke's theorem implies that the space of all closed subsets of H × H' is compact. Since C_c(H, H') is closed in the set of all closed subsets of H × H', being the intersection of the preimages of H and H' under the projection maps onto the first and second coordinates respectively, we get that C_c(H, H') is compact as well. The lower semi-continuity of dis_{P,P'} (Lemma 59) in particular implies that for any correspondence R and coupling γ, dis(R, γ) = dis(cl(R), γ), where cl represents topological closure. Indeed, since R and its closure are distance zero apart in the Hausdorff distance, the constant sequence R converges to cl(R) and vice versa. Hence, when computing the Risk distance, we can restrict our attention to closed correspondences:

d_R(P, P') = inf_{R ∈ C(H,H'), γ ∈ Π(η,η')} dis_{P,P'}(R, γ) = inf_{R ∈ C_c(H,H'), γ ∈ Π(η,η')} dis_{P,P'}(R, γ).

Since dis_{P,P'} is lower semi-continuous, and the latter infimum is over a pair of compact sets, the infimum must be achieved.

Proof [Theorem 65] First consider an arbitrary finite partition Q of Y. Let π : Y → Q be the quotient map. Since (Id_X × π)♯η = η_Q, the map Id_X × π induces a coupling γ_{Id_X × π} between η and η_Q. We also define a correspondence R_π between H and H_Q by pairing each h ∈ H with π ∘ h ∈ H_Q.
Selecting this coupling and correspondence gives us a bound on the Risk distance: d Rp P, PQq ď dis P,PQp Rπ, γId X ˆπq ď sup ph,π hq PRπ ż ˇˇℓphpxq, yq ℓ1pπ hpx1q, y1q ˇˇ γp Id X ˆπqpdx ˆ dy ˆ dx1 ˆ dy1q ż ˇˇℓphpxq, yq ℓ1pπ hpxq, πpyqq ˇˇ ηpdx ˆ dyq ż sup y1Pπ hpxq y2Pπpyq ℓpy1, y2q ℓphpxq, yq ηpdx ˆ dyq ż sup y1Pπ hpxq y2Pπpyq ℓpy1, y2q inf y1Pπ hpxq y2Pπpyq ℓpy1, y2q ηpdx ˆ dyq M emoli, Vose, Williamson ď sup h PH sup px,yq PXˆY sup y1Pπ hpxq y2Pπpyq ℓpy1, y2q inf y1Pπ hpxq y2Pπpyq ď max q,q1PQ sup y1Pq1 y2Pq ℓpy1, y2q inf y1Pq1 y2Pq The bound (6) holds for an arbitrary finite partition Q of Y . Since P is a compact problem, we can pick a metric d Y on Y with respect to which Y is compact. (Recall that we required our response spaces to be Polish, which makes Y metrizable.) In particular, Y ˆ Y is a compact metric space under the product metric, given by d Y ˆY ppy1, y1 1q, py2, y1 2qq : maxtd Y py1, y2q, d Y py1 1, y1 2qu. By compactness, ℓis uniformly continuous with respect to d Y ˆY . Hence we can pick δ ą 0 such that, if y1, y1 1, y2, y1 2 P Y and d Y ˆY ppy1, y1 1q, py2, y1 2qq ă δ, then ˇˇℓpy1, y1 1q ℓpy2, y1 2q ˇˇ ă ε. Again using compactness, we choose a partition Q of Y such that diamd Y pqq ă δ for all q P Q. Then for any q, q1 P Q, y1, y2 P q, y1 1, y1 2 P q1, we have d Y ˆY ppy1, y1 1q, py2, y1 2qq maxtd Y py1, y2q, d Y py1 1, y1 2qu ă δ. Hence ˇˇℓpy1, y1 1q ℓpy2, y1 2q ˇˇ ă ε. Taking a supremum over all choices of q, q1, y1, y2, y1 1, y1 2, combined with inequality (6), gives us the result. Proof [Theorem 69] We will proceed by approximating P and Pn by problems with finite input and output spaces. The process of approximation will proceed in steps depicted as follows: P Restrict to finite H ÝÝÝÝÝÝÝÝÝÝÝÝÑ P 1 Replace Y with partition QY ÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÑ P 1 QY Replace X with partition QX ÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÝÑ P 1 QY ,QX A similar sequence of approximations will be constructed for the empirical problem Pn to produce PQY ,QX,n. 
We then show that PQY ,QX,n converges a.s. to PQY ,QX. To wit, let ε ą 0 and construct new problems as follows. Since H is compact, we can find a finite ε-net H1 Ď H. Define P 1 : p X, Y, η, ℓ, H1q P 1 n : p X, Y, ηn, ℓ, H1q. Geometry and Stability of Supervised Learning Problems By Theorem 65, we can select a partition QY of Y such that the resulting coarsened problem P 1 QY is ε-close to P 1. Furthermore, by Remark 66, we can choose QY in a way that does not depend on the joint law η. We can coarsen P 1 n with respect to the same partition QY , giving us two new problems: P 1 QY : p X, QY , p Id X ˆπQY q7η, ℓQY , H1 QY q P 1 QY ,n : p X, QY , p Id X ˆπQY q7ηn, ℓQY , H1 QY q. Recall that πQY : Y Ñ QY is the canonical quotient map, ℓQY pq, q1q : supy Pq,z Pq1 ℓpq, q1q, and H1 QY : tπQY h|h P H1u. For the third and final round of approximation, notice that A : th 1pqq|h P H1 QY , q P QY u is a finite collection of subsets of X. Let QX be the partition of X generated by A. That is, QX is the partition of X induced by the equivalence relation where x x1 means that hpxq hpx1q for all h P H1 QY . Then define P 1 QY ,QX : p QX, QY , pπQX ˆ πQY q7η, ℓQY , Ć H1 QY q P 1 QY ,QX,n : p QX, QY , pπQX ˆ πQY q7ηn, ℓQY , Ć H1 QY q, where πQX : X Ñ QX is the canonical quotient map, and Ć H1 QY is the collection of functions sending πQXpxq ÞÑ hpxq as h ranges over H1 QY . We apply the triangle inequality to write d Rp P, Pnq ďd Rp P, P 1q (1) d Rp P 1, P 1 QY q (2) d Rp P 1 QY , P 1 QY ,QXq (3) d Rp P 1 QY ,QX, P 1 QY ,QX,nq (4) d Rp P 1 QY ,QX,n, P 1 QY ,nq (5) d Rp P 1 QY ,n, P 1 nq (6) d Rp P 1 n, Pnq. (7) We consider each of these terms individually. (1) Since P and P 1 have the same joint law, we can bound d Rp P, P 1q above by selecting the diagonal coupling. Additionally, since H1 is an ε-net for H, we can select a correspondence R P Cp H, H1q such that dℓ,ηph, h1q ă ε for all ph, h1q P R. 
This produces the bound d Rp P, P 1q ď sup ph,h1q PR ż ˇˇℓphpxq, yq ℓph1pxq, yq ˇˇ ηpdx ˆ dyq sup ph,h1q PR dℓ,ηph, h1q ď ε. (2) We chose QY so that P 1 and P 1 QY would be ε-close under d R, so this term is at most ε. (3) We bound d Rp P 1 QY , P 1 QY ,QXq by selecting the coupling induced by πQXˆId QY : XˆY Ñ QX ˆ Y . We also select the correspondence R between H1 QY and Ć H1 QY that pairs each M emoli, Vose, Williamson h P H1 QY with its counterpart h1 P Ć H1 QY which sends πQXpxq ÞÑ hpxq. Then we obtain the bound d Rp P 1 QY , P 1 QY ,QXq ď sup ph,h1q PR ż ˇˇℓphpxq, qq ℓph1pπQXpxqq, qq ˇˇ pπQX ˆ Id QY q7ηpdx ˆ dqq sup ph,h1q PR ż ˇˇℓphpxq, qq ℓphpxqq, qq ˇˇ pπQX ˆ Id QY q7ηpdx ˆ dqq 0. Alternatively, one could note that P 1 QY is a simulation of P 1 QY ,QX via the maps πQX and Id QY and apply Theorem 28. (4) For brevity, define ν : pπQX ˆ πQY q7η νn : pπQXˆπQY q7ηn We apply the total variation bound from Proposition 44 to get d Rp P 1 QY ,QX, P 1 QY ,QX,nq ď ℓmaxd TVpν, νnq. Note that νn is the empirical measure for ν, which is a measure supported on a finite set. Hence by the law of large numbers, the total variation distance between ν and νn converges to 0 a.s. as n Ñ 8. Hence, with probability 1, for sufficiently large n, we have d Rp P 1 QY ,QX, P 1 QY ,QX,nq ă ε. (5) The same calculation we applied to term (3) shows that d Rp P 1 QY ,QX,n, P 1 QY ,nq 0. (6) Since we chose QY in a way that did not depend on the joint law η, and since P 1 and P 1 n only differ in their joint laws, we get d Rp P 1 n, P 1 QY ,nq ă ε. (7) Consider the (random) function on H ˆ H given by dℓ,ηnph, h1q : ż ˇˇℓhpx, yq ℓh1px, yq ˇˇ ηnpdxˆdyq. Let R be the same as in our calculations for term (1). Then we can write d Rp P 1 n, Pnq ď sup ph,h1q PR ż ˇˇℓhpx, yq ℓh1px, yq ˇˇ ηnpdx ˆ dyq sup ph,h1q PR dℓ,ηnph, h1q. 
Furthermore, since we assumed the integrands |ℓh ℓh1| are a Glivenko-Cantelli class with respect to η, we have that for sufficiently large n, sup ph,h1q PR dℓ,ηnph, h1q ε ď sup ph,h1q PR dℓ,ηnph, h1q dℓ,ηph, h1q ď sup ph,h1q PR ˇˇdℓ,ηnph, h1q dℓ,ηph, h1q ˇˇ ă ε. Here the Glivenko-Cantelli assumption is used in the last inequality. Hence d Rp P 1 n, Pnq ď sup ph,h1q PR dℓ,ηnph, h1q ă 2ε. Geometry and Stability of Supervised Learning Problems Combining our bounds for all seven terms gives us that, with probability 1, for sufficiently large n we have d Rp P, Pnq ă ε ε 0 ε 0 ε 2ε 6ε. Since ε was arbitrary, we conclude that d Rp P, Pnq Ñ 0 with probability 1. A.4 Proofs from Section 6 Proof [Proposition 79] Let p X, d X, µXq and p Y, d Y , µY q be two metric measure spaces. The map Πpp Xq7µX, p Y q7µY q Ñ ΠpµX, µY q γ ÞÑ pπ2,4q7γ is a bijection. Here, π2,4 : X ˆ X ˆ Y ˆ Y Ñ X ˆ Y is the map px, x1, y, y1q ÞÑ px1, y1q. Using this fact, we can write d R,1p PX, PY q inf γPΠpp Xq7µX,p Y q7µY q ρPΠpµX,µY q ż ż ˇˇd Xphpx0q, x1q d Y ph1py0q, y1q ˇˇ γpdx0ˆdx1ˆdy0ˆdy1qρpdhˆdh1q inf γPΠpp Xq7µX,p Y q7µY q ρPΠpµX,µY q ż ż ˇˇd Xpx2, x1q d Y py2, y1q ˇˇ γpdx0ˆdx1ˆdy0ˆdy1qρpdx2ˆdy2q inf γPΠpp Xq7µX,p Y q7µY q ρPΠpµX,µY q ż ż ˇˇd Xpx2, x1q d Y py2, y1q ˇˇ pπ2,4q7γpdx1ˆdy1qρpdx2ˆdy2q inf γPΠpµX,µY q ρPΠpµX,µY q ż ż ˇˇd Xpx2, x1q d Y py2, y1q ˇˇ γpdx1ˆdy1qρpdx2ˆdy2q. as desired. Proof [Theorem 81] First, for any h P H, h1 P H1, we can bound the Wasserstein distance between their loss profiles by d W,1ppℓhq7η, pℓ1 h1q7η1q inf τPΠppℓhq7η,pℓ1 h1q7η1q ż |a b| τpdaˆdbq ď inf γPΠpη,η1q ż |a b| pℓh ˆ ℓ1 h1q7pdaˆdbq inf γPΠpη,η1q ż |ℓhpx, yq ℓ1 h1px1, y1q| γpdxˆdyˆdx1ˆdy1q. M emoli, Vose, Williamson Similarly, we can bound dd W,1 W,p p Lp Pq, Lp P 1qq inf τPΠpp FP q7λ,p FP q7λ1q ˆż dp W,1pµ, νq τpdµˆdνq 1{p ď inf ρPΠpλ,λ1q ˆż dp W,1pµ, νq p FP ˆ FP 1q7ρpdµˆdνq 1{p inf ρPΠpλ,λ1q ˆż dp W,1ppℓhq7η, pℓ1 h1q7η1q ρpdhˆdh1q 1{p . 
Combining these bounds gives dd W,1 W,p p Lp Pq, Lp P 1qq ď inf ρPΠpλ,λ1q ˆż ˆ inf γPΠpη,η1q ż |ℓhpx, yq ℓ1 h1px1, y1q| γpdxˆdyˆdx1ˆdy1q p ρpdhˆdh1q 1{p ˆż ˆż |ℓhpx, yq ℓ1 h1px1, y1q| γpdxˆdyˆdx1ˆdy1q p ρpdhˆdh1q 1{p d R,pp P, P 1q, as desired. Proof [Theorem 82] The support of a probability measure on a Polish space is σ-compact, so without loss of generality, we can assume that Y is σ-compact by replacing Y with the support of the Y -marginal of η. Write Y as an increasing sequence of compact subsets p Ynq8 n 1. For each n, define πn : Y Ñ Yn Y t u by # y y P Yn else . We then define a sequence of weighted problems p Pn, λnq. Define Pn : p X, Yn Y t u, ηn, ℓn, Hnq ηn : p Id X ˆπnq7η ℓnpy, y1q : 1YnˆYnpy, y1qℓpy, y1q Hn : tπn h|h P Hu. Furthermore, if we let πn : H Ñ Hn be given by πn phq : πn h, we can define λn : pπn q7λ. We claim that d R,pp P, Pnq converges to 0. Since ηn p Id X ˆπnq7η, the function Id X ˆπn induces a coupling between η and ηn. Similarly, since λn pπn q7λ, the function πn induces Geometry and Stability of Supervised Learning Problems a coupling between λ and λn. By definition of d R,p, choosing these couplings produces a bound on d R,pp P, Pnq. d R,pp P, Pnq ď ˆż ˆż ˇˇℓphpxq, yq ℓ1pπ hpxq, πpyqq ˇˇ ηpdxˆdyq p λpdhq 1{p ˆż ˆż 1p YnˆYnqcphpxq, yq ℓphpxq, yq ηpdx ˆ dyq p λpdhq 1{p . Here the superscript c represents set complement. Notice that the indicator 1p YnˆYnqcphpxq, yq converges pointwise to 0 as n Ñ 8. We apply the dominated convergence theorem twice to finish the proof. Indeed, by assumption, diampp Pq ˆż ˆż ℓphpxq, yq ηpdx ˆ dyq p λpdhq 1{p ă 8. (7) In particular, (7) tells us that ℓphpxq, yq is integrable as a function of px, yq for λ-a.e. h P H. Since ℓphpxq, yq dominates 1p YnˆYnqcphpxq, yqℓphpxq, yq, the dominated convergence theorem tells us that ż 1p YnˆYnqcphpxq, yqℓphpxq, yq ηpdx ˆ dyq nÑ8 ÝÝÝÑ 0 for λ-a.e. h P H. Additionally, (7) tells us that ˆż ℓphpxq, yq ηpdx ˆ dyq p is an integrable function of h. 
Using the dominated convergence theorem again,
\[
\left( \int \left( \int 1_{(Y_n \times Y_n)^c}(h(x), y)\, \ell(h(x), y) \, \eta(dx{\times}dy) \right)^{\!p} \lambda(dh) \right)^{1/p} \xrightarrow{\; n \to \infty \;} 0,
\]
as desired.

Proof [Proposition 83] Let $\varepsilon > 0$. Let $Q$ be the partition of $Y$ provided by Theorem 65, and let $P_Q = (X, Q, \eta_Q, \ell_Q, H_Q)$ be the coarsened problem. Let $R_\pi \in C(H, H_Q)$ be the correspondence induced by the surjection $\pi^*(h) = \pi \circ h$, and let $\gamma_{\mathrm{Id}_X \times \pi} \in \Pi(\eta, \eta_Q)$ be the coupling induced by the map $\mathrm{Id}_X \times \pi$. Recall the proof of Theorem 65, in which we demonstrated that $d_R(P, P_Q) < \varepsilon$ by showing $\mathrm{dis}_{P, P_Q}(R_\pi, \gamma_{\mathrm{Id}_X \times \pi}) < \varepsilon$.

Define a measure $\lambda_Q = (\pi^*)_\sharp \lambda$ on $H_Q$ and let $\rho_\pi \in \Pi(\lambda, \lambda_Q)$ be the coupling induced by $\pi^*$. Then, since $R_\pi$ is the support of $\rho_\pi$,
\[
d_{R,p}\big((P, \lambda), (P_Q, \lambda_Q)\big)
\le \mathrm{dis}_{P, P_Q, p}(\rho_\pi, \gamma_{\mathrm{Id}_X \times \pi})
= \left( \int \left( \int \big| \ell(h(x), y) - \ell_Q(h'(x'), y') \big| \, \gamma_{\mathrm{Id}_X \times \pi}(dx{\times}dy{\times}dx'{\times}dy') \right)^{\!p} \rho_\pi(dh{\times}dh') \right)^{1/p}
\]
\[
\le \sup_{(h, h') \in R_\pi} \int \big| \ell(h(x), y) - \ell_Q(h'(x'), y') \big| \, \gamma_{\mathrm{Id}_X \times \pi}(dx{\times}dy{\times}dx'{\times}dy')
= \mathrm{dis}_{P, P_Q}(R_\pi, \gamma_{\mathrm{Id}_X \times \pi}) < \varepsilon,
\]
as desired.

Proof [Theorem 91] Proposition 89 provides the bound $d_R(P, P_N) \le d^{s_{\ell,\lambda,p}}_{W,1}(\eta, \eta N)$. One can check that if $\tau$ is a coupling of the Markov kernels $\delta_{\mathrm{Id}_{X \times Y}}$ and $N$, then $\eta \tau \in \Pi(\eta, \eta N)$. Therefore we can write
\[
d^{s_{\ell,\lambda,p}}_{W,1}(\eta, \eta N)
= \inf_{\gamma \in \Pi(\eta, \eta N)} \int_{(X \times Y)^2} s_{\ell,\lambda,p}\big((x, y), (x', y')\big) \, \gamma(dx{\times}dy{\times}dx'{\times}dy')
\]
\[
\le \inf_{\tau \in \Pi(\delta_{\mathrm{Id}_{X \times Y}}, N)} \int_{(X \times Y)^2} s_{\ell,\lambda,p}\big((x, y), (x', y')\big) \, (\eta \tau)(dx{\times}dy{\times}dx'{\times}dy')
\]
\[
= \inf_{\tau \in \Pi(\delta_{\mathrm{Id}_{X \times Y}}, N)} \int \left[ \int_{(X \times Y)^2} s_{\ell,\lambda,p}\big((x', y'), (x'', y'')\big) \, \tau(x, y)(dx'{\times}dy'{\times}dx''{\times}dy'') \right] \eta(dx{\times}dy)
= d^{s_{\ell,\lambda,p}}_{W,1}\big(\delta_{\mathrm{Id}_{X \times Y}}, N\big),
\]
which is the desired bound.

A.5 Proofs from Section 7

Proof [Lemma 95] Let
\[
R := \{(h_1, h_2, h_3) \in H_1 \times H_2 \times H_3 \mid (h_1, h_2) \in R_{1,2},\ (h_2, h_3) \in R_{2,3}\}.
\]
We have the following diagram. [Diagram: projection maps $p_1$, $p_2$, $q_1$, $q_2$.] Every pictured map is a projection map. We need only check that they are inverse connected, meaning that the preimage of any connected set is connected. Let $A \subseteq H_1$ be a connected set. Then $p_1^{-1}(A)$ is connected, and so is $B := p_2(p_1^{-1}(A))$. Similarly, $C := q_2(q_1^{-1}(B))$ is connected.
Noting finally that the preimage of $A$ in $R$ is $A \times B \times C$, we see that it is a product of connected sets and hence connected. The proof for preimages of connected sets in $H_2$ and $H_3$ is similar.

Proof [Theorem 100] For the first inequality, suppose $\hat{f}, \hat{g} : Z \to \mathbb{R}$ are continuous functions inducing the Reeb graphs $\mathrm{Reeb}(P)$ and $\mathrm{Reeb}(P')$ respectively. Then
\[
\| \hat{f} - \hat{g} \|_\infty
\ge \Big| \inf_{z \in Z} \hat{f}(z) - \inf_{z \in Z} \hat{g}(z) \Big|
= \Big| \inf_{h \in H} R_P(h) - \inf_{h' \in H'} R_{P'}(h') \Big|
= |B(P) - B(P')|,
\]
where the equality comes from the fact that the infimum of a function can be determined from the Reeb graph that it induces.

Now suppose $d_{RC}(P, P') < a$. Then we can pick a correspondence $R \in C_{\mathrm{con}}(H, H')$ that witnesses this fact, giving us the following diagram. [Diagram: projections $p_1 : R \to H$ and $p_2 : R \to H'$.] Recall $R_P$ is the risk function associated with problem $P$. We can then induce two Reeb graphs from functions on $R$: one induced by $R_P \circ p_1$ and the other induced by $R_{P'} \circ p_2$. The former Reeb graph is isomorphic to $\mathrm{Reeb}(P)$ since $p_1$ is a continuous surjection with connected fibers, and similarly the latter is isomorphic to $\mathrm{Reeb}(P')$. Hence
\[
d_U\big(\mathrm{Reeb}(P), \mathrm{Reeb}(P')\big)
\le \sup_{(h, h') \in R} \big| R_P \circ p_1(h, h') - R_{P'} \circ p_2(h, h') \big|
= \sup_{(h, h') \in R} \big| R_P(h) - R_{P'}(h') \big|
\]
\[
\le \inf_{\gamma \in \Pi(\eta, \eta')} \sup_{(h, h') \in R} \int \big| \ell_h(x, y) - \ell_{h'}(x', y') \big| \, \gamma(dx{\times}dy{\times}dx'{\times}dy') < a.
\]
Hence $d_U(\mathrm{Reeb}(P), \mathrm{Reeb}(P')) < a$ as well.

Appendix B. Weak Isomorphism with Classification Problems

While the infimum in the definition of the Risk distance makes it difficult to characterize when two problems are distance zero apart in general, the closely related notion of weak isomorphism is simpler to deal with. For instance, we can characterize which problems are weakly isomorphic to classification problems.

Definition 104 Call a problem $P$ rectangular if there exists a finite partition $Y_1, \ldots, Y_m$ of the response space $Y$ and, for any $h \in H$, a finite partition $X_1^h, \ldots, X_n^h$ of the input space $X$ such that
\[
\ell_h(x, y) = \sum_{ij} a_{ij}\, 1_{X_i^h}(x)\, 1_{Y_j}(y) \qquad \eta\text{-a.e.}
\]
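Concretely, evaluating a rectangular loss amounts to a lookup in the matrix $(a_{ij})$ by block membership. A minimal sketch (all names and block choices hypothetical):

```python
def block_index(value, blocks):
    """Index of the partition block containing `value`."""
    for i, block in enumerate(blocks):
        if value in block:
            return i
    raise ValueError("blocks do not cover the space")

def rectangular_loss(a, x_blocks, y_blocks):
    """Loss ell_h(x, y) = sum_ij a[i][j] 1_{X_i}(x) 1_{Y_j}(y):
    look up which rectangle X_i x Y_j the pair (x, y) falls in."""
    return lambda x, y: a[block_index(x, x_blocks)][block_index(y, y_blocks)]

# Toy 0-1 classification loss written as a rectangular loss on two blocks each.
a = [[0.0, 1.0],
     [1.0, 0.0]]
X_blocks = [{-1, -2}, {1, 2}]      # X_1^h, X_2^h for one fixed hypothesis h
Y_blocks = [{"neg"}, {"pos"}]      # Y_1, Y_2
ell_h = rectangular_loss(a, X_blocks, Y_blocks)
print(ell_h(-1, "neg"), ell_h(2, "neg"))  # 0.0 1.0
```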
A problem $P$ is rectangular, for instance, if the loss is a sum of indicators on rectangles. That is, if
\[
\ell(y_1, y_2) = \sum_{ij} a_{ij}\, 1_{Y_i}(y_1)\, 1_{Y_j}(y_2)
\]
for some partition $Y_1, \ldots, Y_n$ of $Y$, then $P$ is rectangular.

Theorem 105 A problem is weakly isomorphic to a classification problem if and only if it is rectangular.

The proof makes use of the notion of a generated partition. Let $S$ be a set and let $T_1, \ldots, T_n$ be a collection of subsets of $S$. Then the partition of $S$ generated by $T_1, \ldots, T_n$ is the partition of $S$ induced by the equivalence relation where $s \sim s'$ means that $s \in T_i \iff s' \in T_i$ for all $1 \le i \le n$. Note that since there are finitely many $T_i$, the generated partition has finitely many blocks as well.

Proof Suppose $P$ is rectangular and let $Q := \{Y_1, \ldots, Y_m\}$, $X_1, \ldots, X_n$, and $(a_{ij})_{ij}$ be as in the definition of a rectangular problem, and let $\pi_Q : Y \to Q$ be the quotient map. If $m \neq n$, pad the partitions appropriately with copies of the empty set and pad the matrix $(a_{ij})_{ij}$ with zeros to make $m = n$. Define a classification problem
\[
P' := \big(X, Q, (\mathrm{Id}_X \times \pi_Q)_\sharp \eta, \ell', H'\big), \qquad
\ell'(Y_i, Y_j) := a_{ij}, \qquad
H' := \{\varphi(h) \mid h \in H\},
\]
where $\varphi(h) : X \to Q$ sends $x$ to $Y_i$ if $x \in X_i^h$. We claim $P'$ is a simulation of $P$ via the maps $\mathrm{Id}_X$ and $\pi_Q$. Indeed, for any $h \in H$, we can calculate that $\ell'_{\varphi(h)}(x, \pi_Q(y)) = a_{ij}$ if $x \in X_i^h$ and $y \in Y_j$. Hence $\ell'_{\varphi(h)}(x, \pi_Q(y)) = \ell_h(x, y)$ for any $h \in H$, showing that $P'$ is a simulation of $P$. In particular, the two are weakly isomorphic.

For the converse implication, we proceed in two parts. First suppose that $P$ is a simulation of a classification problem $P'$, whose response space is $Y' = \{y'_1, \ldots, y'_m\}$. Let the simulation be via the maps $f_1 : X \to X'$ and $f_2 : Y \to Y'$. We will show that $P$ is rectangular. By the definition of simulation, for any $h \in H$, we can select an $h' \in H'$ such that $\ell_h(x, y) = \ell'_{h'}(f_1(x), f_2(y))$ $\eta$-almost everywhere.
Since the function $\ell'$ is defined on the finite set $Y' \times Y'$, we can write it as a finite sum of indicators to get that, $\eta$-almost everywhere,
\[
\ell_h(x, y) = \ell'\big(h' \circ f_1(x), f_2(y)\big)
= \sum_{ij} \ell'(y'_i, y'_j)\, 1_{\{y'_i\}}\big(h' \circ f_1(x)\big)\, 1_{\{y'_j\}}\big(f_2(y)\big)
= \sum_{ij} \ell'(y'_i, y'_j)\, 1_{(h' \circ f_1)^{-1}(y'_i)}(x)\, 1_{f_2^{-1}(y'_j)}(y).
\]
Hence $P$ is rectangular.

Next we prove that if $P$ is rectangular and simulates $P'$, then $P'$ must be rectangular as well. Let the simulation again be via maps $f_1 : X \to X'$ and $f_2 : Y \to Y'$, and let $\{X_i\}_{i=1}^m$, $\{Y_i\}_{i=1}^m$, and $(a_{ij})_{ij}$ be as in the definition of a rectangular problem. Consider the family $\mathcal{Y} := \{f_2[Y_i] \mid 1 \le i \le m\}$ of subsets of $Y'$. The sets in $\mathcal{Y}$ do not necessarily form a partition of $Y'$, so consider the partition $Q_{Y'}$ generated by $\mathcal{Y}$. Now let $h' \in H'$ be arbitrary and, using the definition of a simulation, pick $h \in H$ such that the corresponding diagram commutes. Consider the family $\mathcal{X}^{h'} := \{f_1[X_i^h] \mid 1 \le i \le m\}$. Again, these sets do not necessarily partition $X'$, so let $Q^{h'}_{X'}$ be the partition of $X'$ generated by $\mathcal{X}^{h'}$.

We claim that, if $A \in Q^{h'}_{X'}$ and $B \in Q_{Y'}$, then $\ell'_{h'}$ is constant $\eta'$-almost everywhere on $A \times B$. Indeed, by construction of $Q^{h'}_{X'}$ and $Q_{Y'}$, we have that $f_1[X_i^h] \supseteq A$ and $f_2[Y_j] \supseteq B$ for some $i$ and $j$. Since $\ell_h$ is constant on $X_i^h \times Y_j$ $\eta$-almost everywhere, by commutativity of the diagram we also have that $\ell'_{h'}$ is constant $\eta'$-almost everywhere on $(f_1 \times f_2)[X_i^h \times Y_j] \supseteq A \times B$. Therefore, $\ell'_{h'}$ can be written as a linear combination of indicators on such rectangles:
\[
\ell'_{h'}(x', y') = \sum_{ij} a_{ij}\, 1_{A_i}(x')\, 1_{B_j}(y')
\]
for some $(a_{ij})_{ij}$. Here we have numbered the blocks in the partitions $Q^{h'}_{X'} = \{A_1, \ldots, A_m\}$ and $Q_{Y'} = \{B_1, \ldots, B_n\}$. Hence $P'$ is rectangular.

Now, if $P$ is weakly isomorphic to a classification problem $P'$ via a common simulation $P''$, we have shown that $P''$, as a simulation of the classification problem $P'$, must be rectangular. Furthermore, since $P''$ is rectangular and simulates $P$, we have shown that $P$ must be rectangular as well.
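The generated-partition construction used in the proof above can be computed directly: two elements of $S$ are equivalent exactly when they have the same membership pattern in $T_1, \ldots, T_n$, so it suffices to group elements by that signature. A minimal sketch (example sets hypothetical):

```python
from collections import defaultdict

def generated_partition(S, subsets):
    """Partition of S generated by subsets T_1, ..., T_n: group elements
    of S by the tuple recording which T_i they belong to."""
    blocks = defaultdict(set)
    for s in S:
        signature = tuple(s in T for T in subsets)
        blocks[signature].add(s)
    return list(blocks.values())

S = {1, 2, 3, 4, 5}
T1, T2 = {1, 2, 3}, {2, 3, 4}   # need not be disjoint or cover S
parts = generated_partition(S, [T1, T2])
print(sorted(sorted(b) for b in parts))  # [[1], [2, 3], [4], [5]]
```

Since each signature is one of at most $2^n$ tuples, the generated partition has finitely many blocks, matching the remark before the proof.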
References

Margareta Ackerman. Towards Theoretical Foundations of Clustering. PhD thesis, University of Waterloo, 2012.
Pankaj K. Agarwal, Kyle Fox, Abhinandan Nath, Anastasios Sidiropoulos, and Yusu Wang. Computing the Gromov-Hausdorff distance for metric trees. In Khaled Elbassioni and Kazuhisa Makino, editors, Algorithms and Computation, Lecture Notes in Computer Science, pages 529-540, Berlin, Heidelberg, 2015. Springer. ISBN 978-3-662-48971-0. doi: 10.1007/978-3-662-48971-0_45.
David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21428-21439, 2020.
David Alvarez-Melis and Tommi Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1881-1890, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1214.
Shun-ichi Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer Japan, Tokyo, 2016. ISBN 978-4-431-55977-1. doi: 10.1007/978-4-431-55978-8.
Luigi Ambrosio, Elia Brué, and Daniele Semola. Lectures on Optimal Transport, volume 130 of UNITEXT. Springer International Publishing, Cham, 2021. ISBN 978-3-030-72161-9. doi: 10.1007/978-3-030-72162-6.
Shreya Arya, Arnab Auddy, Ranthony A. Clark, Sunhyuk Lim, Facundo Mémoli, and Daniel Packer. The Gromov-Wasserstein distance between spheres. Foundations of Computational Mathematics, pages 1-56, 2024.
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. H-consistency bounds for surrogate loss minimizers. In Proceedings of the 39th International Conference on Machine Learning, pages 1117-1174. PMLR, June 2022.
Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer.
Information Geometry, volume 64 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. Springer International Publishing, Cham, 2017. ISBN 978-3-319-56477-7. doi: 10.1007/978-3-319-56478-4.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002. ISSN 1533-7928.
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, March 2006. ISSN 0162-1459. doi: 10.1198/016214505000000907.
Ulrich Bauer, Claudia Landi, and Facundo Mémoli. The Reeb graph edit distance is universal. Foundations of Computational Mathematics, 21(5):1441-1464, October 2021. ISSN 1615-3383. doi: 10.1007/s10208-020-09488-3.
Florian Beier, Robert Beinert, and Gabriele Steidl. On a linear Gromov-Wasserstein distance. IEEE Transactions on Image Processing, 31:7292-7305, 2022.
Florian Beier, Robert Beinert, and Gabriele Steidl. Multi-marginal Gromov-Wasserstein transport and barycentres. Information and Inference: A Journal of the IMA, 12(4):2753-2781, 2023.
James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 2nd edition, 1985.
Silvia Biasotti, Daniela Giorgi, Michela Spagnuolo, and Bianca Falcidieno. Reeb graphs for shape analysis and applications. Theoretical Computer Science, 392(1):5-22, February 2008. ISSN 0304-3975. doi: 10.1016/j.tcs.2007.10.018.
Garrett Birkhoff and Erwin Kreyszig. The establishment of functional analysis. Historia Mathematica, 11(3):258-321, 1984.
Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.
David Blackwell. Comparison of experiments. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, pages 93-102.
University of California Press, Berkeley-Los Angeles, Calif., 1951.
Vladimir I. Bogachev. Measure Theory. Springer, 2007. ISBN 978-3-540-34513-8. doi: 10.1007/978-3-540-34514-5.
Charlotte Bunne, David Alvarez-Melis, Andreas Krause, and Stefanie Jegelka. Learning generative models across incomparable spaces. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 851-861. PMLR, 2019.
Dmitri Burago, Yuri Burago, and Sergei Ivanov. A Course in Metric Geometry. American Mathematical Society, 2001.
Gunnar Carlsson and Facundo Mémoli. Classifying clustering schemes. Foundations of Computational Mathematics, 2013. doi: 10.1007/s10208-012-9141-9.
Gunnar E. Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11(Apr):1425-1470, 2010.
Frédéric Chazal and Jian Sun. Gromov-Hausdorff approximation of metric spaces with linear structure, May 2013. arXiv:1305.1172.
Samantha Chen, Sunhyuk Lim, Facundo Mémoli, Zhengchao Wan, and Yusu Wang. Weisfeiler-Lehman meets Gromov-Wasserstein. In International Conference on Machine Learning, pages 3371-3416. PMLR, 2022.
Samir Chowdhury and Facundo Mémoli. Distances and isomorphism between networks: stability and convergence of network invariants. Journal of Applied and Computational Topology, 7(2):243-361, June 2023. ISSN 2367-1734. doi: 10.1007/s41468-022-00105-6.
Samir Chowdhury and Facundo Mémoli. Explicit geodesics in Gromov-Hausdorff space. Electronic Research Announcements, 25:48-59, June 2018. ISSN 1935-9179. doi: 10.3934/era.2018.25.006.
Samir Chowdhury and Facundo Mémoli. The Gromov-Wasserstein distance between networks and stable network invariants. Information and Inference: A Journal of the IMA, 8(4):757-787, December 2019. ISSN 2049-8772. doi: 10.1093/imaiai/iaz026.
Samir Chowdhury and Tom Needham.
Gromov-Wasserstein averaging in a Riemannian framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 842-843, 2020.
Samir Chowdhury, David Miller, and Tom Needham. Quantized Gromov-Wasserstein. In Nuria Oliver, Fernando Pérez-Cruz, Stefan Kramer, Jesse Read, and Jose A. Lozano, editors, Machine Learning and Knowledge Discovery in Databases. Research Track, Lecture Notes in Computer Science, pages 811-827, Cham, 2021. Springer International Publishing. ISBN 978-3-030-86523-8. doi: 10.1007/978-3-030-86523-8_49.
Vincent Cohen-Addad, Varun Kanade, and Frederik Mallmann-Trenn. Clustering redemption beyond the impossibility of Kleinberg's axioms. Advances in Neural Information Processing Systems, 31, 2018.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, volume 26, 2013.
Vin de Silva, Elizabeth Munch, and Amit Patel. Categorified Reeb graphs. Discrete & Computational Geometry, 55(4):854-906, June 2016. ISSN 1432-0444. doi: 10.1007/s00454-016-9763-9.
Julie Delon, Agnès Desolneux, and Antoine Salmona. Gromov-Wasserstein distances between Gaussian distributions. Journal of Applied Probability, 59(4):1178-1198, 2022.
Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, New York, NY, 1996. ISBN 978-1-4612-6877-2. doi: 10.1007/978-1-4612-0711-5.
Oussama Dhifallah and Yue Lu. On the inherent regularization effects of noise injection during training. In International Conference on Machine Learning, pages 2665-2675. PMLR, 2021.
Jean Dieudonné. History of Functional Analysis. North-Holland, 1981.
Théo Dumont, Théo Lacombe, and François-Xavier Vialard. On the existence of Monge maps for the Gromov-Wasserstein problem. Foundations of Computational Mathematics, pages 1-48, 2024.
Herbert Edelsbrunner, John Harer, Ajith Mascarenhas, and Valerio Pascucci. Time-varying Reeb graphs for continuous space-time data. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pages 366-372, New York, NY, USA, June 2004. Association for Computing Machinery. ISBN 978-1-58113-885-6. doi: 10.1145/997817.997872.
David A. Edwards. The structure of superspace. In Nick M. Stavrakas and Keith R. Allen, editors, Studies in Topology, pages 121-133. Academic Press, January 1975. ISBN 978-0-12-663450-1. doi: 10.1016/B978-0-12-663450-1.50017-7.
Francisco Escolano, Edwin R. Hancock, and Silvia Biasotti. Complexity fusion for indexing Reeb digraphs. In Richard Wilson, Edwin Hancock, Adrian Bors, and William Smith, editors, Computer Analysis of Images and Patterns, Lecture Notes in Computer Science, pages 120-127, Berlin, Heidelberg, 2013. Springer. ISBN 978-3-642-40261-6. doi: 10.1007/978-3-642-40261-6_14.
Arnaud Fickinger, Samuel Cohen, Stuart Russell, and Brandon Amos. Cross-domain imitation learning via optimal transport, April 2022. arXiv:2110.03684.
Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936. ISSN 2050-1439. doi: 10.1111/j.1469-1809.1936.tb02137.x.
Peter Gaenssler. Around the Glivenko-Cantelli theorem. In Pál Révész, editor, Limit Theorems of Probability Theory, pages 93-103. North-Holland, 1975. ISBN 0720428343.
Xiaoyin Ge, Issam Safa, Mikhail Belkin, and Yusu Wang. Data skeletonization via Reeb graphs. In Advances in Neural Information Processing Systems, volume 24, 2011.
Mikhael Gromov. Structures Métriques pour les Variétés Riemanniennes. Birkhäuser, 1981.
Mikhael Gromov. Convergence and concentration of metrics and measures. In Metric Structures for Riemannian and Non-Riemannian Spaces, Modern Birkhäuser Classics, pages 113-237. Birkhäuser, Boston, MA, 2007. ISBN 978-0-8176-4583-0.
doi: 10.1007/978-0-8176-4583-0_4.
Peter D. Grünwald and Nishant A. Mehta. Fast rates for general unbounded loss functions: From ERM to generalized Bayes. Journal of Machine Learning Research, 21(56):1-80, 2020. ISSN 1533-7928.
Haibin Hang, Facundo Mémoli, and Washington Mio. A topological study of functional data and Fréchet functions of metric measure spaces. Journal of Applied and Computational Topology, 3(4):359-380, 2019.
Allen Hatcher. Algebraic Topology. Cambridge University Press, 2005.
Felix Hausdorff. Grundzüge der Mengenlehre. Von Veit, 1914. ISBN 978-3-11-098985-4.
Masaki Hilaga, Yoshihisa Shinagawa, Taku Kohmura, and Tosiyasu L. Kunii. Topology matching for fully automatic similarity estimation of 3D shapes. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '01, pages 203-212, New York, NY, USA, August 2001. Association for Computing Machinery. ISBN 978-1-58113-374-5. doi: 10.1145/383259.383282.
Chip Huyen. Designing Machine Learning Systems. O'Reilly Media, Inc., May 2022. ISBN 9781098107963.
Laura Iacovissi, Nan Lu, and Robert C. Williamson. A general framework for learning under corruption: Label noise, attribute noise, and beyond, July 2023. arXiv:2307.08643.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer US, New York, NY, 2021. ISBN 978-1-07-161417-4. doi: 10.1007/978-1-0716-1418-1.
Nigel Kalton and Mikhail Ostrovskii. Distances between Banach spaces. Forum Mathematicum, 11, September 1997. doi: 10.1515/form.11.1.17.
Leonid Kantorovich. On the translocation of masses. Journal of Mathematical Sciences, 133(4):1381-1382, March 2006. ISSN 1573-8795. doi: 10.1007/s10958-006-0049-2.
Jun Kitagawa and Asuka Takatsu. Two new families of metrics via optimal transport and barycenter problems, December 2023. arXiv:2311.15874.
Jon Kleinberg.
An impossibility theorem for clustering. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002.
Achim Klenke. Probability measures on product spaces. In Achim Klenke, editor, Probability Theory: A Comprehensive Course, Universitext, pages 273-293. Springer, London, 2014. ISBN 978-1-4471-5361-0. doi: 10.1007/978-1-4471-5361-0_14.
Aleksandr Kronrod. On functions of two variables. Uspekhi Mat. Nauk, 5:24-134, 1950.
John Langford. Machine learning reductions. 2009. URL https://hunch.net/~jl/projects/reductions/reductions.html.
John Langford and Bianca Zadrozny. Estimating class membership probabilities using classifier learners. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, volume R5 of Proceedings of Machine Learning Research, pages 198-205. PMLR, 06-08 Jan 2005.
Kim G. Larsen and Arne Skou. Bisimulation through probabilistic testing (preliminary report). In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 344-352, 1989.
Khang Le, Dung Q. Le, Huy Nguyen, Dat Do, Tung Pham, and Nhat Ho. Entropic Gromov-Wasserstein between Gaussian distributions. In International Conference on Machine Learning, pages 12164-12203. PMLR, 2022.
Lucien Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, 35(4):1419-1455, 1964. ISSN 0003-4851.
Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, January 1993. ISSN 0893-6080. doi: 10.1016/S0893-6080(05)80131-5.
David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times: Second Edition. American Mathematical Society, 2017.
Renjie Liao.
Notes on Rademacher complexity, May 2020.
Yi Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73-82, June 2004. ISSN 0167-7152. doi: 10.1016/j.spl.2004.03.002.
Saunders Mac Lane. Categories for the Working Mathematician, volume 5. Springer Science & Business Media, 2013.
Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In Proceedings of the 40th International Conference on Machine Learning, pages 23803-23828. PMLR, July 2023.
Ester Mariucci. Le Cam theory on the comparison of statistical models, 2016. arXiv:1605.03301.
Facundo Mémoli. On the use of Gromov-Hausdorff distances for shape comparison. In M. Botsch, R. Pajarola, B. Chen, and M. Zwicker, editors, Eurographics Symposium on Point-Based Graphics. The Eurographics Association, 2007. ISBN 978-3-905673-51-7. doi: 10.2312/SPBG/SPBG07/081-090.
Facundo Mémoli. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417-487, August 2011. ISSN 1615-3375. doi: 10.1007/s10208-011-9093-5.
Facundo Mémoli, Zane Smith, and Zhengchao Wan. The Gromov-Hausdorff distance between ultrametric spaces: its structure and computation. Journal of Computational Geometry, 14(1):78-143, September 2023. doi: 10.20382/jocg.v14i1a4.
Xiao-Li Meng. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2):685-726, 2018.
Aditya Menon, Brendan van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In Proceedings of the 32nd International Conference on Machine Learning, pages 125-134. PMLR, June 2015.
Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise.
Machine Learning, 107(8):1561-1595, September 2018. ISSN 1573-0565. doi: 10.1007/s10994-018-5715-3.
Robin Milner. A Calculus of Communicating Systems. Springer, 1980.
Robin Milner. Communication and Concurrency, volume 84. Prentice Hall, Englewood Cliffs, 1989.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning, second edition. Adaptive Computation and Machine Learning series. MIT Press, 2018. ISBN 978-0-262-35136-2.
Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, volume 26, 2013.
NeurIPS. AI research summit: NeurIPS 2025 receives record-breaking 27,000 paper submissions, 2025. URL https://www.ctol.digital/news/ai-research-summit-neurips-2025-receives-record-breaking-27000-paper-submissions/. Accessed 9 June 2025.
Richard Nock, Zac Cranko, Aditya K. Menon, Lizhen Qu, and Robert C. Williamson. f-GANs in an information geometric nutshell. Advances in Neural Information Processing Systems, 30, 2017.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 271-279, 2016.
Giuseppe Patanè, Michela Spagnuolo, and Bianca Falcidieno. Para-Graph: Graph-based parameterization of triangle meshes with arbitrary genus. Computer Graphics Forum, 23(4):783-797, 2004. ISSN 1467-8659. doi: 10.1111/j.1467-8659.2004.00808.x.
Evan Patterson. Hausdorff and Wasserstein metrics on graphs and other structured data. Information and Inference: A Journal of the IMA, iaaa025, 2020. doi: 10.1093/imaiai/iaaa025.
Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.
Gabriel Peyré, Marco Cuturi, and Justin Solomon.
Gromov-Wasserstein averaging of kernel and distance matrices. In Proceedings of The 33rd International Conference on Machine Learning, pages 2664-2672. PMLR, June 2016.
Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, January 1999. ISSN 0962-4929. doi: 10.1017/S0962492900002919.
Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, editors. Dataset Shift in Machine Learning. Neural Information Processing. The MIT Press, Cambridge, MA, 2009. ISBN 978-0262170055.
Georges Reeb. Sur les points singuliers d'une forme de Pfaff complètement intégrable ou d'une fonction numérique [On the singular points of a completely integrable Pfaff form or of a numerical function]. Comptes Rendus Acad. Sciences Paris, 222:847-849, 1946.
Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(22):731-817, 2011.
Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1-50, 2018. ISSN 1533-7928.
Jonas Rothfuss, Fabio Ferreira, Simon Boehm, Simon Walther, Maxim Ulrich, Tamim Asfour, and Andreas Krause. Noise regularization for conditional density estimation. arXiv preprint arXiv:1907.08982, 2019.
Meyer Scetbon, Gabriel Peyré, and Marco Cuturi. Linear-time Gromov-Wasserstein distances using low rank couplings and costs. In International Conference on Machine Learning, pages 19347-19365. PMLR, 2022.
Felix Schmiedl. Computational aspects of the Gromov-Hausdorff distance and its application in non-rigid shape matching. Discrete & Computational Geometry, 57(4):854-880, June 2017. ISSN 1432-0444. doi: 10.1007/s00454-017-9889-4.
Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré.
The unbalanced Gromov-Wasserstein distance: Conic formulation and relaxation. In Advances in Neural Information Processing Systems, volume 34, pages 8766-8779, 2021.
Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, October 2000. ISSN 0378-3758. doi: 10.1016/S0378-3758(00)00115-4.
Y. Shinagawa, T. L. Kunii, and Y. L. Kergosien. Surface coding based on Morse theory. IEEE Computer Graphics and Applications, 11(5):66-78, September 1991. ISSN 1558-1756. doi: 10.1109/38.90568.
Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, August 2007. ISSN 1432-0940. doi: 10.1007/s00365-006-0662-3.
Karl-Theodor Sturm. On the geometry of metric measure spaces. Acta Mathematica, 196(1):65-131, July 2006. ISSN 1871-2509. doi: 10.1007/s11511-006-0002-8.
Karl-Theodor Sturm. The Space of Spaces: Curvature Bounds and Gradient Flows on the Space of Metric Measure Spaces, volume 290. American Mathematical Society, 2023.
Dilip Mathew Thomas and Vijay Natarajan. Symmetry in scalar field topology. IEEE Transactions on Visualization and Computer Graphics, 17(12):2035-2044, December 2011. ISSN 1941-0506. doi: 10.1109/TVCG.2011.236.
Erik N. Torgersen. Comparison of Statistical Experiments. Cambridge University Press, 1991. ISBN 978-1-107-08722-4.
Lloyd N. Trefethen. Approximation Theory and Approximation Practice, Extended Edition. Society for Industrial and Applied Mathematics, 2019. ISBN 978-1-61197-593-2.
Aad W. van der Vaart and Jon A. Wellner.
Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York, NY, 1996. ISBN 978-1-4757-2547-6. doi: 10.1007/978-1-4757-2545-2.
Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. The Journal of Machine Learning Research, 16(1):1793-1861, January 2015. ISSN 1532-4435.
Titouan Vayer, Rémi Flamary, Nicolas Courty, Romain Tavenard, and Laetitia Chapel. Sliced Gromov-Wasserstein. Advances in Neural Information Processing Systems, 32, 2019.
Titouan Vayer, Laetitia Chapel, Rémi Flamary, Romain Tavenard, and Nicolas Courty. Fused Gromov-Wasserstein distance for structured objects. Algorithms, 13:212, 2020a.
Titouan Vayer, Ievgen Redko, Rémi Flamary, and Nicolas Courty. CO-optimal transport. In Advances in Neural Information Processing Systems, volume 33, pages 17559-17570, 2020b.
Anatoly Vershik. Kantorovich metric: Initial history and little-known applications. Journal of Mathematical Sciences, 133(4):1410-1417, March 2006. ISSN 1573-8795.
Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003. doi: 10.1090/gsm/058.
Cédric Villani. Optimal Transport: Old and New. Springer-Verlag, 2008. doi: 10.1007/978-3-540-71050-9.
Cédric Vincent-Cuaz, Rémi Flamary, Marco Corneli, Titouan Vayer, and Nicolas Courty. Semi-relaxed Gromov-Wasserstein divergence with applications on graphs. arXiv, abs/2110.02753, 2021.
Robert C. Williamson and Zac Cranko. Information processing equalities and the information risk bridge. Journal of Machine Learning Research, 25(103):1-53, 2024. ISSN 1533-7928.
Zoë Wood, Hugues Hoppe, Mathieu Desbrun, and Peter Schröder. Removing excess topology from isosurfaces. ACM Transactions on Graphics, 23(2):190-208, April 2004. ISSN 0730-0301. doi: 10.1145/990002.990007.
Lin Yan, Talha Bin Masood, Raghavendra Sridharamurthy, Farhan Rasheed, Vijay Natarajan, Ingrid Hotz, and Bei Wang. Scalar field comparison with topological descriptors: Properties and applications for scientific visualization. Computer Graphics Forum, 40(3):599-633, 2021. ISSN 1467-8659. doi: 10.1111/cgf.14331.
Eugene Zhang, Konstantin Mischaikow, and Greg Turk. Feature-based surface parameterization and texture mapping. ACM Transactions on Graphics, 24(1):1-27, January 2005. ISSN 0730-0301. doi: 10.1145/1037957.1037958.
Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-85, 2004. ISSN 0090-5364.
Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review, 22(3):177-210, November 2004. ISSN 1573-7462. doi: 10.1007/s10462-004-0751-8.