# learning_conditional_distributions_on_continuous_spaces__08f59b74.pdf

Journal of Machine Learning Research 26 (2025) 1-64 Submitted 6/24; Revised 3/25; Published 6/25

Learning conditional distributions on continuous spaces

Cyril B en ezet cyril.benezet@ensiie.fr Universit e Paris-Saclay, CNRS, Univ Evry, ens IIE Laboratoire de Math ematiques et Mod elisation d Evry, 91037, Evry-Courcouronnes, France

Ziteng Cheng ziteng.cheng@utoronto.ca Financial Technology Thrust The Hong Kong University of Science and Technology (Guangzhou) Guangzhou, 511400, China

Sebastian Jaimungal sebastian.jaimungal@utoronto.ca Department of Statistical Sciences University of Toronto Toronto, ON M5G 1Z5, Canada

Editor: Maxim Raginsky

We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for diﬀerent dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a ﬁxed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal conﬁgurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For eﬃciency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical ﬁndings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally. For reproducibility, our code is available at https://github.com/zcheng-a/LCD_k NN. Keywords: non-parametric statistics, Wasserstein distance, deep learning, Lipschitz continuity

1. Introduction

Learning the conditional distribution is a crucial aspect of many decision-making scenarios. While this learning task is generally challenging, it presents unique complexities when explored in a continuous space setting. Below, we present a classic example (cf. Booth et al. (1992); Pﬂug and Pichler (2016)) that highlights this core challenge.

. Ordered alphabetically. . CB gratefully acknowledges ﬁnancial support from the Agence Nationale de la Recherche (Re LISCo P grant ANR21-CE400001). . The majority of ZC s work was completed during the postdoctoral appointment in the Department of Statistical Sciences at the University of Toronto. ZC gratefully acknowledges ﬁnancial support from the Guangzhou HKUST(GZ) Joint Funding Program (Grant No. 2024A03J0630). . SJ would like acknowledge support from the Natural Sciences and Engineering Research Council of Canada (grants RGPIN-2024-04317, ALLRP 550308-20, and RGPIN-2020-04289).

c 2025 Cyril B en ezet, Ziteng Cheng, and Sebastian Jaimungal.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0924.html.

C. B en ezet, Z. Cheng, and S. Jaimungal

For simplicity, we suppose the following model

where the feature variable X and the noise U are independent Uniform([0, 1]), and Y is the target variable. Upon collecting a ﬁnite number of independent samples D = {(Xm, Ym)}M m=1, we aim to estimate the conditional distribution of Y given X. Throughout, we treat this conditional distribution as a measure-valued function of x, denoted by Px. A naive approach is to ﬁrst form an empirical joint measure

m=1 δ(Xm,Ym),

where δ stands for the Dirac meaasure, and then use the conditional distribution induced from ˆψ as an estimator. As the marginal distribution of X is continuous, with probability 1 (as P(Xm = Xm ) = 0 for all m = m ), we have that1

( δYm, x = Xm for some m, Uniform([0, 1]), otherwise.

Regardless of the sample size M, b Px fails to approximate the true conditional distribution,

Px = Uniform [x, x + 1

2] , x [0, 1].

Despite the well-known convergence of the (joint) empirical measure to the true distribution Dudley (1969); Fournier and Guillin (2015), the resulting conditional distribution often fails to provide an accurate approximation of the true distribution. This discrepancy could be due to the fact that calculating conditional distribution is an inherently unbounded operation. As a remedy, clustering is a widely employed technique. Speciﬁcally, given a query point x in the feature space, we identify samples where Xm is close to x and use the corresponding Ym s to estimate Px. Two prominent methods within the clustering approach are the kernel method and the nearest neighbors method2. Roughly speaking, the kernel method relies primarily on proximity to the query point for selecting Xm s, while the nearest neighbors method focuses on the rank of proximity. Notably, discretizing the feature space (also known as quantization), a straightforward yet often eﬀective strategy, can be seen as a variant of the kernel method with static query points and ﬂat kernels. The problem of estimating conditional distributions can be addressed within the non-parametric regression framework, by employing clustering or resorting to non-parametric least squares, among others. Alternatively, it is feasible to estimate the conditional density function directly: a widelyused method involves estimating the joint and marginal density functions using kernel smoothing and then calculating their ratio. This method shares similarities with the clustering heuristics mentioned earlier. For a more detailed review of these approaches, we refer to Section 1.2. This work draws inspiration from recent advancements in estimating discrete-time stochastic processes using conditional density function estimation (Pﬂug and Pichler (2016)) and quantization methods (Backhoﬀet al. (2022); Acciaio and Hou (2023)). A notable feature of these works is their use of the Wasserstein distance to calculate local errors: the diﬀerence between the true and estimated conditional distributions at a query point x. One could average these local errors across

1. In accordance to the model, we set the conditional distribution to Uniform([0, 1]) at points where it is not welldeﬁned. 2. These should not be confused with similarly named methods used in density function estimation.

Learning conditional distributions on continuous spaces

diﬀerent values of x s to gauge the global error. Employing Wasserstein distances naturally frames the study within the context of weak convergence, thereby enabling discussions in a relatively general setting, although this approach may yield somewhat weaker results in terms of the mode of convergence. Moreover, utilizing a speciﬁc distance rather than the general notion of weak convergence enables a more tangible analysis of the convergence rates and ﬂuctuations. We would like to point out that the advancements made in Backhoﬀet al. (2022); Acciaio and Hou (2023), as well as our analysis in this paper, relies on recent developments concerning the Wasserstein convergence rate of empirical measures under i.i.d. sampling from a static distribution (cf. Fournier and Guillin (2015)).

1.1 Main contributions

First, we introduce some notations to better illustrate the estimators that we study. Let X and Y be multi-dimensional unit cubes, with potentially diﬀerent dimensions, for feature and target spaces. For any integer M 1, any D = {(xm, ym)}M m=1 (X Y)M, and any Borel set A X, we deﬁne a probability measure on Y by

PM m=1 1A(xm) 1 PM m=1 1A(xm)δym, PM m=1 1A(xm) > 0,

λY, otherwise, (1)

where λY is the Lebesgue measure on Y and, for y Y, δy is a Dirac measure with atom at y. In general, one could consider weighting δym s (cf. (Gy orﬁet al., 2002, Section 5), (Biau and Devroye, 2015, Chapter 5)), which may oﬀer additional beneﬁts in speciﬁc applications. As such adjustments are unlikely to aﬀect the convergence rate, however, we use uniform weighting for simplicity. With (random) data D = {(Xm, Ym)}M m=1, we aim to estimate the conditional distribution of Y given X. We view this conditional distribution as a measure-valued function P : X P(Y) and use a subscript for the input argument and write Px. Consider a clustering scheme3 given by the map AD : X 2X. We investigate estimators of the form x 7 ˆµD AD(x). We use b P A to denote said estimator and suppress D from the notation for convenience. In later sections, we consider two kinds of maps AD (i) a ball with ﬁxed radius centered at x, called an r-box and (ii) the k nearest neighbors of x, called k-nearest-neighbor estimator. See Deﬁnitions 5 and 9 for more details. One of our main contribution pertains to analyzing the error

X W Px, b P A x ν(dx), (2)

where W is the 1-Wasserstein distance (see (Villani, 2008, Particular Case 6.2)) and ν P(X) is arbitrary and provides versatility to the evaluation criterion. A canonical choice for ν is the Lebesgue measure on X, denoted by λX. This is particularly relevant in control settings where X represents the state-action space and accurate approximations across various state and action scenarios are crucial for making informed decisions. The form of error above is also foundational in stochastic process estimation under the adapted Wasserstein distance (cf. (Backhoﬀet al., 2022, Lemma 3.1)), making the techniques we develop potentially relevant in other contexts. Under the assumption that P is Lipschitz continuous (Assumption 2) and standard assumptions on the data collection process (Assumption 3), we analyze the convergence rate and ﬂuctuation by bounding

3. In general, the clustering scheme may require information on (x1, . . . , x M). For example, clustering the k-nearestneighbor near a query point x requires to know all xm s.

C. B en ezet, Z. Cheng, and S. Jaimungal

the following two quantities

X W Px, b P A x ν(dx) and Var Z

X W Px, b P A x ν(dx) .

Moreover, by analyzing the above quantities, we gain insights into the optimal choice of the clustering mapping A. For the detail statements of these results, we refer to Theorems 7, 10, 8, and 11. We also refer to Section 2.4 for related comments. To illustrate another aspect of our contribution, we note by design x 7 b P A x is piece-wise constant. This characteristic introduces limitations. Notably, it renders the analysis of performance at the worst-case x elusive. Contrastingly, by building a Lipschitz-continuous parametric estimator P Θ from the raw estimator b P A, in Proposition 13 we demonstrate that an upper bound on the aforementioned expectation allows us to derive a worst-case performance guarantee. Guided by Proposition 13, we explore a novel approach of training a neural network for estimation, by using b P A as training data and incorporating suitably imposed Lipschitz continuity. To be comprehensive, we include in Section 1.2 a review of studies on Lipschitz continuity in neural networks. In Section 3.1, we deﬁne P θ as a neural network that approximates P, where θ represents the network parameters. We train P θ with the objective:

n=1 W b P A Xn, P θ Xn

where ( Xn)N n=1 is a set of randomly selected query points. For implementation purposes, we use the k-nearest-neighbor estimator in the place of b P A (see Deﬁnition 9). To mitigate the computational costs stemming from the nearest neighbors search, we employ the technique of Approximate Nearest Neighbor Search with Random Binary Space Partitioning (ANN-RBSP), as discussed in Section 3.1.1. In Section 3.1.2, we compute W using the Sinkhorn algorithm, incorporating normalization and enforcing sparsity for improved accuracy. To impose a suitable level of local Lipschitz continuity on P θ, in Section 3.1.3, we employ a neural network with a speciﬁc architecture and train the networks using a tailored procedure. The key component of this architecture is the convex potential layer introduced in Meunier et al. (2022). In contrast to most extant literature that imposes Lipschitz continuity on neural networks, our approach does not utilize speciﬁc constraint or regularization of the objective function, but relies on certain self-adjusting mechanism embedded in the training. In Section 3.2, we evaluate the performance of the trained P θ, denoted by P Θ, using three sets of synthetic data in 1D and 3D spaces. Our ﬁndings indicate that P Θ generally outperforms b P A, even though it is initially trained to match b P A. This superior performance persists even when comparing P Θ to diﬀerent b P A using various clustering parameters, without retraining P Θ. Furthermore, despite using the same training parameters, P Θ consistently demonstrates the ability to adapt to a satisfactory level of local Lipschitz continuity across all cases. Moreover, in one of the test cases, we consider a kernel that exhibits a jump discontinuity, and we ﬁnd that P Θ handles this jump case well despite Lipschitz continuity does not hold. Lastly, we provide further motivation of our approach by highlighting some potential applications for P Θ. The ﬁrst application is in model-based policy gradient method in reinforcement learning. We anticipate that the enforced Lipschitz continuity allows us to directly apply the policy gradient update via compositions of P Θ and cost function for more eﬀective optimality searching. The second application of P Θ is in addressing optimisation in risk-averse Markov decision processes, where dynamic programming requires knowledge beyond the conditional expectation of the risk-togo (cf. Chow et al. (2015); Huang and Haskell (2017); Coache et al. (2023); Cheng and Jaimungal (2023)). The study of these applications is left for further research.

Learning conditional distributions on continuous spaces

1.2 Related works

In this section, we will ﬁrst review the clustering approach in estimating conditional distributions, and then proceed to review recent studies on Lipschitz continuity in neural networks.

1.2.1 Estimating conditional distributions via clustering

The problem of estimating conditional distributions is frequently framed as non-parametric regression problems for real-valued functions. For instance, when d Y = 1, estimate the conditional α-quantile of Y given X. Therefore, we begin by reviewing some of the works in non-parametric regression. The kernel method in non-parametric regression traces its origins back to the Nadaraya-Watson estimator (Nadaraya (1964); Watson (1964)), if not earlier. Subsequent improvements have been introduced, such as integral smoothing (Gasser and M uller (1979), also known as the Gasser-M uller estimator), local ﬁtting with polynomials instead of constants (Fan (1992)), and adaptive kernels (Hall et al. (1999)). Another signiﬁcant area of discussion is the choice of kernel bandwidth, as detailed in works like Hardle and Marron (1985); Gijbels and Goderniaux (2004); K ohler et al. (2014). Regarding convergence rates, analyses under various settings can be found in Stone (1982); Hall and Hart (1990); Kohler et al. (2009, 2010), with Stone (1982) being particularly relevant to our study for comparative purposes. According to Stone (1982), if the target function is Lipschitz continuous, with i.i.d. sampling and that the sampling distribution in the feature space has a uniformly positive density, then the optimal rate of the 1-distance between the regression function

and the estimator is of the order M 1 d X+2 . For a more comprehensive review of non-parametric regression using kernel methods, we refer to the books such as Gy orﬁet al. (2002); Ferraty and Vieu (2006); Wasserman (2006) and references therein. Non-parametric regression using nearest neighbors methods originated from classiﬁcation problems (Fix and Hodges (1951)). Early developments in this ﬁeld can be found in Mack (1981); Devroye (1982); Bhattacharya and Gangopadhyay (1990). For a comprehensive introduction to nearest neighbors methods, we refer to Gy orﬁet al. (2002). More recent reference such as Biau and Devroye (2015) oﬀers further detailed exploration of the topic. The nearest neighbor method can be viewed as a variant of the kernel method that adjusts the bandwidth based on the number of local data points a property that has gained signiﬁcant traction. Recently, the application of the nearest neighbor method has expanded into various less standard settings, including handling missing data (Rachdi et al. (2021)), reinforcement learning (Shah and Xie (2010); Giegrich et al. (2024)), and time series forecasting (Mart ınez et al. (2017)). For recent advancements in convergence analysis beyond the classical setting, see Zhao and Lai (2019); Padilla et al. (2020); Ryu and Kim (2022); Demirkayaa et al. (2024). Although the review above mostly focuses on clustering approach, other eﬀective approaches exist, such as non-parametric least square, or more broadly, conditional elicitability (e.g., Gy orﬁ et al. (2002); Muandet et al. (2017); Wainwright (2019); Coache et al. (2023)). Non-parametric least square directly ﬁts the data using a restricted class of functions. At ﬁrst glance, this approach appears distinct from clustering. However, they share some similarities in their heuristics: the rigidity of the ﬁtting function, due to imposed restrictions, allows data points near the query point to aﬀect the estimation, thereby implicitly incorporating elements of clustering. Apart from non-parametric regression, conditional density function estimation is another significant method for estimating conditional distributions. One approach is based on estimating joint and marginal density functions, and then using the ratio of these two to produce an estimator for the conditional density function. A key technique used in this approach is kernel smoothing. Employing a static kernel for smoothing results in a conditional density estimator that shares similar clustering

C. B en ezet, Z. Cheng, and S. Jaimungal

heuristics to those found in the kernel method of non-parametric regression. For a comprehensive overview of conditional density estimation, we refer to reference books such as Scott (2015); Simonoﬀ (1996). For completeness, we also refer to (Muandet et al., 2017, Section 5.1) for a perspective on static density function estimation from the standpoint of reproducing kernel Hilbert space. Further discussions on estimation using adaptive kernels can be found in, for example, Bashtannyk and Hyndman (2001); Lacour (2007); Bertin et al. (2016); Zhao and Tabak (2023). Despite extensive research in non-parametric regression and conditional density function estimation, investigations from the perspective of weak convergence have been relatively limited, only gaining more traction in the past decade. Below, we highlight a few recent studies conducted in the context of estimating discrete-time stochastic processes under adapted Wasserstein distance, as the essence of these studies are relevant to our evaluation criterion (2). Pﬂug and Pichler (2016) explores the problem asymptotically, employing tools from conditional density function estimation with kernel smoothing. Subsequently, Backhoﬀet al. (2022) investigates a similar problem with a hypercube as state space, employing the quantization method. Their approach removes the need to work with density functions. They calculate the convergence rate, by leveraging recent developments in the Wasserstein convergence rate of empirical measures Fournier and Guillin (2015). Moreover, a sub-Gaussian concentration with parameter M 1 is established. The aforementioned results are later extended to Rd in Acciaio and Hou (2023), where a non-uniform grid is used to mitigate assumptions on moment conditions. Most recently, Hou (2024) examines smoothed variations of the estimators proposed in Backhoﬀet al. (2022); Acciaio and Hou (2023). Other developments on estimators constructed from smoothed quantization can be found in ˇSm ıd and Kozm ık (2024). Lastly, regarding the machine learning techniques used in estimating conditional distributions, conditional generative models are particularly relevant. For reference, see Mirza and Osindero (2014); Papamakarios et al. (2017); Vaswani et al. (2017); Fetaya et al. (2020). These models have achieved numerous successes in image generation and natural language processing. We suspect that, due to the relatively discrete (albeit massive) feature spaces in these applications, clustering is implicitly integrated into the training procedure. In continuous spaces, under suitable setting, clustering may also become an embedded part of the training procedure. For example, implementations in Li et al. (2020b); Vuleti c et al. (2024); Hosseini et al. (2024) do not explicitly involve clustering and use training objectives that do not speciﬁcally address the issues highlighted in the motivating example at the beginning of the introduction. Their eﬀectiveness could possibly be attributed to certain regularization embedded within the neural network and training procedures. Nevertheless, research done in continuous spaces that explicitly uses clustering approaches when training conditional generative models holds merit. Such works are relatively scarce. For an example of this limited body of research, we refer to Xu and Acciaio (2022), where the conditional density function estimator from Pﬂug and Pichler (2016) is used to train an adversarial generative network for stochastic process generation.

1.2.2 Lipschitz continuity in neural networks

Recently, there has been increasing interest in understanding and enforcing Lipschitz continuity in neural networks. The primary motivation is to provide a certiﬁable guarantee for classiﬁcation tasks performed by neural networks: it is crucial that minor perturbations in the input object have a limited impact on the classiﬁcation outcome. One strategy involves bounding the Lipschitz constant of a neural network, which can then be incorporated into the training process. For reﬁned upper bounds on the (global) Lipschitz constant, see, for example, Bartlett et al. (2017); Virmaux and Scaman (2018); Tsuzuku et al. (2018); Fazlyab et al. (2019); Xue et al. (2022); Fazlyab et al. (2024). For local bounds, we refer to Jordan and

Learning conditional distributions on continuous spaces

Dimakis (2021); Bhowmick et al. (2021); Shi et al. (2022) and the references therein. We also refer to Zhang et al. (2022) for a study of the Lipschitz property from the viewpoint of boolean functions. Alternatively, designing neural network architectures that inherently ensure desirable Lipschitz constants is another viable strategy. Works in this direction include Meunier et al. (2022); Singla et al. (2022); Wang and Manchester (2023); Araujo et al. (2023). Notably, the layer introduced in Meunier et al. (2022) belongs to the category of residual connection (He et al. (2016)). Below, we review several approaches that enforce Lipschitz constants during neural network training. Tsuzuku et al. (2018); Liu et al. (2022) explore training with a regularized objective function that includes upper bounds on the network s Lipschitz constant. Gouk et al. (2021) frame the training problem into constrained optimization and train with projected gradients descent. Given the speciﬁc structure of the reﬁned bound established in Fazlyab et al. (2019), Pauli et al. (2022) combines training with semi-deﬁnite programming. They develop a version with a regularized objective function and another that enforces the Lipschitz constant exactly. Fazlyab et al. (2024) also investigates training with a regularized objective but considers Lipschitz constants along certain directions. Huang et al. (2021) devises a training procedure that removes components from the weight matrices to achieve smaller local Lipschitz constants. Trockman and Kolter (2021) initially imposes orthogonality on the weight matrices, and subsequently enforces a desirable Lipschitz constant based on that orthogonality. Ensuring desirable Lipschitz constants with tailored architectures, Singla et al. (2022); Wang and Manchester (2023) train the networks directly. Although the architecture proposed in Meunier et al. (2022) theoretically ensures the Lipschitz constant, it requires knowledge of the spectral norm of the weight matrices, which does not admit explicit expression in general. Their training approach combines power iteration for spectral norm approximation with the regularization methods used in Tsuzuku et al. (2018). Finally, we note that due to their speciﬁc application scenarios, these implementations concern relatively stringent robustness requirements and thus necessitate more speciﬁc regularization or constraints. In our setting, it is generally desirable for the neural network to automatically adapt to a suitable level of Lipschitz continuity based on the data, while also avoiding excessive oscillations from over-ﬁtting. The literature directly addressing this perspective is limited (especially in the setting of conditional distribution estimation). We refer to Bai et al. (2021); Bountakas et al. (2023); Cohen et al. (2019) for discussions that could be relevant.

1.3 Organization of the paper

Our main theoretical results are presented in Section 2. Section 3 is dedicated to the training of P Θ. We will outline the key components of our training algorithm and demonstrate its performance on three sets of synthetic data. We will prove the theoretical results in Section 4. Further implementation details and ablation analysis are provided in Section 5. In Section 6, we discuss the weaknesses and potential improvements of our implementation. Appendix B and C respectively contain additional plots and a table that summarizes the conﬁguration of our implementation. Additionally, Appendix D includes a rougher version of the ﬂuctuation results.

Notations and terminologies

Throughout, we adopt the following set of notations and terminologies.

On any normed space (E, ), for all x E and γ > 0, B(x, γ) denotes the closed ball of radius γ around x, namely B(x, γ) = {x E | x x γ}.

For any measurable space (E, E ), P(E) denotes the set of probability distributions on (E, E ). For all x E, δx P(E) denotes the Dirac mass at x.

C. B en ezet, Z. Cheng, and S. Jaimungal

We endow normed spaces (E, ) with their Borel sigma-algebra B(E), and W denotes the 1-Wasserstein distance on P(E).

On X = [0, 1]d, we denote by λX the Lebesgue measure. We say a measure ν P(X) is dominated by Lebesgue measure with a constant C > 0 if ν(A) CλX(A) for all A B([0, 1]d).

The symbol denotes equivalence in the sense of big O notation, indicating that each side dominates the other up to a multiplication of some positive absolute constant. More precisely, an bn means there are ﬁnite constants c, C > 0 such that

c an bn C an, n N.

Similarly, implies that one side is of a lesser or equal, in the sense of big O notation, compared to the other.

2. Theoretical results

In Section 2.1, we ﬁrst formally set up the problem and introduce some technical assumption. We then study in Section 2.2 and 2.3 the convergence and ﬂuctuation of two versions of b P A, namely, the r-box estimator and the k-nearest-neighbor estimator. Related comments are organized in Section 2.4. Moreover, in Section 2.5, we provide a theoretical motivation for the use of P Θ, the Lipschitz-continuous parametric estimator trained from b P A.

For d X, d Y 1 two integers, we consider X := [0, 1]d X and Y := [0, 1]d Y, endowed with their respective sup-norm .

Remark 1 The sup-norm is chosen for simplicity of the theoretical analysis only: as all norms on Rn are equivalent (for any generic n 1), our results are valid, up to diﬀerent multiplicative constants, for any other choice of norm.

We aim to estimate an unknown probabilistic kernel

x 7 Px(dy).

To this end, given an integer-valued sample size M 1, we consider a set of (random) data points D := {(Xm, Ym)}M m=1 associated to P. We also deﬁne the set of projections of the data points onto the feature space as DX := {Xm}M m=1. Throughout this section, we work under the following technical assumptions.

Assumption 2 (Lipschitz continuity of kernel) There exists L 0 such that, for all (x, x ) X2,

W(Px, Px ) L x x .

Assumption 3 The following is true:

Learning conditional distributions on continuous spaces

(i) D is i.i.d.with probability distribution ψ := ξ P, where ξ P(X) and where ξ P P(X Y) is (uniquely, by Caratheodory extension theorem) deﬁned by

(ξ P) (A B) := Z

X 1A(x)Px(B)ξ(dx), A B(X), B B(Y).

(ii) There exists c (0, 1] such that, for all A B(X), ξ(A) c λX(A).

These assumptions allow us to analyze convergence and gain insights into the optimal clustering hyper-parameters without delving into excessive technical details. Assumption 2 is mainly used for determining the convergence rate. If the convergence rate is not of concern, it is possible to establish asymptotic results with less assumptions. We refer to Devroye (1982); Backhoﬀet al. (2022) for relevant results. The conditions placed on ξ in Assumption 3 are fairly standard, though less stringent alternatives are available. For instance, Assumption 3 (i) can be weakened by considering suitable dependence Hall and Hart (1990) or ergodicity in the context of stochastic processes Rudolf and Schweizer (2018). Assumption 3 (ii), implies there is mass almost everywhere and is aligned with the motivation from control settings discussed in the introduction. Assumptions 2 and 3 are not exceedingly stringent and provides a number of insights into the estimation problem. More general settings are left for further research. The estimators discussed in subsequent sections are of the form b P A, as introduced right after (1), for two speciﬁc choices of clustering schemes A constructed with the data D.

Remark 4 In the following study, we assert all the measurability needed for b P A to be well-deﬁned. These measurability can be veriﬁed using standard measure-theoretic tools listed in, for example, (Aliprantis and Border, 2006, Section 4 and 15).

2.2 Results on r-box estimator

The ﬁrst estimator, which we term the r-box estimator, is deﬁned as follows.

Deﬁnition 5 Choose r, a real number, s.t. 0 < r < 1

2. The r-box estimator for P is deﬁned by

ˆP r : X P(Y)

x 7 ˆP r x := ˆµD Br(x),

where, for all x X, Br(x) := B(βr(x), r) and βr(x) := r x (1 r), where r and (1 r) are applied entry-wise.

Remark 6 The set Br(x) is deﬁned such that it is a ball of radius around x whenever x is at least r away from the boundary X (in all of its components), otherwise, we move the point x in whichever components are within r from X to be a distance r away from X. Consequently, for all 0 < r < 1

2 and for all x X, Br(x) has a bona ﬁde radius of r, as the center βr(x) is smaller or equal to r away from X.

For the r-box estimator, we have the following convergence results. The theorem below discusses the convergence rate of the average Wasserstein distance between the unknown kernel evaluated at any point and its estimator, when the radius r is chosen optimally with respect to the data sample M. Section 4.2 is dedicated to its proof.

C. B en ezet, Z. Cheng, and S. Jaimungal

Theorem 7 Under Assumptions 2 and 3, choose r as follows

M 1 d X+2 , d Y = 1, 2

M 1 d X+d Y , d Y 3.

Then, there is a constant C > 0 (which depends only on d X, d Y, L, c), such that, for all probability distribution ν P(X), we have

X W Px, ˆP r x ν(dx) sup x X E h W Px, ˆP r i C

M 1 d X+2 , d Y = 1,

M 1 d X+2 ln(M), d Y = 2,

M 1 d X+d Y , d Y 3.

Next, we bound the associated variance whose proof is postponed to Section 4.3.

Theorem 8 Under Assumptions 3, consider r (0, 1

2]. Let ν P(X) be dominated by λX with a constant C > 0. Then,

X W Px, ˆP r x dν(x) 4d X+1C 2

2.3 Results on k-nearest-neighbor estimator

Here, we focus in the second estimator the k-nearest-neighbor estimator, deﬁned as follows.

Deﬁnition 9 Let k 1 an integer. The k-nearest-neighbor estimator for P is deﬁned by

ˇP k : X P(Y)

x 7 ˇP k x := ˆµD N k,DX(x),

where, for any integer M 1 and any DX XM, N k,DX(x) contains (exactly) k points of DX which are closest to x, namely

N k,DX(x) := n x DX x x is among the k-smallest of ( x x )x DX o ,

Here, in case of a tie when choosing the k-th smallest, we break the tie randomly with uniform probability.

We have the following analogs of the convergence results (Theorems 7 and 8) for the k-nearestneighbor estimator. The proofs are postponed to Section 4.4 and Section 4.5, respectively.

Theorem 10 Under Assumptions 2 and 3, and choosing k as

2 d X+2 , d Y = 1, 2,

d Y d X+d Y , d Y 3,

there is a constant C > 0 (which depends only on d X, d Y, L, c), such that, for all probability distribution ν P(X), we have

X W Px, ˇP k x ν(dx) sup x X E h W Px, ˇP k x i C

M 1 d X+2 , d Y = 1,

M 1 d X+2 ln M, d Y = 2,

M 1 d X+d Y , d Y 3.

Learning conditional distributions on continuous spaces

Theorem 11 Under Assumptions 3, for any ν P(X), we have

X W Px, ˇP k x ν(dx) 1

Moreover, if ν is dominated by λX with a constant C > 0, then

X W Px, ˇP k x ν(dx)

22d X+1C 2M c2k2

M 1 + k M 1

M 1 + k M 1

With k chosen as in Theorem 10, this reduces to

X W Px, ˇP k x ν(dx)

d X+d Y ln(M), 2 d Y d X, M 1, 2 d Y > d X.

2.4 Comments on the convergence rate

This section gathers several comments on the convergence results we have developed in Sections 2.2 and 2.3.

2.4.1 On the convergence rate

We ﬁrst comment on the expectations in Theorem 7 and 10. Sharpness of the bounds. Currently, we cannot establish the sharpness of the convergence rates in Theorems 7 and 10. However, we can compare our results to established results in similar settings. For d Y = 1, we may compare it to the optimal rate of non-parametric regression of a

Lipschitz continuous function. It is shown in Stone (1982) that the optimal rate is M 1 d X+2 , the same as in Theorems 7 and 10 when d Y = 1. For d Y 3, as noted in Backhoﬀet al. (2022), we may compare to the Wasserstein convergence rate of empirical measure in the estimation of a static distribution on Rd X+d Y. We refer to Fournier and Guillin (2015) for the optimal rate, which coincides with those in Theorems 7 and 10. Error components. We discuss the composition of our upper bound on the expected average Wasserstein error by dissecting the proof of Theorem 7 and 10. In the proofs, we decompose the expected average errors into two components: approximation error and estimation error. The approximation error occurs when treating Px as equal to Px when x is close to the query point x, leading to an error of size L x x . The estimation error is associated with the Wasserstein error of empirical measure under i.i.d. sampling (see (21)). From Deﬁnitions 5 and 9, the r-box estimator eﬀectively manages the approximation error but struggles with controlling the estimation error, whereas the k-nearest-neighbor estimator exhibits the opposite behavior. Explicit bounds. We primarily focus on analyzing the convergence rates of the r-box and k-nearest-neighbor estimators as M . Therefore, within the proofs of these results, we track only the rates (and ignore various constant coeﬃcients). If more explicit bounds are preferred, intermediate results such as (23), or (27) could be good starting points for computing them. Additionally, in Section A, we provide a numerical illustration that highlights the impact of the Lipschitz constant L and the choices of r and k on the bounds.

C. B en ezet, Z. Cheng, and S. Jaimungal

2.4.2 On the fluctuation

We next discuss the variances studied in Theorems 8 and 11. In Appendix D, we also include results derived from the Azuma-Hoeﬀding inequality (e.g., (Wainwright, 2019, Corollary 2.20)), though they provide rougher rates. Condition that ν is dominated by λX. In Theorems 8 and 11, we assume that the ν is dominated by λX. This assumption is somewhat necessary. To illustrate, let us examine the nonparametric regression problem under a comparable scenario. We consider a ﬁxed query point. In this context, the central limit theorem for k-nearest-neighbor estimator is well-established, and the normalizing rate is k 1

2 (cf. (Biau and Devroye, 2015, Theorem 14.2)). This suggests that the rate in (5) is sharp. For the r-box estimator, we believe that a supporting example can be constructed where ν is highly concentrated. On the other hand, we conjecture that if ξ ν, the variance could potentially attain the order of M 1. For a pertinent result, we direct the reader to (Backhoﬀet al., 2022, Theorem 1.7). Sharpness of the bounds. Regarding the variance in Theorem 8, it is upper bounded by the commonly observed order of M 1. We believe that this rate is sharp, though we do not have a proof at this time. As for Theorem 11, the variance is subject to a rougher rate when 2 d Y d X. We, however, conjecture that this variance attains the order of M 1 as long as ν is dominated by λX.

2.5 Towards implementation with neural networks

In light of recent practices in machine learning, during the learning of P, we may combine the r-box method or k-nearest-neighbor method into the training of certain parameterized model. To this end we let

P : T X P(Y)

(θ, x) 7 P θ x

be a parameterized model (e.g., a neural network), where T is the parameter space and θ T is the parameter to be optimized over. Given an integer N 1, we may train P θ on a set of query points Q = ( Xn)N n=1 satisfying the assumption below.

Assumption 12 The query points Q = {( Xn)}N n=1 are i.i.d. with uniform distribution over X, and are independent of the data points D = {(Xm, Ym)}M m=1.

We propose the training objectives below

arg min θ T

n=1 W ˆP r Xn, P θ Xn

or arg min θ T

n=1 W ˇP k Xn, P θ Xn

that is, minimize the mean of 1-Wasserstein errors between the parametrized model and the empirical r-box (or k-nearest-neighbour) approximation of the conditional distribution at the location of the random query points. The following proposition together with Theorem 7 or Theorem 10 justiﬁes using the objectives in (6). It is valid for any estimator for P that satisﬁes the bounds in (3) or (4). Centered on appropriate Lipschitz continuity conditions, the proposition oﬀers insights into the worst-case performance guarantees. The proof is deferred to Section 4.6. We also refer to Altekr uger et al. (2023) for a worst-case performance guarantee for conditional generative models, which is contingent upon Lipschitz continuity. For related practical approaches, we refer to, for example, Nguyen et al. (2024) and the references therein. In contrast, similar guarantees for the r-box and k-nearest-neighbor estimators are more elusive due to their inherently piece-wise constant nature.

Learning conditional distributions on continuous spaces

Proposition 13 Suppose Assumptions 2, 3, and 12 hold. Let P of P be an estimator constructed using the data points D only. Consider a training procedure that produces a (random) Θ = Θ(D, Q) satisfying

W( P Θ x , P Θ x ) x x LΘ (7)

for some (random) LΘ > 0. Then,

X W(Px, P Θ x ) dx E

X W(Px, P x) dx + E

P Xn, P Θ Xn

Moreover, with probability 1,

sup x X W Px, P Θ x (d X + 1)

1 d X+1 (L + LΘ)

d X d X+1 Z

X W(Px, P Θ x ) dx 1 d X+1 . (9)

Remark 14 Assuming LΘ L for some (deterministic) L > 0, by (9) and Jensen s inequality, we have

E sup x X W Px, P Θ x (d X + 1)

1 d X+1 (L + L)

d X d X+1 E Z

X W(Px, P Θ x ) dx 1 d X+1 .

This together with (8) provides a worst-case performance guarantee for P Θ.

Remark 15 Proposition 13 along with Remark 14 provides insights into the worst-case performance guarantees, but more analysis is needed. Speciﬁcally, understanding the magnitude of LΘ

and E h 1 N PN n=1 W(P Xn, P Θ Xn) i requires deeper knowledge of the training processes for P Θ, which are currently not well understood in the extant literature. Alternatively, in the hypothetical case where P Θ = P, LΘ would match L as speciﬁed in Assumption 2, and E h 1 N PN n=1 W

P Xn, P Θ Xn

would obey Theorem 7 or 10. However, practical applications must also consider the universal approximation capability of P θ. To the best of our knowledge, research on universal approximation with regularity constraints remains relatively limited. For a somewhat related study, we refer to Hong and Kratsios (2024) who explore the approximation of real-valued functions under Lipschitz continuity constraints.

3. Implementation with neural networks

Let X and Y be equipped with 1. Following the discussion in Section 2.5, we let P θ : X P(Y) be parameterized by a neural network and develop an algorithm that trains P θ based on k-nearestneighbor estimator. The k-nearest-neighbor estimator ˇP k is preferred as ˇP k x consistently outputs k atoms. This regularity greatly facilities implementation. For instance, it enables the use of 3D tensors during Sinkhorn iterations to enhance execution speed (see Section 3.1.2 later). We refer also to the sparsity part of Section 5.2 for another component that necessitates the aforementioned regularity of ˇP k. These components would not be feasible with the r-box estimator ˆP r, as ˆP r x

C. B en ezet, Z. Cheng, and S. Jaimungal

produces an undetermined number of atoms. Furthermore, there is a concern that in some realizations, ˆP r x at certain x may contain too few data points, potentially leading P Θ x to exhibit unrealistic concentration. We next provide some motivation for this implementation. For clarity, we refer to the r-box estimator and the k-nearest-neighbor estimator as raw estimators. Additionally, we refer to P Θ, once trained, as the neural estimator. While raw estimators are adequate for estimating P on their own, they are piece-wise constant in x by design. On the other hand, a neural estimator is continuous in x. This continuity provides a performance guarantee in sup W distance, as outlined in Proposition 13 and the following remark. Moreover, the neural estimator inherently possesses gradient information. As discussed in the introduction, this feature renders the neural estimators useful in downstream contexts where gradient information is important, e.g., when performing model-based reinforcement learning. We construct P θ such that it maps x X to atoms in Y with equal probabilities. For the related universal approximation theorems, we refer to Kratsios (2023); Acciaio et al. (2024). We represent these atoms with a vector with Natom entries denoted by yθ(x) = (yθ 1(x), . . . , yθ Natom(x)) YNatom, where Natom N is chosen by the user. In our implementation, we set Natom = k. To be precise, we construct P θ such that

P θ x = 1 Natom

j=1 δyθ j (x), x N. (10)

This is known as the Lagrangian discretization (see (Peyr e and Cuturi, 2019, Section 9)). In Algorithm 1, we present a high level description of our implementation of training P θ based on the raw k-nearest-neighbor estimator.

Algorithm 1 Deep learning conditional distribution in conjunction with k-NN estimator

Input: data {(Xm, Ym)}M m=1 valued in Rd X Rd Y, neural estimator P θ represented by yθ(x) as elaborated in (10), parameters such as k, Natoms, Nbatch N+, and learning rate ηθ Output: trained parameter Θ for the neural estimator

1: repeat 2: for n = 1, . . . , Nbatch do 3: generate a query point Xn Uniform(X) 4: ﬁnd the k nearest neighbors of Xn from data (Xm)M m=1 and collect accordingly ( Yn,i)k i=1 5: end for 6: compute with Sinkhorn algorithm (Y is equipped with 1)

i=1 δ Yn,i, 1 Natom

j=1 δyθ j ( Xn)

7: update θ θ ηθ θL[θ] 8: until Convergence 9: return Θ = θ

3.1 Overview of key components

In this section, we outline the three key components of our implementation. Each of these components addresses a speciﬁc issue:

Managing the computational cost arising from the nearest neighbors search.

Implementing gradient descent after computing W.

Learning conditional distributions on continuous spaces

Figure 1: An instance of RBSP in [0, 1]2.

The 2D unit box is partitioned into 16 rectangles based on 500 samples from Uniform([0, 1]). Note that the overlap between the bounding rectangles is intentionally maintained. Each partitioning is performed along an axis selected at random, dividing the samples within the pre-partitioned rectangle according to a random ratio drawn from Uniform([0.45, 0.55]). The edge ratio for mandatory bisecting along the longest edge is 5. If this ratio is exceeded, partitioning along the longest edge is enforced. The black dots represent samples within the respective rectangle.

Selecting an appropriate Lipschitz constant for the neural estimator, preferably at a local level.

Further details and ablation analysis on these three components can be found in Section 5.

3.1.1 Approximate Nearest Neighbors Search with Random Binary Space Partitioning (ANNS-RBSP)

Given a query point, performing an exact search for its k-nearest-neighbor requires O(M) operations. While a single search is not overly demanding, executing multiple searches as outlined in Algorithm 1 can result in signiﬁcant computational time, even when leveraging GPU-accelerated parallel computing. To address this, we use ANNS-RBSP as a more cost-eﬀective alternative. Prior to searching, we sort (Xm)M m=1 along each axis and record the order of indices. During the search, the data is divided into smaller subsets by repeatedly applying bisection on these sorted indices, with a random bisecting ratio, on a randomly chosen axis. Furthermore, we apply a restriction that mandates bisection along the longest edge of a rectangle when the edge ratio exceeds certain value (a hyper-parameter of the model). We record the bounding rectangle for each subset created through this partitioning process. Once partitioning is complete, we generate a small batch of query points within each rectangle and identify the k nearest neighbors for each query point within that same rectangle. For a visual representation of ANNS-BSP, we refer to Figure 1. Leveraging the sorted indices, we can reapply this partitioning method during every training episode without much computational cost. We refer to Section 5.1 for additional details. There are similar ideas in the extant literature (cf. Hajebi et al. (2011); Ram and Sinha (2019); Li et al. (2020a)). Given the substantial diﬀerences in our setting, however, we conduct further empirical analysis in Section 5.1 to showcase the advantage of our approach against exact search.

C. B en ezet, Z. Cheng, and S. Jaimungal

3.1.2 Computing W for gradient descent

The following discussion pertains to the computation of (11), with the subsequent gradient descent in consideration. For simplicity, let us focus on the summand and reduce the problem to the following minimization. Let ( y1, . . . , yk) Yk be ﬁxed, we aim to ﬁnd

arg min y Yn W

i=1 δ yi, 1

The criterion in (12) is convex as W is convex in both arguments (cf. (Villani, 2008, Theorem 4.8)). To solve (12), as is standard, we cast it into a discrete optimal transport problem. To do so, ﬁrst introduce the (k n)-cost matrix Cy, where Cy,ij := yi yj 1. As the criterion in (12) has uniform weights on the atoms, we next aim to solve the problem

arg min T [0,1]k n

(i,j) {1,...,k} {1,...,n} Tij Cy,ij

j=1 Tij = 1

k, i = 1, . . . , k and

i=1 Tij = 1

n, j = 1, . . . , n.

Let T y be an optimal transport plan that solves (13) for y ﬁxed. Taking derivative of y 7 ϕy( ) yields

yjϕy(T) T=T y = X

i {1,...,k} T y,ij yj yi yj 1, j = 1, . . . , n. (14)

This gradient is in general not the gradient corresponding to (12), as T y depends on y, while (14) excludes such dependence. Nevertheless, it is still viable to update y using the gradient descent that employs the partial gradient speciﬁed in (14). To justify this update rule, ﬁrst consider y Y satisfying ϕy (T y) ϕy(T y), then observe that

i=1 δ yi, 1

ϕy (T y) ϕy(T y) = W

i=1 δ yi, 1

This inequality is strict if ϕy (T y) < ϕy(T y). We refer to (Peyr e and Cuturi, 2019, Section 9.1) and the reference therein for related discussions. The Sinkhorn algorithm, which adds an entropy regularization, is a widely-used algorithm for approximating the solution to (13). Speciﬁcally, here, it is an iterative scheme that approximately solves the following regularized problem, subject to the constraints in (13),

arg min Tϵ [0,1]k n

i,j {1,...,k} {i,...,n} Tϵ ij Cij + ϵ X

i,j {1,...,k} {i,...,n} Tϵ ij(log Tij 1)

where ϵ > 0 is a hyper-parameter, and should not be confused with the ε used elsewhere. We refer to Section 5.2 for further details. We also refer to (Peyr e and Cuturi, 2019, Section 4) and the reference therein for convergence analysis of the Sinkhorn algorithm. It is well known that the regularization term in (15) is related to the entropy of a discrete random variable. Larger values of ϵ encourages the regularized optimal transport plan to be more diﬀusive. That is, for larger values

Learning conditional distributions on continuous spaces

of ϵ, the mass from each yj is distributed more evenly across all yi s. Performing gradient descent along the direction in (14) tends to pull yj s towards the median of the yi s, as we are equipping Y with the norm 1. Conversely, small values of ϵ often leads to instability, resulting in Na N loss/gradient. To help with these issues, we implement the Sinkhorn algorithm after normalizing the cost matrix. Additionally, we use a large ϵ (e.g., 1) in the ﬁrst few training episodes, then switch to a smaller ϵ (e.g., 0.1) in later episodes. Furthermore, we impose sparsity on the transport plan by manually setting the smaller entries of the transport plan to 0. The speciﬁc detailed conﬁgurations and related ablation analysis are provided in Section 5.2 and Appendix C.

3.1.3 Network structure that induces locally adaptive Lipschitz continuity

As previously discussed, it is desirable for the neural estimator to exhibit certain Lipschitz continuity. In practice, however, determining an appropriate Lipschitz constant for training the neural estimator P θ is challenging, largely because understanding the true Lipschitz continuity of P (if it exists) in a data-driven manner is very challenging. Additionally, the estimate provided in Proposition 13 is probabilistic. Fortunately, a speciﬁc network structure allows the neural estimator, when properly trained, to exhibit locally adaptive Lipschitz continuity. Subsequently, we provide a highlevel overview of this network structure. Further detailed conﬁgurations and ablation analysis are presented in Section 5.3 and Appendix C. Consider a fully connected feed-forward neural network with equal width hidden layers and layer-wise residual connection (He et al. (2016)). Let Nneuron denote the width of the hidden layers. For activation, we use Exponential Linear Unit (ELU) function (Clevert et al. (2016)), denoted by σ. For hidden layers, we employ the convex potential layer introduced in Meunier et al. (2022),

xout = xin W 1 2 WTσ (Wxin + b) . (16)

By (Meunier et al., 2022, Proposition 3), the convex potential layer is 1-Lipschitz continuous in 2 sense. For the input layer, with a slight abuse of notation, we use

xout = N 1 neurondiag(|W| 1 1 1)σ (Wxin + b) , (17)

where |W|1 computes the absolute sum of each row of the weight matrix to form a vector of size Nneuron, the reciprocal and 1 are applied entry-wise, and diag produces a diagonal square matrix based on the input vector. In short, the normalization in (17) is only applied to the rows of W with ℓ1-norm exceeding 1. Consequently, the input layer is 1-Lipschitz continuous in 1 sense. A similar treatment is used for the output layer but without activation,

xout = L d 1 Y diag(|W| 1 1 1) (Wxin + b) . (18)

where L > 0 is a hyper-parameter. The output represents atoms on Y with uniform weight, therefore, no N 1 atom is required here. The spectral norm W 2 in (16), however, does not, in general, have an explicit expression. Following the implementation in Meunier et al. (2022), we approximate each W 2 with power iteration. Power iterations are applied to all hidden layers simultaneously during training. To control the pace of iterations, we combine them with momentum-based updating. We refer to Algorithm 2 for the detailed implementation. Our implementation diﬀers from that in Meunier et al. (2022), as Meunier et al. (2022) controls the frequency of updates but not the momentum. In a similar manner, for input and output layers, instead of calculating the row-wise ℓ1-norm explicitly, we update them with the same momentum used in the hidden layers. Our numerical experiments consistently show that a small momentum value of τ = 10 3 eﬀectively maintains

C. B en ezet, Z. Cheng, and S. Jaimungal

Algorithm 2 Power iteration with momentum for updating W 2 estimate, applied to all convex potential layers simultaneously at every epoch during training

Input: weight matrix W Rd d of a convex potential layer, previous estimate ˆh R and auxiliary vector ˆu Rd, momentum τ (0, 1) Output: updated ˆh and ˆu, in particular, ˆh will be used as a substitute of W 2 in (16)

1: v Wˆu/ Wˆu 2 2: u WTv/ WTv 2 3: h 2/ P

4: ˆh τˆh + (1 τ)h 5: ˆu τˆu + (1 τ)u 6: return ˆh, ˆu

power iteration momentumbased updating

adaptive continuity while maintaining a satisfactory accuracy. The impact of L in (18) and τ in Algorithm 2 is discussed in Section 5.3. During training, due to the nature of our updating schemes, the normalizing constants do not achieve the values required for the layers to be 1-Lipschitz continuous. We hypothesize that this phenomenon leads to a balance that ultimately contributes to adaptive continuity: on one hand, the weights W stretch to ﬁt (or overﬁt) the data, while on the other, normalization through iterative methods prevents the network from excessive oscillation. As shown in Section 5.3.2 and 5.3.3, the L value in (18) and the momentum τ in Algorithm 2 aﬀect the performance signiﬁcantly. For completeness, we also experiment with replacing (16) by fully connected feedforward layers similar to (17), with or without batch normalization (Ioﬀe and Szegedy (2015)) after aﬃne transformation. This alternative, however, failed to produce satisfactory results.

3.2 Experiments with synthetic data

We consider data simulated from three diﬀerent models. The ﬁrst two have d X = d Y = 1, while the third has d X = d Y = 3. Here we no longer restrict Y to be the unit box, however, we still consider X to be a d X-dimensional unit box (not necessarily centered at the origin). In Model 1 and 2, X Uniform([0, 1]). Model 1 is a mixture of two independent Gaussian random variables with mean and variance depending on x,

Y = ξ 0.1 1 + cos(2πX) + 0.12 1 cos(2πX) Z + 0.5 ,

where Z Normal(0, 1) and ξ is a Rademacher random variable independent of Z. For Model 2, we have

Y = 0.5 1[0,1)(X) + 0.5 U,

where U Uniform([0, 1]). The conditional distribution in Model 2 is intentionally designed to be discontinuous in the feature space. This choice was made to evaluate performance in the absence of the Lipschitz continuity stipulated in Assumption 2. Model 3 is also a mixture of two independent Gaussian random variables, constructed by considering X Uniform([ 1

2]3) and treating X as a column vector (i.e., X take values in R3 1),

Y = ζ cos(AX) + 0.1 cos(ΣX)W + (1 ζ) cos(A X) + 0.1 cos(Σ X)W .

Above, the cos functions act on vector/matrix entrywise, A R3 3, and Σx also takes value in R3 3. Each element of Σx is deﬁned as vijx for some vij R1 3. The entries of A and vij are drawn from

Learning conditional distributions on continuous spaces

standard normal in advance and remain ﬁxed throughout the experiment. The matrices A and Σ x are similarly constructed. Furthermore, W and W are independent three-dimensional standard normal r.v.s, while ζ represents the toss of a fair coin, independent of X, W, and W . For the purpose of comparison, two diﬀerent network structures are examined. The ﬁrst, termed Lip Net, is illustrated in Section 3.1. The second, termed Std Net, is a fully connected feedforward network with layer-wise residual connections He et al. (2016), Re LU activation, and batch normalization immediately following each aﬃne transformation, without speciﬁcally targeting Lipschitz continuity. With a hyper-parameter k for the k-nearest-neighbor estimator, which we specify later, each network contains 5 hidden layers with 2k neurons. These networks are trained using the Adam optimizer Kingma and Ba (2017) with a learning rate of 10 3. For Std Net in Model 1 and 2, the learning rate is set to 0.01, as it leads to better performance. Other than the learning rates, Std Net and Lip Net are trained with identical hyper-parameters across all models. We refer to Appendix C for a summary of hyper-parameters involved. We generate 104 samples for Models 1 and 2. Given the convergence rate speciﬁed in Theorem 10, we note that the sample size are considered relatively small. For these two models, we chose k = 100 and utilized neural networks P θ that output atoms of size Natom = k. The choice of k is determined by a rule of thumb. In particular, our considerations include the magnitude of k suggested by Theorem 10 and the computational costs associated with the Sinkhorn iterations discussed in Section 3.1.2. The results under Model 1 and 2 are plotted in Figure 2, 3 and 4. Figure 2 provides a perspective on joint distributions, while Figure 3 and 4 focus on conditional CDFs across diﬀerent x values. Figure 2 suggets that both Std Net and Lip Net adequately recover the joint distribution. The Lip Net s accuracy is, however, notably superior and produces smooth movements of atoms (as seen in the third row of Figure 2). Although further ﬁne-tuning may provide slight improvements in Std Net s performance, Std Net will still not achieve the level of accuracy and smoothness observed in Lip Net. The average absolute value of derivative of each atom (fourth row of Figure 2), makes it evident that Lip Net demonstrates a capacity of automatically adapting to a suitable level of Lipschitz continuity locally. In particular, in Model 2, the atoms of Lip Net respond promptly to jumps while remaining relatively stationary around values of x where the kernel is constant. We emphasize that Lip Net is trained using the same hyper-parameters across Models 1, 2, and 3. Figure 3 shows the estimated conditional distribution at diﬀerent values of x. Figure 3 indicates that the raw k-nearest-neighbor estimator deviates frequently from the actual CDFs. This deviation of the raw k-nearest-neighbor estimator is expected, as it attempts to estimate an unknown CDF with only k = 100 samples given an x. Conversely, the neural estimator, especially the Lip Net, appears to oﬀer extra corrections even if they are trained based on the raw k-nearest-neighbor estimator. This could be attributed to neural estimators implicitly leveraging information beyond the immediate neighborhood. Figure 4 compares the W-distance between each estimator and the true conditional distribution at various values of x, using the following formula (see (Peyr e and Cuturi, 2019, Remark 2.28)),

W(F, G) = Z

F(r) G(r) dr, (19)

where F and G are CDFs. This quantity can be accurately approximated with trapezoidal rule. In Model 1, the neural estimator generally outperforms the raw estimator with k = 100 across most values of x, even though the raw estimator is used for training the neural estimators. Furthermore, Lip Net continues to outperform raw estimators with larger values of k even though Lip Net is trained with a raw estimator with k = 100. In Model 2, Lip Net continues to demonstrate a superior performance, except when compared to the raw estimator with k = 1, 000 at x distant from 0.5,

C. B en ezet, Z. Cheng, and S. Jaimungal

All atoms scattered

Traj. of 20 atoms

Ave. abs. der.

(a) Model 1, Std Net

(b) Model 1, Lip Net

(c) Model 2, Std Net

(d) Model 2, Lip Net

Figure 2: Various estimators under Model 1 and 2, joint distributions.

The ﬁrst row presents the data. Neural networks with diﬀerent structures are trained on the same set of data for comparison. The second row shows scatter plots of atoms at various x values. The third row illustrates the evolution of 20 atoms as x varies. The ﬁnal row presents the average of the derivative of each atom with respect to x, with a notable diﬀerence in the y axis scale.

Learning conditional distributions on continuous spaces

Model 1 Model 2

Figure 3: Various estimators under Model 1 and 2, conditional CDFs.

We compare the conditional CDFs at various values of x, derived from Std Net, Lip Net, the raw k-nearest estimator with k = 100 (also used in the training of Std Net and Lip Net), and the ground truth. The ﬁrst row pertains to data set 1. The second row pertains to data set 2. Subﬁgure titles display the values of x.

Model 1 Model 2

Figure 4: Errors at diﬀerent x s of various estimators under Model 1 and 2.

We compute the W-distance between estimators and the true conditional distribution at diﬀerent x s. Std Net and Lip Net are trained based on the raw k-nearest-neighbor estimator with k = 100.

C. B en ezet, Z. Cheng, and S. Jaimungal

the reason is that, here, the conditional distribution is piece-wise constant in x, which enhances the performance of the raw estimator at larger k values. The aforementioned ﬁndings indicate superior performance by Lip Net. We, however, recognize that improvements are not always guaranteed, as demonstrated in Figures 3 and 4.

Figure 5: Various estimators under Model 3, projections of conditional CDFs.

We compare the projected conditional CDFs at x = (0.12, 0.33, 0.1). The estimations are obtained from Std Net, Lip Net, the raw estimator with k = 300 (also used in training Std Net and Lip Net), and the ground truth. Subﬁgure titles display the vectors used for projection. Note the diﬀerence in the x axis scale.

For Model 3, we generate 106 samples and select k = 300. We train both neural estimators using Adam optimizer with a learning rate of 10 3. Hyperparameters such as L in (18) and τ in Algorithm 2 are consistent with those used for Models 1 and 2. We refer to Table 4 for the detailed conﬁguration. In Figure 5, we visualize the outcomes in Model 3: the conditional CDFs at an arbitrarily chosen x are projected onto various vectors. We observe that the neural estimators considerably outperform the raw k-nearest-neighbor estimator, likely owing due to their implicit use of global information outside of the immediate neighbors during training. For further comparisons, we present additional ﬁgures in Appendix B: Figures 15, 16 and 17 feature the exact same neural estimators as shown in Figure 5, but with the raw k-nearest-neighbor estimators employing diﬀerent k values, k = 1, 000, 3, 000, 10, 000. Raw k-nearest-neighbor estimators with k = 1, 000, 3, 000 are superior to that with k = 300, while at k = 10, 000, the accuracy begins to decline. Upon comparison,

Learning conditional distributions on continuous spaces

Raw k = 1, 000

Raw k = 3, 000

Figure 6: Histogram of 10, 000 projected Wasserstein-1 errors. Each histogram consists of 20 uniformly positioned bins between 0 to 0.1. The errors of diﬀerent estimators are computed with the same set of query points and projection vectors. Errors larger than 0.1 will be placed in the right-most bins. Note Std Net and Lip Net are trained with k = 300.

the neural estimator trained with k = 300 consistently outperforms the raw k-nearest-neighbor estimators for all values of k.

Figure 7: Lip Net under Model 3, projected trajectories of 20 atoms.

We illustrate the projected trajectories of 20 atoms by evaluating Lip Net at 100 evenly allocated points along the straight line that intersects the origin and x = (0.12, 0.33, 0.1), situated within [0, 1]3. The x-axis denotes the speciﬁc points along the line, consistent across all subﬁgures. Subﬁgure titles display the vectors used for projection. Note the diﬀerence in the y axis scale.

For a more comprehensive comparison, we randomly select 10, 000 query points. For each query point, we randomly generate a vector in R3, normalized under 1, and project the atoms produced by the estimators onto said vector. With the same vector, we also compute the corresponding true CDFs of the projected Y given the query point. We then approximately compute the W-distance

C. B en ezet, Z. Cheng, and S. Jaimungal

between the projected distributions via (19). The resulting histograms are shown in Figure 6, which suggests that Lip Net performs best. The rationale for employing this projection approach, rather than directly computing the W-distance between discrete and continuous distributions over R3, is due to the higher cost and lower accuracy of the latter approach (see also the discussion in Section 5.2). While this projection approach provides a cost-eﬀective alternative for performance evaluation, it may not fully capture the diﬀerences between the estimations and ground truth.

Lastly, to demonstrate how atoms, in the neural estimator, move as x varies, Figure 7 shows the projected trajectories along a randomly selected straight line through the origin. The movement of atoms in Lip Net is smooth, consistent with previous observations. Interestingly, the movement of atoms in Std Net isn t excessively oscillatory either, although its continuity is slightly rougher compared to Lip Net. The reader may execute the Jupyter notebook on our github repisitory

https://github.com/zcheng-a/LCD_k NN to explore the projected conditional CDFs and atoms trajectories for diﬀerent x values.

4.1 Auxiliary notations and lemmas

In this section, we will introduce a few technical results that will be used in the subsequent proofs.

We ﬁrst deﬁne

R(m) := sup x X

ℓ=1 Px(dyℓ), m N. (20)

We stipulate that R(0) = 1. By Fournier and Guillin (2015), we have

2 , d Y = 1, m 1

2 ln(m), d Y = 2,

d Y , d Y 3,

for some constant C > 0 depending only on d Y. For comprehension, we also point to Kloeckner (2020); Fournier (2023) for results that are potentially useful in analyzing explicit constant, though it is out of the scope of this paper.

The lemma below pertains to the so-called approximation error, which arises when treating data points Yj with Xj around an query point as though they are generated from the conditional distribution at the query point.

Lemma 16 Under Assumption 2, for any integer J 1 and x, x1, . . . , x J XJ+1, we have

j=1 δyj, Px

j=1 Pxj(dyj) Z

j=1 δyj, Px

j=1 Px(dyj)

Learning conditional distributions on continuous spaces

Proof For x, x1, . . . , x J XJ+1, note that

j=1 δyj, Px

j=1 Pxj(dyj) Z

k=1 δyj, Px

j=1 Px(dyj)

j=1 δyj, Px

j=1 Px(dyj)

j=ℓ Pxj(dyj)

j=1 δyj, Px

j=1 Px(dyj)

j=ℓ+1 Pxj(dyj)

where for the sake of neatness, at ℓ= 1, J, we set

j=1 Px(dyj)

j=1 Pxj(dyj) =

j=1 Pxj(dyj) and

j=1 Px(dyj)

j=J+1 Pxj(dyj) =

j=1 Pxj(dyj).

Regarding the ℓ-th summand, invoking Fubini-Toneli theorem to integrate yℓﬁrst then combining the integrals on outer layers using linearity, we obtain

j=1 δyj, Px

j=1 Px(dyj)

j=ℓ Pxj(dyj)

j=1 δyj, Px

j=1 Pj(dyj)

j=ℓ+1 Pxj(dyj)

j=1 δyj, Px

(Px Pxℓ) (dyℓ)

j=1 d Px(yj)

j=ℓ+1 Pxj(dyj)

sup (yj)j =ℓ YJ 1

j=1 δyj, Px

(Pxℓ Px) (dyℓ)

J W(Pxℓ, Px) L

where in the second last inequality we have invoked Kantorovich-Rubinstein duality (cf. (Villani, 2008, Particular case 5.16)) and the fact that, for all (yj)j =ℓ YJ 1, the map yℓ7 W 1 J PJ j=1 δyj, Px

J -Lipschitz, and where in the last equality, we have used Assumption 2.

We will be using the lemma below, which regards the stochastic dominance between two binomial random variables.

Lemma 17 Let n N and 0 p < p 1. Then, Binomial(n, p ) stochastically dominates Binomial(n, p).

Proof Let U1, . . . , Un i.i.d. Uniform[0, 1] and deﬁne

i=1 1[0,p](Ui), H :=

i=1 1[0,p ](Ui).

C. B en ezet, Z. Cheng, and S. Jaimungal

Clearly, H Binomial(n, p) and H Binomial(n, p ). Moreover, we have H H , and thus P(H > r) P(H > r), which completes the proof.

4.2 Proof of Theorem 7

The proof of Theorem 7 relies the technical lemma below that we state and prove now.

Lemma 18 Let p [0, 1] a real number, and let M 1 and d 1 two integers. We then have

pm(1 p)M mm 1

d ((M + 1)p) 1

d + ((M + 1)p) 1 .

Proof We compute

pm(1 p)M mm 1

d = 1 (M + 1)p

M + 1 m + 1

pm+1(1 p)M m(m + 1)m 1

= 1 (M + 1)p

pm(1 p)M+1 mm(m 1) 1

= 1 (M + 1)p

pm(1 p)M+1 m(m 1)1 1

+ 1 (M + 1)p

pm(1 p)M+1 m(m 1) 1

where we used that m = m 1 + 1 in the last equality. Then, using that (m 1)1 1

d and (m 1) 1

d 1 for all m 2, we continue to obtain

pm(1 p)M mm 1

d 1 (M + 1)p

pm(1 p)M+1 mm1 1

+ 1 (M + 1)p

pm(1 p)M+1 m

pm(1 p)M+1 mm1 1

d + 1 (M + 1)p,

where the second term in the last equality are derived from the binomial formula. Finally, introducing a random variable V with binomial distribution B(M + 1, p), and using Jensen inequality for the concave function R+ x 7 x1 1

d R+, we obtain

pm(1 p)M mm 1

d 1 (M + 1)p E h V 1 1

d i + 1 (M + 1)p

((M + 1)p)1 1

(M + 1)p + 1 (M + 1)p = ((M + 1)p) 1

d + ((M + 1)p) 1 ,

which conclude the proof.

Learning conditional distributions on continuous spaces

We are now ready to prove Theorem 7. Proof [Proof of Theorem 7] For ν P(X), we obviously have

X W(Px, ˆP r x)ν(dx) sup x X E h W(Px, ˆP r x) i ,

we then focus on proving the right hand side inequality in Theorem 7. To this end, we ﬁx x X and, to alleviate the notations, we let B := Br(x) as introduced in Deﬁnition 5. Let NB := PM m=1 1B(Xm). By Deﬁnition 5 and Assumption 3 (i), we have

E h W(Px, ˆP r x) i = E W(Px, ˆµD B) =

m=0 E 1NB=m W(Px, ˆµD B)

1X1,...,Xm B1Xm+1,...,XM BW

l=1 δYl, Px

+ P [X1, . . . , XM B] W(λY, Px)

P [Xm+1, . . . , XM B] E

1X1,...,Xm BW

l=1 δYl, Px

+ ξ(Bc)MR(0)

l=1 δyl, Px

ℓ=1 ψ(dxℓdyℓ) + ξ(Bc)MR(0). (22)

To compute the integral terms, observe that, for ﬁxed m 1, by deﬁnition of R(m) in (20), Lemma 16 and Remark 6, Z

l=1 δyl, Px

ℓ=1 ψ(dxℓdyℓ) = Z

l=1 δyl, Px

l=1 Pxl(dyl)

l=1 δyl, Px

ℓ=1 Px(dyℓ) + L

Bm (R(m) + 2Lr)

ℓ=1 ξ(dxℓ) = (R(m) + 2Lr)ξ(B)m.

This together with (22) implies that, for any x X,

E h W(Px, ˆP r x) i

ξ(Bc)M mξ(B)m(R(m) + 2Lr) + ξ(Bc)MR(0)

ξ(Bc)M mξ(B)m R(m) + ξ(Bc)MR(0). (23)

The remainder of the proof is split into three cases. In order to proceed, we will put together (21), Lemma 18, and (23). Below we only keep track of the rate.

For d Y = 1, we have

E h W(Px, ˆP r x) i 2Lr + (ξ(B)(M + 1)) 1

2 + (ξ(B)(M + 1)) 1 + (1 ξ(B))M

2Lr + c(2r)d X(M + 1) 1

2 + c(2r)d X(M + 1) 1 + e c Mrd X.

C. B en ezet, Z. Cheng, and S. Jaimungal

Controlling the dominating term(s) by setting r r d X

2 , we yield

r M 1 d X+2 and E h W(Px, ˆP r x) i M 1 d X+2 .

For d Y = 2, we have

E h W(Px, ˆP r x) i 2Lr + ln(M) (ξ(B)(M + 1)) 1

2 + (ξ(B)(M + 1)) 1 + (1 ξ(B))M

2Lr + ln(M) c(2r)d X(M + 1) 1

2 + c(2r)d X(M + 1) 1 + e c Mrd X.

Since r ln(M)r d X

2 may not have a closed-form solution, we simply follow the case of d Y = 1 to yield

r M 1 d X+2 and E h W(Px, ˆP r x) i M 1 d X+2 ln M.

For d Y 3, we have

E h W(Px, ˆP r x) i 2Lr + (ξ(B)(M + 1)) 1

d Y + (ξ(B)(M + 1)) 1 + (1 ξ(B))M

2Lr + c(2r)d X(M + 1) 1

d Y + c(2r)d X(M + 1) 1 + e c Mrd X.

By setting r r d X

d Y , we yield

r M 1 d X+d Y and E h W(Px, ˆP r x) i M 1 d X+d Y .

The proof is complete.

4.3 Proof of Theorem 8

Proof [Proof of Theorem 8] We will proceed by using Efron-Stein inequality. Let (X 1, Y 1) be an independent copy of (X1, Y1), and deﬁne D := {(X 1, Y 1), (X2, Y2), . . . , (XM, YM)}. In view of Assumption 3 (i), by the triangle inequality of W, it is suﬃcient to investigate

X W ˆµD Br(x), ˆµD Br(x) dν(x) 2#

Notice that, by deﬁnitions (1), n ˆµD Br(x) = ˆµD Br(x) o n X1 Br(x) o n X 1 Br(x) o .

Additionally, by deﬁnitions (1) again, on the event that n ˆµD Br(x) = ˆµD Br(x) o , we have

W ˆµD Br(x), ˆµD Br(x)

ℓ=2 1Br(x)(Xℓ)

Learning conditional distributions on continuous spaces

The above together with the condition that ν is dominated by λX implies that

X W ˆµD Br(x), ˆµD Br(x) ν(dx) 2#

B(X1,2r) B(X 1,2r) W ˆµD Br(x), ˆµD Br(x) λX(dx)

B(X1,2r) B(X 1,2r)

ℓ=2 1Br(x)(Xℓ)

ℓ=2 1Br(x)(Xℓ)

λX(B(X1, 2r))2

ℓ=2 1Br(x)(Xℓ)

! 1 λX(dx) λX(B(X1, 2r))

4C 2(4r)2d XE

ℓ=2 1Br(x)(Xℓ)

! 2 λX(dx) λX(B(X1, 2r))

where we have used Jensen s inequality and tower property in the last line. In view of Assumption 3 (i), expanding the inner conditional expectation into an integral with respect to regular conditional distribution (cf. (Bogachev, 2007, Section 10)) then invoking Fubini-Tonelli theorem, we yield

ℓ=2 1Br(x)(Xℓ)

! 2 λX(dx) λX(B(X1, 2r))

ℓ=2 1Br(x)(xℓ)

ℓ=2 ξ(dxℓ) λX(dx) λX(B(X1, 2r)). (25)

For the inner integral in (25), by Assumption 3 (ii), we have

ℓ=m+1 1Br(x)(xℓ)

ξ Br(x) ℓ 1 ξr Br(x) M 1 ℓ (1 + ℓ) 2

M(M + 1)ξ Br(x) 2

ξ Br(x) ℓ+2 1 ξ Br(x) M 1 ℓℓ+ 2

M(M + 1)ξ Br(x) 2

ξ Br(x) ℓ 1 ξ Br(x) M+1 ℓ

M(M + 1)ξ Br(x) 2 .

This together with (24), (25) and Assumption 3 (ii) implies

X W ˆµD Br(x), ˆµD Br(x) dν(x) 2#

c2M(M + 1).

C. B en ezet, Z. Cheng, and S. Jaimungal

Invoking Efron-Stein inequality, we conclude the proof.

4.4 Proof of Theorem 10

In order to prove Theorem 10, we ﬁrst establish a few technical lemmas. The following lemma is a ﬁrst step toward ﬁnding the average rate of k-nearest neigbhor method.

Lemma 19 Suppose Assumption 2 and 3. Let R be deﬁned in Section 4.1. Then, for any x X, we have

E h W Px, ˇP k x i R(k) + L

m=1 E h Z(m) x i ,

where Zx (m), m = 1, . . . , M are the order statistics of ( Xm x )M m=1 in ascending order.

Proof We ﬁx x X for the rest of the proof. By Assumption 3, we have

E h W Px, ˇP k x i = M! E

1 X1 x X2 x XM x W

(X Y)M 1 x1 x x2 x x M x W

ℓ=1 ψ(dxℓdyℓ)

XM 1 x1 x x2 x x M x

ℓ=1 Pxℓ(dyℓ)

j=1 ξ(dxℓ).

In view of Lemma 16, replacing Pxℓabove with Px, we have

E h W Px, ˇP k x i

XM 1 x1 x x2 x x M x

ℓ=1 Px(dyℓ)

XM 1 x1 x x2 x x M x d X(xℓ, x)

l=1 δyl, Px

ℓ=1 Px(dyℓ)

XM 1 x1 x x2 x x M x d X(xℓ, x)

j=1 ξ(dxℓ).

In view of R deﬁned above (21) and Zx (m) deﬁned in the statement of this lemma, we conclude the proof.

The next lemma provides an upper bound to Pk m=1 E h Z(m) x i listed in Lemma 19.

Lemma 20 Let Zx (m) be deﬁned as in Lemma 19. Under Assumption 3, for any x X, we have

m=1 E h Zx (m) i 2

Learning conditional distributions on continuous spaces

Proof For any x X, we compute, since Zx (m) [0, 1],

E h Zx (m) i = Z 1

0 P h Zx (m) r i dr = Z 1

1 P h Zx (m) < r i dr,

and we observe that n Zx (m) < r o = {N(x, r) m} with N(x, r) := {1 m M | Xm x < r}. We hence have

E h Zx (m) i = Z 1

0 (1 P [N(x, r) m]) dr.

Since N(x, r) Binomial(M, ξ(B(x, r))) and ξ(B(x, r)) cλX(B(x, r)) c rd X

2d X by Assumption 3

(ii), we obtain that P [N(x, r) m] P [N (x, r) m] with N (x, r) Binomial(M, c rd X

2d X ) due to Lemma 17. This implies

E h Zx (m) i Z 1

1 P N (x, r) m dr = Z 1

0 P N (x, r) < m dr

1 d X +j 1 (1 r)M j dr

Γ(M + 1) Γ(j + 1)Γ(M j + 1)

d X + j)Γ(M j + 1)

d X + M + 1)

1 d X d XΓ( 1

d X + M + 1)

and the proof is over.

We are now in position to prove Theorem 10. Proof [Proof of Theorem 10] By combining Lemma 19 and Lemma 20, noting that the upper bound is constant in x, we have

sup x X E h W Px, ˆµN k(x) i R(k) + L

1 d X d XΓ(M + 1

Below we only investigate the rate of the right hand side of (27) as M , and do not keep track of the constant. We ﬁrst analyze the second term in the right hand side of (27). By Gautschi s inequality (Merkle, 2008, (10.6)), we have

j! = Γ(j + 1

1 d X 1, j {0} N.

1 d X k1+ 1

C. B en ezet, Z. Cheng, and S. Jaimungal

By Gautschi s inequality (Merkle, 2008, (12.2)) again, we have

d X + 1) = Γ(M + 1) Γ(M + 1

d X + 1) M 1

The above implies

sup x X E h W Px, ˆµN k(x) i R(k) + M 1

We will split the remainder of the proof into three cases.

For d Y = 1, by letting k 1

1 d X , we yield

2 d X+2 and sup x X E h W Px, ˆµN k(x) i M 1 d X+2

For d Y = 2, since the explicit solution of k 1

1 d X is elusive, we simply follow the conﬁguration derived in the case of d Y = 1 and yield

2 d X+2 and sup x X E h W Px, ˆµN k(x) i M 1 d X+2 ln M.

For d Y 3, by letting k 1

1 d X , we yield

d Y d X+d Y and sup x X E h W Px, ˆµN k(x) i M 1 d X+d Y .

The proof is complete.

4.5 Proof of Theorem 11

Proof [Proof of Theorem 11] We will proceed by using Efron-Stein inequality. Let (X 1, Y 1) be an independent copy of (X1, Y1), and deﬁne D := {(X 1, Y 1), (X2, Y2), . . . , (XM, YM)}. In view of Assumption 3 (i), by the triangle inequality of W, it is suﬃcient to investigate

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

Note that for W ˆµD N k,DX(x), ˆµD

N k,D X(x) to be positive, the event Ax A x is necessary, where

Ax := n X1 N k,DX(x) o and A x := n X 1 N k,DX(x) o .

W ˆµD N k,DX(x), ˆµD

N k,D X(x) 1

Learning conditional distributions on continuous spaces

It follows that

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

X 1Ax A xν(dx) 2#

X 1Ax A xν(dx) 2

X P [Ax] ν(dx).

where the second inequality is due to the fact that the integral value always fall into in [0, 1], and we have used Fubini-Tonelli theorem and the subadditivity of probability in the third inequality. Regarding P [Ax], by the symmetry stemming from Assumption 3 (i) and the random tie-breaking rule in Deﬁnition 9, we have

P [Ax] = M 1 k 1

Consequently,

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

Invoking Efron-Stein inequality, we conclude the proof of (5). We now assume additionally that ν CλX to prove the second statement. Following from (30), by using the positivity and subadditivity of indicator functions as well as AM GM inequality, we have

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

X 1Axν(dx) 2#

X 1AxλX(dx) 2#

X 1AxλX(dx) 2 > δ

where in the second inequality we have used the condition that ν is dominated by λX, and in the last one the alternative expression of expectation for positive random variables. Let Cubeι X be the set of cubes within X with edge length ι. Since ν is dominated by λX, with probability 1 we have

Ax = n at most (k 1) of Xℓ, ℓ= 2, . . . , M, falls into B X1 x x o ,

A x = n at most (k 1) of Xℓ, ℓ= 2, . . . , M, falls into B X 1 x x o .

It follows that ( M X

m=2 1B(Xm) > k, B Cubeι X with B X1

X 1Axλ(dx) (2ι)d X .

By combining the above and setting δ = (2ι)2d X, we yield

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

m=2 1B(Xm) k M 1, B Cube

1 2 δ 1 d X X

C. B en ezet, Z. Cheng, and S. Jaimungal

In order to proceed, we state and prove a useful technical lemma using the Rademacher complexity technique (cf. (Wainwright, 2019, Section 4)). Below we let Cube X be the set of cubes inside X with edge lengths within [0, 1].

Lemma 21 Let X2, . . . , XM be introduced in Assumption 3 (i). For ε 0,

m=2 1B(Xm) cλX(B) 8

M 1 ε, B Cube X

Proof Let x M = (x M 2 , . . . , x M M) XM 1. To utilize the machinery of Rademacher complexity, we will upper bound the cardinality of the set {1B(x M) : B Cube X}, where 1B applies entrywise. More precisely, 1B(x M) = (1B(x M 2 ), . . . , 1B(x M M)). To start with, we ﬁrst note that for d = 1, . . . , d X, the projected (x M 2,d, . . . , x M M,d) at most separates axis-d into M intervals. Additionally, each element in {1B(x M) : B Cube X} corresponds to selecting two intervals (one for starting and one for ending of the cube) on each axis. Therefore, the cardinality is at most M2d X, i.e., Cube X has polynomial discrimination 2d X. It follows from (Wainwright, 2019, Lemma 4.14 and Theorem 4.10) that, for any ε 0,

sup B Cube X

m=2 1B(Xm) ξ(B)

Finally, in view of Assumption 3 (ii), we conclude the proof of Lemma 21.

In view of (31) and Lemma 21, for δ [0, 1], we consider ε 0 such that

k M 1 = cδ 1 2 2d X 8

Note that this is feasible only if 4d X

M 1 + k M 1

2 1.4 It follows that

m=2 1B(Xm) k M 1, B Cube

1 2 δ 1 2d X X

M 1 + k M 1

cδ 1 2 2d X 8 q

M 1 + k M 1

The above together with (31) implies

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

M 1 + k M 1

M 1 + k M 1

4. We do not include this condition in the statement of Theorem 11, as the bound presented remains valid, albeit vacuous, if this condition is not met.

Learning conditional distributions on continuous spaces

where we have performed a change of variable η = δ 1 2 in the last line. Relating to exponential and normal density functions, we calculate the integral to obtain

X W ˆµD N k,DX(x), ˆµD

N k,D X(x) ν(dx) 2#

2C 2M k2 4d X

M 1 + k M 1

M 1 + k M 1

where we note the right hand side is of O

ing Efron-Stein inequality, we conclude the proof.

4.6 Proof of Proposition 13

Proof [Proof of Proposition 13] By triangle inequality,

X W(Px, P Θ x ) dx

X W(Px, P Θ x )

n=1 W(P Xn, P Xn)

n=1 W(P e Xn, P Θ Xn)

Then, by Assumption 2 and (7),

X W(Px, P Θ x )

In view of Assumption 12, we have

n=1 W(P Xn, P Xn)

n=1 E h E h W(P Xn, P Xn) Xn ii = E Z

X W(Px, P x) dx .

Combining the above, we prove the ﬁrst statement. As for the second statement, consider Q, Q : X P(Y) that are Lipschitz-continuous with constants L, L . Suppose that

W(Qx , Q x ) = sup x X W(Qx, Q x) = δ

for some δ > 0 and x X. This supremum is indeed attainable because X is compact that x 7 W(Qx, Q x) is continuous. Consequently, by triangle inequality and the Lipschitz-continuity, we have

W(Qx, Q x) W(Qx, Q x ) W(Q x, Q x ) 0

W(Qx , Q x ) W(Qx , Qx) W(Q x , Q x) 0

δ (L + L ) x x 0, x X.

C. B en ezet, Z. Cheng, and S. Jaimungal

We may then lower bound R

X W(Qx, Q x) dx with the volume of the cone on right hand side above (note the worst case is when x = (0, 0)),

X W(Qx, Q x) dx Z δ

d X dz = δd X+1

(d X + 1)(L + L )d X .

It follows that

sup x X W(Qx, Q x) (d X + 1)

1 d X+1 (L + L )

d X d X+1 Z

X W(Qx, Q x) dx 1 d X+1 ,

which completes the proof.

5. Implementation details and ablation analysis

In this section, we will provide further implementation details and conduct ablation analysis of the components highlighted in Section 3.1.

5.1 Comparing ANNS-RBSP to exact NNS

Algorithm 3 outlines a single slice of RBSP, which divides an array of x s into two arrays of a random ratio along a random axis. Throughout the training, we execute RBSP 5 times during each training epoch, yielding 25 = 32 parts. Within each part, we then select a small batch of 8 query points, locating the k nearest neighbors for each query point within the same part. In Table 1, we compare the execution times of exact NNS and ANNS-RBSP. ANNS-RBSP oﬀers considerable time savings for M = 106, while exact NNS is more eﬃcient for M = 105 or fewer.

Algorithm 3 Single slice of random binary space partitioning

Input: data DX = (xi)M i=1 [0, 1]d X, arrays of indexes Sd, d = 1, . . . , d X of length M with the j-th entry indicating the position of the j-th smallest value in the d-th dimension of DX, a boolean array B of length M with the i-th entry indicating whether xi is involved in the current slicing, a rectangle R that bounds xi s involved in the current slicing, i.e., R corresponds to B, a parameter redge (1, ) for avoiding thin rectangles, an interval [p, p] (0, 1) for random bisecting ratio Output: two boolean arrays B, B of length M indicating the bisected data, two bounding rectangles R, R that correspond to B, B

1: Randomly pick a dimension d 2: if The edge ratio of R exceeds redge then 3: Replace d with that corresponds to the longest edge 4: end if 5: Rearrange B according to Sd by B B[Sd] 6: Pick out the indexes from Sd involved in ANNS by Sd Sd[ B] 7: Generate p Uniform([p, p]) and round p into p so that p len( Sd) is an integer 8: Bisect Sd in two arrays with length p len( Sd) and (1 p)len( Sd), denoted by Sd and S d 9: Form new bounding rectangles R, R using Sd, S d, DX and the original R (may enforce some overlap here) 10: Initialize two boolean arrays B, B with length M and all entries being False 11: B[ Sd] True, B [ S d] True 12: return B, B , R, R

It s important to note that ANNS-RBSP may introduce additional errors by inaccurately including points that are not within the k nearest neighbors. As elucidated in the proof of Theorem 10, the magnitude of this induced error can be understood by comparing the excessive distance

Learning conditional distributions on continuous spaces

Table 1 Execution times of two NNS methods.

k=300 M = 104 M = 105 M = 106

d X = 1 (0.1, 9.2) (3.7, 11.0) (19.5, 11.2) d X = 3 (0.2, 9.8) (4.2, 11.2) (24.6, 11.3) d X = 10 (0.4, 9.9) (5.9, 11.4) (52.4, 11.6) k=1000 M = 104 M = 105 M = 106

d X = 1 (0.1, 7.2) (3.8, 10.9) (19.6, 11.1) d X = 3 (0.2, 7.2) (4.2, 11.2) (24.7, 11.4) d X = 10 (0.4, 7.4) (5.9, 11.4) (52.8, 11.5)

This table compares the execution times for 500 runs of exact NNS versus ANNS-RBSP, both utilizing parallel computing, facilitated by Py Torch, with an NVIDIA L40 GPU. Each iteration (approximately) ﬁnds the 300 nearest neighbors from M samples for all of 256 randomly generated query points. The values within each parenthesis denote the seconds consumed by both methods, with the ﬁrst number corresponding to exact NNS. For faster processing, exact NNS employs a 3D tensor. ANNS-RBSP regenerates a new partition each run. The table does not include the time required to sort the data along all dimensions, which takes about 0.2 seconds in the worst case and is not repeatedly executed.

incurred to that of exact NNS. For simplicity, we investigate the diﬀerence below

:= 1 Nbatch

ˇX ij Xi 1 1

where Xi s are query points, and ˇXij, ˇX ij are the k-nearest-neighbor identiﬁed by exact NNS and ANNS-RBSP, respectively. In our experiments, we evaluated scenarios with d X = 3, 10 and k = 300. Regarding the data, we generated M = 104, 105, 106 samples from Uniform([0, 1]d X). Once the data set is generated, we ﬁxed the data and conducted 100 simulations of , each with Nbatch = 256 query points. This process was repeated 10 times, each with a separately generated data. The results are illustrated in Figure 8. It is expected that will approach 0 as the sample size M tends to inﬁnity. The convergence rate is likely inﬂuenced by factors such as d X, k, and Nbatch. Further analysis of the convergence of ANNS-RBSP will be conducted in future studies.

5.2 An implementation of the Sinkhorn algorithm

In this section, we will detail our implementation of the Sinkhorn algorithm and highlight a few novel treatments that seem to enhance the training of the neural estimator. While the mechanisms are not yet fully understood, they constitute important improvement in the accuracy of the neural estimator. Let us ﬁrst recall the iterative procedure involved in the Sinkhorn algorithm. We follow the setup in Section 3.1.2. In particular, the row indexes of the cost matrix stand for atoms in the empirical measures, while the column indexes stand for atoms produced by the neural estimator. We set Natom = k and let u(0), v(0) be column vectors of size k with all entries being k 1. We will suppress the dependence on y from the notation. Upon setting

Kϵ := exp C

C. B en ezet, Z. Cheng, and S. Jaimungal

(a) d X = 3

(b) d X = 10

Figure 8: Empirical CDFs of .

We compare the empirical CDFs of . Each line corresponding to a independently generated set of data. Each plot includes 10 empirical CDFs. Note the diﬀerence in the x axis scale.

Learning conditional distributions on continuous spaces

with entry-wise exponential, the Sinkhorn algorithm performs repeatedly

u(ℓ+1) = u(0)

Kϵv(ℓ) and v(ℓ+1) = v(0)

(Kϵ) u(ℓ+1) , (32)

where the division is also calculated entry-wise. After a certain number of iterations, denoted as Niter, we obtain an approximate optimal transport plan for problem (15):

Tϵ = diag(u(Niter))Kϵ diag(v(Niter)).

Let us set ϵ = 1 momentarily. Note that if the entries of C are excessively large, K eﬀectively becomes a zero matrix, which impedes the computations in (32). This issue may occur at the initiation of the neural estimator or during training, possibly due to the use of stochastic gradient descent. To tackle this issue, we employ a rule-of-thump normalization that

Kϵ := exp C

with c := min i max j Cij, (33)

and use Kϵ instead of Kϵ in (32). Regarding the selection of ϵ and the number of iterations, we currently lack a method for adaptively determining these values. Instead, we adjust them manually based on training episodes. This manual adjustment works well for all models discussed in this paper. For more information, please see Appendix C. As alluded in Section 3.1.2, we enforce sparsity on the transport plan to improve the performance of the neural estimator. Let Tϵ be the output of the Sinkhorn algorithm. We construct ˆTϵ and ˇTϵ by setting the row-wise and column-wise maximum of Tϵ to k 1, respectively, and setting the remaining entries to 0. We then use

T ϵ = γˆTϵ + (1 γ)ˇTϵ, (34)

where γ [0, 1] is a hyper-parameter, in gradient descent (14). We observe that ˆTϵ relates each atom in the empirical measure to a single corresponding atom from the neural estimator, and ˇTϵ

does the same in reverse. The optimal choice of γ remains an open question, though we have set γ = 0.5 in all three models. Next, we explore the impact of enforcing sparsity and varying the choices of γ. Figure 9 compares the performance in Model 1 under diﬀerent sparsity parameters. When no sparsity is enforced, the neural estimator tend to handles singularities more adeptly, but may overlooks points located on the periphery of the empirical joint distribution, potentially resulting in overly concentrated atoms from the neural estimator (see around x = 0.1, 0.9). Compare Figure 4 and 10 for the extra error due to the lack of enforced sparsity. This phenomenon is more noticeable in Model 3. We refer to panel (2,3) of Figure 18 in Appendix B for an example. Moreover, Figure 19, which is obtained without enforced sparsity, indicates a downgrade in accuracy when compared to Figure 6. For completeness, we present in Table 2 the accuracy of Lip Net across various ϵ values, both with and without enforced sparsity. As demonstrated in the table, the improvement resulting from enforced sparsity is evident. It is worth noting that smaller values of ϵ generally require more iterations to achieve convergence (see, e.g., (Peyr e and Cuturi, 2019, Section 4.2)). Additionally, other hyper-parameters may also signiﬁcantly inﬂuence the training outcomes. Further exploration of these related issues will be conducted in future studies. Finally, it is not recommended to use T ϵ at the early stages of training, as our empirical experiments suggest this could deteriorates performance. In training, we start by not enforcing sparsity and then begin to enforce it in later episodes. We refer to Appendix C for further details of the training conﬁguration.

C. B en ezet, Z. Cheng, and S. Jaimungal

(a) No sparsity enforced

(c) γ = 0.5

Figure 9: Lip Net under Model 1 with diﬀerent sparsity enforcement.

Figure 10: Errors at diﬀerent x s of various estimators under Model 1, Lip Net is trained without enforced sparsity.

We compute the W-distance between estimators and the true conditional distribution at diﬀerent x s. The setting is similar to Figure 4, but Lip Net is trained without enforcing sparsity on the transport plan. The errors of Lip Net at around x = 0.1, 0.9 are slightly higher than those in Figure 4.

Learning conditional distributions on continuous spaces

Table 2 Projected Wasserstein error under diﬀerent Sinkhorn parameters.

Lip Net without enforced sparsity ϵ 1 mean 25%-quantile median 75%-quantile number of Na N steps 10 5.18 2.18 3.99 7.34 0 30 1.43 0.83 1.21 1.78 0 100 1.50 0.96 1.31 1.73 1217 300 9.15 4.97 8.62 12.30 9500 Lip Net with enforced sparsity ϵ 1 mean 25%-quantile median 75%-quantile number of Na N steps 10 1.42 0.91 1.20 1.60 0 30 1.24 0.79 1.07 1.42 0 100 1.52 0.97 1.32 1.76 0 300 5.00 4.96 5.01 5.05 0 Raw estimator k mean 25%-quantile median 75%-quantile 300 2.69 1.62 2.22 3.23 1000 1.57 0.95 1.30 1.91 3000 1.41 0.88 1.21 1.70 10000 2.38 1.45 2.02 2.97

This table displays the evaluation error of Lip Net across various values of ϵ, both with and without enforced sparsity. For comparison, the error of the raw k-nearest-neighbor estimator is also included. The mean and quantiles are reported at a scale of 10 2. Boldfaced numbers indicate the best results, while underlined numbers denote the runner-up. Note that ϵ is only applied after the ﬁrst 500 training episodes. For additional details on the conﬁguration, please refer to Table 4. The evaluation is performed by computing the projected Wasserstein distance, as outlined in Section 3.2.

5.3 More on Lip Net

We will investigates the impact of various hyper-parameters on the performance of Lip Net. The Lip Nets presented in this section are trained with the same hyper-parameters as in Section 3.2 (see also Appendix C), expect for those speciﬁed otherwise.

5.3.1 Activation function

Switching the activation function from ELU to Rectiﬁed Linear Unit (Re LU) appears to retain the adaptive continuity property. In Figure 11, we illustrate the joint distribution and the average absolute derivatives of all atoms of Lip Net with Re LU activation. The outcomes are on par with those achieved using ELU activation as shown in Figure 2.

5.3.2 Value of L in (18)

Note that the Lip Nets discussed in Section 3.2 were trained with L = 0.1. If the normalizing constants in Lip Net are exactly computed, L reﬂects the Lipschitz constant of Lip Net, upto the discrepancy in the choice of norms in diﬀerent layers. The eﬀect of L in our implementation, however, is rather obscure. Figure 12 showcases the performance of Lip Nets across various L values in Model 1. The comparison in Model 2 is presented in Figure 20 in Appendix B. The best choice of L appears to depend on the ground truth model. For Model 3, we compared the performance of L = 0.1 and L = 1 and observed no signiﬁcant diﬀerences. Generally, we prefer a smaller L;

C. B en ezet, Z. Cheng, and S. Jaimungal

(a) Model 1, all atoms

(b) Model 2, ave. abs. der.

(c) Model 2, all atoms

(d) Model 2, ave. abs. der.

Figure 11: Lip Net under Model 1 with Re LU activation.

however, smaller values of L tend to exhibit greater sensitivity to other training parameters. For instance, in Model 3, with L = 0.1, starting enforcing sparsity too soon leads to signiﬁcantly poorer performance, while the impact on the outcomes for L = 1 is much less noticeable.

All atoms scattered

Traj. of 20 atoms

(a) L = 0.01

(b) L = 0.03

Figure 12: Lip Net under Model 1 with various L s.

5.3.3 Momentum τ in Algorithm 2

In our training of Lip Net, we use τ = 10 3. Figure 13 demonstrates the impact of various τ values on neural estimator s performance in Model 1. It is clear that the performance declines with a τ that is too large. While we initially speculated that a smaller τ might cause atoms to exhibit more erratic movements as x changes, observations contradict this hypothesis. We now believe that a suitable τ value helps prevent neurons from stagnating in the plateau region of the ELU activation function. This is supported by the outcomes observed with τ = 10 6, where atom movements are overly simplistic. Additional comparisons in Model 2 are presented in Figure 21. Despite considering as a potential improvement the inclusion of batch normalization in the convex potential layer (16), right after the aﬃne transformation, along with a corresponding oﬀset in the position of W 2, our experiments with both ELU and Re LU activations, using the default

Learning conditional distributions on continuous spaces

All atoms scattered

Traj. of 20 atoms

(a) τ = 10 1

(b) τ = 10 2

(c) τ = 10 5

(d) τ = 10 6

Figure 13: Lip Net under Model 1 with various τ s.

batch normalization momentum of 0.1, resulted in reduced performance. Lowering said batch normalization momentum often leads to a Na N network.

6. Weakness and potential improvement

In this section, we provide some discussion on the weakness and possible improvement of our implementation in Section 3. Extra correction. In more realistic scenarios, the true conditional distribution is often unknown or intractable. In such cases, it is unclear whether a neural estimator oﬀers extra correction over raw estimators. A potential solution to this issue is to train Std Net and Lip Net simultaneously. If Std Net and Lip Net align more closely with each other than with the raw estimator involved in their training, it is possible that the neural estimators are providing extra corrections. Hyper-parameters for Sinkhorn algorithm. Our implementation of the Sinkhorn algorithm involves several hyper-parameters: (i) k in Deﬁnition 9 ; (ii) Natom in (10); (iii) ϵ in (33); (iv) γ in (34); and (v) additional hyper-parameters listed in Section C. The impact of these hyper-parameters is not yet fully understood. Additionally, an adaptive ϵ that balances the accuracy and stability of the Sinkhorn iteration is desirable. Furthermore, as illustrated in Section 5.2, enforcing sparsity on the transport plan generally yields better approximations at x where the conditional distribution is more diﬀusive, but may performs worse where the conditional distribution exhibits atoms. This observation motivates further investigation into a sparsity policy that adjusts according to the indications from the raw estimator. Adaptive continuity. The impact of hyper-parameters in Lip Net also warrants further investigation. In addition, despite the results presented in this study, more evidence is needed to understand how Lip Net and its variations perform under various conditions. Scalability. While the implementation produces satisfactory results when M and k are relatively small (recall that we set Natom = k), our further experiments indicate a scalability bottleneck. For example, in Model 1, signiﬁcantly increasing M and k does not necessarily improve the performance of neural estimators in a comparable manner. For completeness, we provide a report of the computational time consumption under various settings in Section C. To address this issue, we

C. B en ezet, Z. Cheng, and S. Jaimungal

could experiment with varying the ratios between Natoms and k, rather than setting them equal, in hopes of reducing the strain on the Sinkhorn algorithm. We note that varying the ratio between Natoms and k requires adjusting the enforced sparsity accordingly. Another issue relates to the dimensions of X and Y. In view of the curse of dimensionality in Theorem 10, our method is inherently suited for low-dimensional settings. Fortunately, in many practical scenarios, the data exhibits low-dimensional structures, such as: (i) the sampling distribution of X concentrating on a low-dimensional manifold; and (ii) the mapping x 7 Px exhibiting low-dimensional dependence. For (i), we might resort to dimension reduction techniques, although an extension of the results in Section 2 has yet to be established. For (ii), a data-driven method that eﬀectively leverages the low-dimensional dependence is of signiﬁcant interest. Conditional generative models. Utilizing a conditional generative model could potentially lead to further improvements. One advantage of conditional generative models is the ease of incorporating various training objectives. For instance, it can easily adapt to the training objectives in (6) to accommodate multiple diﬀerent hyper-parameters simultaneously. We may also incorporate the joint empirical measure in the training process. This ﬂexibility also allows for the integration of speciﬁc tail conditions as needed. Lastly, we would like to point out an issue observed in our preliminary experiments when utilizing a na ıve conditional generative model: it may assign excessive probability mass to the blank region between two distinct clusters (for example, in Model 1 around (x, y) = (0.1, 0.5)). This possibly stems from the inherent continuity of neural networks. One possible solution is to consider using a mixture of multiple conditional generative models.

Learning conditional distributions on continuous spaces

Appendix A. Numerical ﬁnite-sample bounds

In this section, we present numerical computations of ﬁnite-sample bounds for the r-box and knearest-neighbor estimators in the case where d X = d Y = 3 and C = c = 1, i.e., both the sampling distribution of X and ν are uniform. Through this analysis, we also aim to explore the impact of the Lipschitz constant L and the hyperparameters r and k, as introduced in Sections 2.2 and 2.3. To begin, we note that an explicit expression for R(m), as deﬁned in (21), can be found in (Kloeckner, 2020, Theorem 2.1). Additionally, we introduce a parameter σ to represent the level of uncertainty in Px. This uncertainty parameter is motivated by the observation that R(m) is derived as a bound across all probability measures on [0, 1]d Y. In scenarios where, for each x X, Px is

supported on an cube (possibly dependent on x) of diameter σ

1 d Y (0, 1), we can replace R(m) in (23) and (27) with σR(m). A similar approach applies to other cases, such as small moment conditions. The precise deﬁnition of the uncertainty parameter σ and its data-driven estimation will be explored in future work. Here, we use σ to investigate how the magnitude of uncertainty aﬀects the bounds and the corresponding optimal conﬁgurations. For the r-box estimator, we utilize (23). In the current setting, ξ(B) in (23) simpliﬁes to (2r)d X. Consequently, the bound becomes

E h W(Px, ˆP r x) i 2Lr + σ

1 (2r)d X M m (2r)d Xm R(m) + σ 1 (2r)d X M R(0).

The second term on the right-hand side can be approximated using Monte Carlo sampling from a binomial distribution with n = M and p = (2r)d X. The Monte Carlo approximation is performed with 1, 000 samples. For the k-nearest-neighbor estimator, we combine (27) with (28) and (29) to obtain the following upper bound

E h W Px, ˆµN k(x) i σR(k) + 2L

Table 3 below reports the optimal values and conﬁgurations of r and k based on these bounds, with diﬀerent L and σ. For visualization, these bounds are also plotted in Figure 14.

Table 3 Finite sample bounds with d X = d Y = 3, C = c = 1, M = 106.

σ = 0.1 r-box k-NN L min. bound (10 2) r min. bound (10 2) k

0.1 3.79 0.095 4.49 3548 0.3 6.57 0.055 7.57 707 1 11.00 0.005 11.90 10 σ = 1 r-box k-NN L min. bound (10 2) r min. bound (10 2) k

0.1 12.00 0.3 14.53 112201 0.3 20.78 0.175 24.96 22387 1 37.94 0.095 44.85 3548

C. B en ezet, Z. Cheng, and S. Jaimungal

Figure 14: Finite sample bounds with d X = d Y = 3, C = c = 1, M = 106. Note that the error is capped at 1, as the ﬁnite sample bounds are calculated for unit boxes. For the minimal values, we refer to Table 3.

Learning conditional distributions on continuous spaces

Appendix B. Additional plots

In this section, we present additional plots to further support and complement the discussions provided earlier.

Figure 15: Various estimators under Model 3, projections of conditional CDFs, k = 1000 for k-NN estimator.

This follows the setting of Figure 5, expect that we set k = 1000 for the k-NN estimator plotted here. The neural estimator is trained under the same setting as that in Figure 5 Plot titles display the vectors used for projection. Note the diﬀerence in the x axis scale.

Appendix C. Additional Implementation Details

Table 4 below summarizes hyper-parameters of the neural networks. Additionally, we report the time consumption in Table 5 under various settings. All timings were obtained from a machine equipped with an Nvidia L40 GPU. It is important to note that the computational time does not fully reﬂect the curse of dimensionality, as the ﬂexibility in choosing the hyper-parameter k allows for adjustments in practice. The appropriate selection of k, as discussed in Section A, is a nuanced issue and needs to be carefully re-evaluated based on the speciﬁc ﬁeld information at hand.

C. B en ezet, Z. Cheng, and S. Jaimungal

Figure 16: Various estimators under Model 3, projections of conditional CDFs, k = 3000 for k-NN estimator.

This follows the setting of Figure 5, expect that we set k = 3000 for the k-NN estimator plotted here. The neural estimator is trained under the same setting as that in Figure 5 Plot titles display the vectors used for projection. Note the diﬀerence in the x axis scale.

Learning conditional distributions on continuous spaces

Figure 17: Various estimators under Model 3, projections of conditional CDFs, k = 104 for k-NN estimator.

This follows the setting of Figure 5, expect that we set k = 104 for the k-NN estimator plotted here. The neural estimator is trained under the same setting as that in Figure 5 Plot titles display the vectors used for projection. Note the diﬀerence in the x axis scale.

C. B en ezet, Z. Cheng, and S. Jaimungal

Figure 18: Various estimators under Model 3, projections of conditional CDFs, both Std Net and Lipnet are trained without enforced sparsity.

This follows the setting of Figure 5, expect that we do not enforce sparsity on the transport plan during the training of the neural estimator. Plot titles display the vectors used for projection. Note the diﬀerence in the x axis scale.

Raw k = 1000

Raw k = 3, 000

Figure 19: Histogram of 10, 000 projected Wasserstein-1 errors, without enforced sparsity on transport plan. This follows the setting of Figure 6, expect that we do not enforce sparsity on the transport plan during the training of the neural estimator. Histograms for raw estimators remain the same. The histogram consists of 20 uniformly positioned bins between 0 to 0.1. The errors of diﬀerent estimators are computed with the same set of query points and projection vectors. Errors larger than 0.1 will be placed in the right-most bins.

Learning conditional distributions on continuous spaces

All atoms scattered

Traj. of 20 atoms

(a) L = 0.01

(b) L = 0.03

Figure 20: Lip Net under Model 2 with various L s.

All atoms scattered

Traj. of 20 atoms

(a) τ = 10 1

(b) τ = 10 2

(c) τ = 10 5

(d) τ = 10 6

Figure 21: Lip Net under Model 2 with various τ s.

C. B en ezet, Z. Cheng, and S. Jaimungal

Table 4 Hyper-parameters

Hyper-parameters Conﬁguration Note

Sample size 1e4 for Model 1 & 2, 1e6 for Model 3

k 100 for Model 1 & 2, 300 for Model 3 See Deﬁnition 9

Network stucture Std Net: Layer-wise residual connection He et al. (2016), batch normalization (Ioﬀe and Szegedy (2015)) after aﬃne transformation

Lip Net: Layer-wise residual connection (He et al. (2016)) with convex potential layer (Meunier et al. (2022)) Input dimension d X

Output dimension d Y Natom Natom = k, see (10)

Number of hidden layers 5

Number of neurons each hidden layer 2k k as in Deﬁnition 9

Activation function Std Net: Re LU Lip Net: ELU See Section 5.3.1

L 0.1 See (18)

τ 1e-3 See Algorithm 2

Optimizer Adam (Kingma and Ba (2017)) with learning rate 10 3 Learning rate is 10 2

for Std Net in Model 1 & 2

Batch size 100 for Model 1 & 2 256 for Model 3

Number of episodes 5e3 for Model 1 & 2, 1e4 for Model 3

RBSP setting 25 partition, 8 query points each part See Section 3.1.1

Random bisecting ratio Uniform([0.45, 0.55]) See Section 3.1.1 and Algorithm 3

Ratio for mandatory slicing along the longest edge 5 See Section 3.1.1 and redge in Algorithm 3

Number of Sinkhorn iterations 5, if epoch 500 10, if epoch > 500

ϵ 1, if epoch 100 0.1, if epoch [100, 500] 0.05, if epoch > 500 See (33)

Enforced sparsity Oﬀ, if epoch 500 On, if epoch > 500 See Section 5.2

γ 0.5 See (34)

Learning conditional distributions on continuous spaces

Table 5 Computational times of 500 training steps for Lip Net, with M = 106.

k d = 3 d = 10 30 (24.6, 12.1) (55.7, 16.3) 100 (27.0, 15.9) (56.2, 16.9) 300 (29.2, 18.5) (63.7, 24.7) 1000 (75.9, 64.8) (151.3, 134.2)

This table presents the training time of Lip Net for varying values of k and d X = d Y = d. For other hyper-parameters, please refer to Table 5. All times are reported in seconds. The ﬁrst entry in each cell corresponds to the training time when using exact nearest neighbor search, while the second entry represents the time when employing ANNS-RBSP. The experiments were conducted on a machine equipped with an Nvidia L40 GPU.

Appendix D. Another set of results on ﬂuctuation

D.1 On r-box estimator

Theorem 22 Under Assumptions 2 and 3, and choosing r as in Theorem 7, let ν P(X) be dominated by λX with constant C > 0. Then, there is a constant C > 0 (which depends only on d X, c, C and the constants involved in r), such that, for any ε 0, we have

X W Px, ˆP r x dν(x) E Z

X W Px, ˆP r x dν(x) + ε

2 d X+2 ε2 , d Y = 1, 2,

d X d X+d Y ε2 , d Y 3.

Proof Let ν P(X) as in the statement of the Theorem. We deﬁne

X W(Px, ˆP r x) dν(x),

and introduce the following discrete time ﬁltration: F0 := { , Ω} and Fm := σ(Sm i=1 σ(Xi, Yi)) for m = 1, . . . , M. We consider the Doob s martingale Zm := E [Z | Fm] , m = 1, . . . , M. Note that ZM = Z. We will apply Azuma-Hoeﬀding inequality (cf. (Wainwright, 2019, Corollary 2.20)) to complete the proof. Let us deﬁne

Dm := {(X1, Y1), . . . , (Xm, Ym), (xm+1, ym+1), . . . , (x M, y M)}, m = 1, . . . , M, (35)

D0 := {(xℓ, yℓ)}M ℓ=0, and DM := D. Note that, for all m < M, we have, by Assumptions 3 (i), conditional Fubini-Tonelli theorem, and independent lemma,

(X Y)M m W Px, ˆµDm Br(x) M O

ℓ=m+1 ψ(dxℓdyℓ)ν(dx).

This together with the linearity of integral, the fact that ψ is a probability, and the triangular inequality of W implies that for m = 1, . . . , M,

|Zm Zm 1| Z

(X Y)M m+1 W ˆµDm Br(x), ˆµDm 1 Br(x) M O

ℓ=m ψ(dxℓdyℓ)ν(dx). (36)

C. B en ezet, Z. Cheng, and S. Jaimungal

Notice that, by deﬁnitions (1) and (35), n ˆµDm Br(x) = ˆµDm 1 Br(x) o n Xm Br(x) o n xm Br(x) o . (37)

Additionally, by deﬁnitions (1) and (35) again, on the event that n ˆµDm Br(x) = ˆµDm 1 Br(x) o , we have

W ˆµDm Br(x), ˆµDm 1 Br(x)

ℓ=1 1Br(x)(Xℓ) +

ℓ=m+1 1Br(x)(xℓ)

ℓ=m+1 1Br(x)(xℓ)

Combining (36),(37), (38), and Fubini-Tonelli theorem, we get

|Zm Zm 1| Z

B(Xm,2r) B(xm,2r)

ℓ=m+1 1Br(x)(xℓ)

ℓ=m+1 ξ(dxℓ)ν(dx)ξ(dxm)

B(Xm,2r) B(xm,2r)

ℓ=m+1 1Br(x)(xℓ)

ℓ=m+1 ξ(dxℓ)ν(dx) (39)

where the 2r in the domain of the integral stems from the usage of βr in the deﬁnition of Br (see Deﬁnition 5). Now, for ﬁxed x, xm X, we have

ℓ=m+1 1Br(x)(xℓ)

ℓ=m+1 ξ(dxℓ)

ξ Br(x) ℓ 1 ξr Br(x) M m ℓ (1 + ℓ) 1

= 1 (M m + 1)ξ Br(x)

M m + 1 ℓ+ 1

ξ Br(x) ℓ+1 1 ξ Br(x) M m ℓ

= 1 (M m + 1)ξ Br(x)

ξ Br(x) ℓ 1 ξ Br(x) M m+1 ℓ

= 1 1 ξ Br(x) M m+1

(M m + 1)ξ Br(x) 1 (M m + 1)ξ Br(x) 1 1 (M m + 1)c(2r)d X 1 ,

where we have used Assumption 3 (ii) in the last inequality. Recall C introduced in Theorem 22. In view of (39), we have

|Zm Zm 1| sup xm X

B(Xm,2r) B(xm,2r) 1 (M m + 1)c(2r)d X 1 ν(dx)

2C(4r)d X 1 (M m + 1)c(2r)d X 1

C22d X+1rd X C2d X+1

c(M m + 1) := Cm.

By Azuma-Hoeﬀding inequality (cf. (Wainwright, 2019, Corollary 2.20)), one obtains

P(Z E [Z] ε) exp

2ε2 PM m=1 C2m

Learning conditional distributions on continuous spaces

To complete the proof, we substitute in the conﬁguration of Theorem 7. Since we only aim to investigate the rate of PM m=1 C2 m as M , we simply set

r = M 1 d X+d with d := 2 d Y.

It follows that

d X+d z 2 dz Z M

d X+d dz + Z

d X d X+d z 2 dz M d X d X+d ,

which completes the proof.

D.2 On k-nearest-neighbor estimator

Theorem 23 Under Assumptions 2 and 3, and the choice of k as in Theorem 10, there is a constant C > 0 (which depends only on c and the constants involved in k), such that, for any ν P(X) and ε 0, we have

X W Px, ˇP k x ν(dx) E Z

X W Px, ˇP k x ν(dx) + ε

2 d X+2 ε2 , d Y = 1, 2,

d Y d X+d Y ε2 , d Y 3.

Proof [Proof of Theorem 23] For notational convenience, we will write ˆµD N k(x) for ˆµD N k,D(x). Clearly,

with D = D, we recover ˆµD N k(x) = ˆµD N k,D(x) = ˇP k x . In what follows, we let

X W Px, ˇP k x ν(dx).

We also deﬁne F0 := { , Ω} and Fm := σ(Sm i=1 σ(Xi, Yi)) for m = 1, . . . , M. The proof relies on an application of Azuma-Hoeﬀding inequality (cf. (Wainwright, 2019, Corollary 2.20)) to the Doob s martingale {E [Z|Fm]}M m=0. In order to proceed, we introduce a few more notations:

x := (x1, . . . , x M), X := (X1, . . . , XM),

X m := (X1, . . . , Xm, xm+1, . . . , x M),

Dm := {(X1, Y1), . . . , (Xm, Ym), (xm+1, ym+1), . . . , (x M, y M)},

ηk,x x := the k-th smallest of { xm x }M m=1.

By independence lemma, we have

X W Px, ˆµDm N k(x) ν(dx)

ℓ=m+1 ψ(dxℓdyℓ)

i=1 1 Xi x ηk,Xm x δYi +

ℓ=m+1 1 xℓ x ηk,Xm x δyℓ

ℓ=m+1 ψ(dxℓdyℓ),

C. B en ezet, Z. Cheng, and S. Jaimungal

where we note that (X Y)M m and NM ℓ=m+1 ψ(dxℓdyℓ) in the right hand side can be replaced by (X Y)M m+1 and NM ℓ=m ψ(dxℓdyℓ) as the integrand is constant in xm and ψ is a probability measure. Therefore, by Fubini s theorem and triangle inequality for W, we have E Z Fm E Z Fm 1

(X Y)M m+1W

i=1 1 Xi x ηk,Xm x δYi+

ℓ=m+1 1 xℓ x ηk,Xm x δyℓ

i=1 1 Xi x ηk,Xm 1 x δYi+

ℓ=m 1 xℓ x ηk,Xm 1 x δyℓ

ℓ=m ψ(dxℓdyℓ) dx.

Above, the only diﬀerence between the two measures inside W is the m-th summand. Due to the deﬁnition of W and the boundedness of X, the transport cost induced by altering the m-th summand is at most k 1. It follows that E Z Fm E Z Fm 1 1

k, m = 1, . . . , M. (41)

Below we further reﬁne the upper bound of the absolute diﬀerence in the left hand side of (40) when m = 1, . . . , M k. For the integrand in the right hand side of (40) to be positive, it is necessary that

1 Xm x ηk,Xm x + 1 xm x ηk,Xm 1 x 1.

This, together with the tie breaking rule stipulated in Deﬁnition 9, further implies that

1Am 1 + 1Am 2 1,

Am 1 := n at most (k 1) of xℓ, ℓ= m + 1, . . . , M m, falls into B Xm x x o ,

Am 2 := n at most (k 1) of xℓ, ℓ= m + 1, . . . , M m, falls into B xm x x o .

Combining the above with the reasoning leading to (41), we yield E Z Fm E Z Fm 1

(X Y)M m1Am 1

ℓ=m+1 ξ(dxℓ)ν(dx) + Z

(X Y)M m+11Am 2

ℓ=m ξ(dxℓ)ν(dx)

Above, we have replaced ψ in (40) by ξ because Am 1 and Am 2 no longer depend on yℓ, ℓ= m+1, . . . , M. The analogue applies to the domain of integral as well. We continue to have E Z Fm E Z Fm 1

(X Y)M m1Am 1

ℓ=m+1 ξ(dxℓ)ν(dx) + Z

(X Y)M m+11Am 2

ℓ=m ξ(dxℓ)ν(dx)

k(Im 1 + Im 2 ).

Learning conditional distributions on continuous spaces

Regarding I1 m deﬁned in (42), note that by Assumption 3,

(X Y)M m 1Am 1

ℓ=m+1 ξ(dxℓ) = P h at most (k 1) of ˇX1, . . . , ˇXM m falls into B x x x i x =Xm,

where ˇX1, . . . , ˇXM m i.i.d. ξ. Below we deﬁne a CDF G(r) := crd, r [0, c 1

d ]. By Assumption 3 (ii), for any x, x X, we have Z

X 1ˇx B x x x ξ(dˇx) c

X 1ˇx B x x x dˇx c

X 1ˇx B x x 0 dˇx = G( x x ),

where we have used the fact that x x 1 c 1

d in the last equality. It follows from Lemma 17 that

(X Y)M m 1Am 1

ℓ=m+1 ξ(dxℓ)

G( Xm x )j 1 G( Xm x ) M m j,

and thus, by letting U Uniform(X),

G( Xm x )j 1 G( Xm x ) M m j dx

G( x U )j 1 G( x U ) M m j

where we note that the upper bounded no longer involves ν. For x X, it is obvious that

P x U r P U r , r R,

i.e., U stochastically dominates x U . Note additionally that, by Lemma 17 again, below is a non-decreasing function,

G(r)j 1 G(r) M m j.

Consequently,

G( U )j 1 G( U ) M m j dx

Since U has CDF r 7 rd X, r [0, 1] and G(r) = crd, r [0, c 1

d ], we continue to obtain

r=0 crd Xj(1 crd X)M m j drd X

(M m)! j!(M m j)!

0 rj(1 r)M m j dr.

C. B en ezet, Z. Cheng, and S. Jaimungal

With a similar calculation as in (26), which involves beta distribution and gamma function, we arrive at

(M m)! j!(M m j)! j!(M m j)!

(M m + 1)! c 1k M m. (43)

Regarding Im 2 deﬁned in (42), we ﬁrst let ˇX0, ˇX1, . . . , ˇXM m i.i.d. ξ. Then, note that

(X Y)M m 1Am 2

ℓ=m+1 ξ(dxℓ) P h at most (k 1) of ˇX1, . . . , ˇXM m falls into B ˇ X0 x x i

where the inequality in the second line is due to the symmetry stemming from Assumption 3 (i), and the fact that congestion along with the tie-breaking rule speciﬁed in Deﬁnition 9 may potentially rules out certain permutations. Consequently,

Im 2 k M m. (44)

Putting together (41), (42), (43), and (44), we yield

E Z Fm E Z Fm 1 Cm := C(c 1 + 1)

k, m = 1, . . . , M.

By Azuma-Hoeﬀding inequality (cf. (Wainwright, 2019, Corollary 2.20)),

X W Px, ˆµD N k(x) ν(dx) E Z

X W Px, ˆµD N k(x) ν(dx) ε exp

2 PM m=1 C2m

To complete the proof, we substitute in the conﬁguration of Theorem 10. Below we only investigate the rate of PM m=1 C2 m as M , and do not keep track of the constant. For simplicity, we set

d d X+d with d := 2 d Y

It follows that

M M d d X+d X

1 (M m)2 + M

M d d X+d 1 r2 dr + 1

d d X+d M d d X+d ,

which completes the proof.

Learning conditional distributions on continuous spaces

B. Acciaio and S. Hou. Convergence of adapted empirical measures on Rd. ar Xiv:2211.10162, 2023.

B. Acciaio, A. Kratsios, and G. Pammer. Designing universal causal deep learning models: The geometric (hyper)transformer. Mathematical Finance, 34:671 735, 2024.

C. D. Aliprantis and K. C. Border. Inﬁnite Dimensional Analysis: A Hitchhiker s Guide. Springer Verlag Berlin Heidelberg, 2006.

F. Altekr uger, P. Hagemann, and G. Steidl. Conditional generative models are provably robust: Pointwise guarantees for bayesian inverse problems. Transactions on Machine Learning Research, 2023.

A. Araujo, A. J. Havens, B. Delattre, A. Allauzen, and B. Hu. A uniﬁed algebraic perspective on lipschitz neural networks. The Eleventh International Conference on Learning Representations, 2023.

J. Backhoﬀ, D. Bartl, M. Beiglb ock, and J. Wiesel. Estimating processes in adapted wasserstein distance. Annals of Applied Probability, 32:529 550, 2022.

T. Bai, J. Luo, Jun Zhao, B. Wen, and Qian Wang. Recent advances in adversarial training for adversarial robustness. Proceedings of the Thirtieth International Joint Conference on Artiﬁcial Intelligence, 2021.

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.

D. M. Bashtannyk and R. J. Hyndman. Bandwidth selection for kernel conditional density estimation. Computational Statistics & Data Analysis, 36:279 298, 2001.

K. Bertin, C. Lacour, and V. Rivoirard. Adaptive pointwise estimation of conditional density function. Ann. Inst. H. Poincar e Probab. Statist., 52:939 980, 2016.

P. K. Bhattacharya and A. K. Gangopadhyay. Kernel and nearest-neighbor estimation of a conditional quantile. The Annals of Statistics, 18:1400 1415, 1990.

A. Bhowmick, M. D Souza, and G. S. Raghavan. Lipbab: Computing exact lipschitz constant of relu networks. International Conference on Artiﬁcial Neural Networks, 2021.

G. Biau and L. Devroye. Lectures on the Nearest Neighbor Method. Springer, 2015.

V. I. Bogachev. Measure Theory Volume II. Springer-Verlag Berlin Heidelberg, 2007.

J. Booth, P. Hall, and A. Wood. Bootstrap estimation of conditional distributions. The Annals of

Statistics, 20:1594 1610, 1992.

P. Bountakas, A. Zarras, A. Lekidis, and C. Xenakis. Defense strategies for adversarial machine learning: A survey. Computer Science Review, 49, 2023.

Z. Cheng and S. Jaimungal. Distributional dynamic risk measures in markov decision processes.

ar Xiv:2203.09612, 2023.

Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach. Advances in Neural Processing Systems, 2015.

C. B en ezet, Z. Cheng, and S. Jaimungal

D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). ar Xiv:1511.07289, 2016.

Anthony Coache, Sebastian Jaimungal, and Alvaro Cartea. Conditionally elicitable dynamic risk measures for deep reinforcement learning. SIAM Journal on Financial Mathematics, 14(4):1249 1289, 2023.

J. Cohen, E. Rosenfeld, and Z. Kolter. Certiﬁed adversarial robustness via randomized smoothing.

Proceedings of the 36th International Conference on Machine Learning, 97, 2019.

E. Demirkayaa, Y. Fan, L. Gao, J. Lv, P. Vossler, and J. Wang. Optimal nonparametric inference with two-scale distributional nearest neighbors. Journal of the American Statistical Association, 199:297 307, 2024.

L. Devroye. Necessary and suﬃcient conditions for the pointwise convergence of nearest neighbor regression function estimates. Probability Theory and Related Fields, 61:467 481, 1982.

R. M. Dudley. The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40 50, 1969.

J. Fan. Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87:998 1004, 1992.

M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas. Eﬃcient and accurate estimation of lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

M. Fazlyab, T. Entesari, A. Roy, and R. Chellappa. Certiﬁed robustness via dynamic margin maximization and improved lipschitz regularization. Advances in Neural Information Processing Systems, 36, 2024.

F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practics. Springer, 2006.

E. Fetaya, J., W. Grathwohl, and R. Zemel. Understanding the limitations of conditional generative models. International Conference on Learning Representations, 2020.

E. Fix and J. L. Hodges. Discriminatory analysis; nonparametric discrimination, consistency properties. USAF SAM Series in Statistics, Project No. 21-49-004, 1951.

N. Fournier. Convergence of the empirical measure in expected wasserstein distance: non-asymptotic explicit bounds in Rd. ESAIM: Probability and Statistics, 27:749 775, 2023.

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162:707 738, 2015.

T. Gasser and H. M uller. Kernel estimation of regression functions. Smoothing Technique for Curve

Estimation, pages 23 63, 1979.

M. Giegrich, R. Oomen, and C. Reisinger. K-nearest-neighbor resampling for oﬀ-policy evaluation in stochastic control. ar Xiv:2306.04836, 2024.

I. Gijbels and A. Goderniaux. Bandwidth selection for changepoint estimation in nonparametric regression. Technometrics, 46:76 86, 2004.

Learning conditional distributions on continuous spaces

H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393 416, 2021.

L. Gy orﬁ, M. Kohler, A. Krzy zak, and H. Walk. A Distribution-Free Theory of Nonparametric

Regression. Springer, 2002.

K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang. Fast approximate nearest-neighbor search with k-nearest neighbor graph. Proceedings of the Twenty-Second International Joint Conference on Artiﬁcial Intelligence, 2011.

P. Hall and J. D. Hart. Nonparametric regression with long-range dependence. Stochastic Processes

and their Applications, 36:339 351, 1990.

P. Hall, R. C. L. Wolﬀ, and Q. Yao. Methods for estimating a conditional distribution function.

Journal of the American Statistical Association, 94:154 163, 1999.

W. Hardle and J. S. Marron. Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13:1465 1481, 1985.

K. He, X. Zhang, S., and J. Sun. Deep residual learning for image recognition. Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016.

Ruiyang Hong and Anastasis Kratsios. Bridging the gap between approximation and learning via optimal approximation by relu mlps of maximal regularity. ar Xiv:2409.12335, 2024.

B. Hosseini, A. W. Hsu, and A. Taghvaei. Conditional optimal transport on function spaces. ar Xiv:2311.05672, 2024.

S. Hou. Convergence of the adapted smoothed empirical measures. ar Xiv:2401.14883, 2024.

W. Huang and W. B. Haskell. Risk-aware q-learning for markov decision processes. IEEE 56th

Annual Conference on Decision and Control, 2017.

Y. Huang, H. Zhang, Y. Shi, J. Z. Kolter, and A. Anandkumar. Training certiﬁably robust neural networks with eﬃcient local lipschitz bounds. Advances in Neural Information Processing Systems, 34, 2021.

S. Ioﬀe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning, pages 448 456, 2015.

M. Jordan and A. G. Dimakis. Exactly computing the local lipschitz constant of relu networks.

Advances in Neural Information Processing Systems, 33, 2021.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv:1412.6980, 2017.

Benoˆıt R. Kloeckner. Empirical measures: regularity is a counter-curse to dimensionality. ESAIM:

Probability and Statistics, 24:408 434, 2020.

M. Kohler, A. Krzy zak, and H. Walk. Optimal global rates of convergence for nonparametric regression with unbounded data. Journal of Statistical Planning and Inference, 193:1286 1296, 2009.

C. B en ezet, Z. Cheng, and S. Jaimungal

M. Kohler, A. Krzy zak, and H. Walk. Uniform convergence rate for nonparametric regrssion and principle component analysis with functional/longtitudeinal data. The Annals of Statistics, 38: 3321 3351, 2010.

M. K ohler, A. Schindler, and S. Sperlich. A review and comparison of bandwidth selection methods for kernel regression. International Statistical Review, 82:243 274, 2014.

A. Kratsios. Universal regular conditional distributions via probabilistic transformers. Constructive

Approximation, 57:1145 1212, 2023.

C. Lacour. Adaptive pointwise estimation of conditional density function. Ann. Inst. H. Poincar e

Probab. Statist., 43:571 597, 2007.

Wen Li, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. Approximate nearest neighbor search on high dimensional data experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, 32:1475 1488, 2020a.

Y. Li, S. Akbar, and J. Oliva. Acﬂow: Flow models for arbitrary conditional likelihoods. Proceedings

of the 37th International Conference on Machine Learning, 119:5831 5841, 2020b.

H. D. Liu, F. Williams, A. Jacobson, S. Fidler, and O. Litany. Learning smooth neural functions via lipschitz regularization. SIGGRAPH 22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, 2022.

Y. P. Mack. Local properties of k-nn regression estimates. SIAM Journal on Algebraic Discrete

Methods, 2:311 323, 1981.

F. Mart ınez, M. P. Fr ıas, M. D. P erez, and A. J. Rivera. A methodology for applying k-nearest neighbor to time series forecasting. Artiﬁcial Intelligence Review, 52:2019 2037, 2017.

Milan Merkle. Inequalities for the gamma function via convexity. Advances in Inequalities for Special Functions, pages 81 100, 2008.

L. Meunier, B. J. Delattre, A. Araujo, and A. Allauzen. A dynamical system perspective for lipschitz neural networks. International Conference on Machine Learning, 165, 2022.

M. Mirza and S. Osindero. Conditional generative adversarial nets. ar Xiv:1411.1784, 2014.

K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Sch olkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10:1 144, 2017.

E. A. Nadaraya. On estimating regression. Theory of Probability and Its Applications, 1964.

Bao Nguyen, Binh Nguyen, Hieu Trung Nguyen, and Viet Anh Nguyen. Generative conditional distributions by neural (entropic) optimal transport. Proceedings of the 41st International Conference on Machine Learning, 235:37761 37775, 2024.

O. H. M. Padilla, J. Sharpnack, Y. Chen, and D. M. Witten. Adaptive nonparametric regression with the k-nearest neighbour fused lasso. Technometrics, 107:293 310, 2020.

G. Papamakarios, T. Pavlakou, and Iain Murray. Masked autoregressive ﬂow for density estimation.

Advances in Neural Information Processing Systems, 30, 2017.

Learning conditional distributions on continuous spaces

P. Pauli, A. Koch, J. Berberich, P. K ohler, and F. Allg ower. Training robust neural networks using lipschitz bounds. IEEE Control Systems Letters, 6:121 126, 2022.

Gabriel Peyr e and Marco Cuturi. Computational Optimal Transport: With Applications to Data

Science. Foundations and Trends in Machine Learning, 2019.

G. Ch. Pﬂug and A. Pichler. From empirical observations to tree models for stochastic optimization: Convergence properties. SIAM Journal on Optimization, 26:1715 1740, 2016.

M. Rachdi, A. Laksaci, Z. Kaid, A. Benchiha, and F. A. Al-Awadhi. k-nearest neighbors local linear regression for functional and missing data at random. Statistica Neerlandica, 75, 2021.

P. Ram and K. Sinha. Revisiting kd-tree for nearest neighbor search. KDD 19: Proceedings of

the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 1378 1388, 2019.

D. Rudolf and N. Schweizer. Perturbation theory for markov chains via wasserstein distance. Bernoulli, 24:2610 2639, 2018.

J. J. Ryu and Y. Kim. Minimax regression via adaptive nearest neighbor. ar Xiv:2202.02464, pages 1447 1451, 2022.

D. Scott. Multivariate Density Estimation: Theory, Practics, and Visualization. Wiley, 2015.

D. Shah and Q. Xie. Q-learning with nearest neighbors. Advances in Neural Information Processing

Systems, 31, 2010.

Z. Shi, Y. Wang, H. Zhang, J. Z. Kolter, and Cho-Jui Hsieh. Eﬃciently computing local lipschitz constants of neural networks via bound propagation. Advances in Neural Information Processing Systems, 35, 2022.

J. S. Simonoﬀ. Smoothing Methods in Statistics. Springer, 1996.

S. Singla, S. Singla, and S. Feizi. Improved deterministic l2 robustness on cifar-10 and cifar-100.

The Tenth International Conference on Learning Representations, 2022.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of

Statistics, 10:1040 1053, 1982.

A. Trockman and J. Z. Kolter. Orthogonalizing convolutional layers with the cayley transform.

International Conference on Learning Representations, 36, 2021.

Y. Tsuzuku, I. Sato, and M. Sugiyama. Lipschitz-margin training: Scalable certiﬁcation of perturbation invariance for deep neural networks. Advances in Neural Information Processing Systems, 31, 2018.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

C edric Villani. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.

C. B en ezet, Z. Cheng, and S. Jaimungal

A. Virmaux and K. Scaman. Lipschitz regularity of deep neural networks: analysis and eﬃcient estimation. Advances in Neural Information Processing Systems, 31, 2018.

M. ˇSm ıd and V. Kozm ık. Approximation of multistage stochastic programming problems by smoothed quantization. Review of Managerial Science, 2024.

M. Vuleti c, F. Prenzel, and M. Cucuringu. Fin-gan: forecasting and classifying ﬁnancial time series via generative adversarial networks. Quantitative Finance, 24, 2024.

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.

R. Wang and I. Manchester. Direct parameterization of lipschitz-bounded deep networks. Proceedings of the 40th International Conference on Machine Learning, 202, 2023.

L. Wasserman. All of Nonparametric Statistics. Springer, 2006.

G. S. Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, 26: 359 372, 1964.

T. Xu and B. Acciaio. Conditional cot-gan for video prediction with kernel smoothing. Neur IPS

2022 Workshop on Robustness in Sequence Modeling, 2022.

A. Xue, L. Lindemann, A. Robey, H. Hassani, G. J. Pappas, and R. Alur. Chordal sparsity for lipschitz constant estimation of deep neural networks. 2022 IEEE 61st Conference on Decision and Control, 6:3389 3396, 2022.

B. Zhang, D. Jiang, D. He, and L. Wang. Rethinking lipschitz neural networks and certiﬁed robustness: A boolean function perspective. Advances in Neural Information Processing Systems, 35, 2022.

P. Zhao and L. Lai. Minimax regression via adaptive nearest neighbor. 2019 IEEE International

Symposium on Information Theory, pages 1447 1451, 2019.

W. Zhao and E. G. Tabak. Adaptive kernel conditional density estimation. 2023.