# neural_conditional_probability_for_uncertainty_quantification__3fab5416.pdf Neural Conditional Probability for Uncertainty Quantification Vladimir R. Kostic1,2 Karim Lounici3 Gr egoire Pacreau3 Giacomo Turri1 Pietro Novelli1 Massimiliano Pontil1,4 1CSML, Istituto Italiano di Tecnologia 2University of Novi Sad 3CMAP-Ecole Polytechnique 4AI Centre, University College London We introduce Neural Conditional Probability (NCP), an operator-theoretic approach to learning conditional distributions with a focus on statistical inference tasks. NCP can be used to build conditional confidence regions and extract key statistics such as conditional quantiles, mean, and covariance. It offers streamlined learning via a single unconditional training phase, allowing efficient inference without the need for retraining even when conditioning changes. By leveraging the approximation capabilities of neural networks, NCP efficiently handles a wide variety of complex probability distributions. We provide theoretical guarantees that ensure both optimization consistency and statistical accuracy. In experiments, we show that NCP with a 2-hidden-layer network matches or outperforms leading methods. This demonstrates that a a minimalistic architecture with a theoretically grounded loss can achieve competitive results, even in the face of more complex architectures. 1 Introduction This paper studies the problem of estimating the conditional distribution associated with a pair of random variables, given a finite sample from their joint distribution. This problem is fundamental in machine learning, and instrumental for various purposes such as building prediction intervals, performing downstream analysis, visualizing data, and interpreting outcomes. This entails predicting the probability of an event given certain conditions or variables, which is a crucial task across various domains, ranging from finance (Markowitz, 1958) to medicine (Ray et al., 2017), to climate modeling (Harrington, 2017) and beyond. For instance, in finance, it is essential for risk assessment to estimate the probability of default given economic indicators. Similarly, in healthcare, predicting the likelihood of a disease, given patient symptoms, aids in diagnosis. In climate modeling, estimating the conditional probability of extreme weather events such as hurricanes or droughts, given specific climate indicators, helps in disaster preparedness and mitigation efforts. According to Gao and Hastie (2022), there exist four main strategies to learn the conditional distribution. The first one relies on the Bayes formula for densities and proposes to apply non-parametric statistics to learn the joint and marginal densities separately. However, most of non-parametric techniques face a significant challenge known as the curse of dimensionality (Scott, 1991; Nagler and Czado, 2016). The second strategy, also known as Localization method, involves training a model unconditionally on reweighted samples, where weights are determined by their proximity to the desired conditioning point (Hall et al., 1999; Yu and Jones, 1998). These methods require retraining the model whenever the conditioning changes and may also suffer from the curse of dimensionality if the weighting strategy treats all covariates equally. The third strategy, known as Direct Learning of the conditional distribution involves finding the best linear approximation of the conditional density on a dictionary of base functions or a kernel space (Sugiyama et al., 2010; Li et al., 2007). 
The 38th Conference on Neural Information Processing Systems (Neur IPS 2024). performance of these methods relies crucially on the selection of bases and kernels. Again for high-dimensional settings, approaches that assign equal importance to all covariates may be less effective. Finally, the fourth strategy, known as Conditional Training, involves training models to estimate a target variable conditioned on certain covariates. This is typically based on partitioning the covariates space X into sets, followed by training models unconditionally within each partition (see Gao and Hastie, 2022; Winkler et al., 2020; Lu and Huang, 2020; Dhariwal and Nichol, 2021, and references therein). However, this strategy requires a large dataset to provide enough samples for each conditioning and is expensive as it requires training separate models for each conditioning input set, even though they stem from the same underlying joint distribution. Contributions The principal contribution of this work is a different conditional probability approach that does not fall into any of the four aforementioned strategies. Rather than learning the conditional density directly, our method, called Neural Conditional Probability (NCP), aims to learn the conditional expectation operator EY |X associated to the random variables X X and Y Y based on data from their joint distribution. The operator is defined, for every measurable function f : Y R, as [EY |Xf](x) := E[f(Y ) | X = x]. NCP is based on a principled loss, leveraging the connection between conditional expectation operators and deep CCA (Andrew et al., 2013) established in (Kostic et al., 2024), and can be used interchangeably to: (a) retrieve the conditional density p Y |X with respect to marginal distributions of X and Y ; (b) compute conditional statistics E[f(Y ) | X] for arbitrary functions f : Y R, including conditional mean, variance, moments, and the conditional cumulative distribution function, thereby providing access to all conditional quantiles simultaneously; (c) estimate the conditional probabilities P[Y B | X A] for arbitrary sets B Y and A X with theoretical non-asymptotic guarantees on accuracy, allowing us to easily construct conditional confidence regions. Notably, our approach extracts statistics directly from the trained operator without retraining or resampling, and it is supported by both optimization consistency and statistical guarantees. In addition our experiments show that our approach matches or exceeds the performance of leading methods, even when using a basic a 2-hidden-layer network. This demonstrates the effectiveness of a minimalistic architecture combined with a theoretically grounded loss function. Paper organization In Section 2 we review related work. Section 3 introduces the operator theoretic approach to model conditional expectation, while Section 4 discusses its training pipeline. In Section 5, we derive learning guarantees for NCP. Finally, Section 6 presents numerical experiments. 2 Related works Non-parametric estimators are valuable for density and conditional density estimation as they don t rely on specific assumptions about the density being estimated. Kernel estimators, pioneered by Parzen (1962) and Rosenblatt (1956), are a widely used non-parametric density estimation method. 
Much effort has been dedicated to enhancing kernel estimation, focusing on aspects like bandwidth selection (Goldenshluger and Lepski, 2011), non-linear aggregation (Rigollet and Tsybakov, 2007), and computational efficiency (Langren e and Warin, 2020), as well as extending it to conditional densities (Bertin et al., 2014). A comprehensive review of kernel estimators and their variants is provided in (Silverman, 2017). See also (Tsybakov, 2009) for a statistical analysis of their performance. However, most of non-parametric techniques face a significant challenge known as the curse of dimensionality (Scott, 1991; Nagler and Czado, 2016), meaning that the required sample size for accurate estimation grows exponentially with the dimensionality of the data (Silverman, 2017). Additionally, the computational complexity also increases exponentially with dimensionality (Langren e and Warin, 2020). Examples of localization methods include the work by Hall et al. (1999) for conditional CDF estimation using local logistic regression and locally adjusted Nadaraya-Watson estimation, as well as conditional quantiles estimation via local pinball loss minimization in (Yu and Jones, 1998). Examples of direct learning of the conditional distribution include (Sugiyama et al., 2010) via decomposition on a dictionary of base functions. Similarly, Li et al. (2007) explores quantile regression in reproducing Hilbert kernel spaces. Conditional training is a popular approach which was adopted in numerous works, as in the recent work by Gao and Hastie (2022) where a parametric exponential model for the conditional density pθ(y|x) is trained using the Lindsey method within each bin of a partition of the space X. This strategy has also been implemented in several prominent classes of generative models, including Normalizing Flow (NF) and Diffusion Models (DM) (Tabak and Vanden-Eijnden, 2010; Dinh et al., 2014; Rezende and Mohamed, 2015a; Sohl-Dickstein et al., 2015). These models work by mapping a simple probability distribution into a more complex one. Conditional training approaches for NF and DM have been developed in many works including (e.g. Winkler et al., 2020; Lu and Huang, 2020; Dhariwal and Nichol, 2021). In efforts to lower the computational burden of conditional diffusion models, an alternative approach used heuristic approximations applied directly to unconditional diffusion models on computer vision related tasks (see e.g. Song et al., 2023; Zhang et al., 2023). However, the effectiveness of these heuristics in accurately mimicking the true conditional distributions remains uncertain. Another crucial aspect of these classes of generative models is that while the probability distribution is modelled explicitly, the computation of any relevant statistic, say E[Y |X] is left as an implicit problem usually solved by sampling from pθ(y|x) and then approximating E[Y |X] via simple Monte-Carlo integration. As expected, this approach quickly becomes problematic as the dimension of the output space Y becomes large. Conformal Prediction (CP) is a popular model-agnostic framework for uncertainty quantification (Vovk et al., 1999). Conditional Conformal Prediction (CCP) was later developed to handle conditional dependencies between variables, allowing in principle for more accurate and reliable predictions (see Lei and Wasserman, 2014; Romano et al., 2019; Chernozhukov et al., 2021; Gibbs et al., 2023, and the references cited therein). However, (CP) and (CCP) are not without limitations. 
The construction of these guaranteed prediction regions need to be recomputed from scratch for each value of the confidence level parameter and of the conditioning for (CCP). In addition, the produced confidence regions tend to be conservative. 3 Operator approach to probability modeling Consider a pair of random variables X and Y taking values in probability spaces (X, ΣX , µ) and (Y, ΣY, ν), respectively, where X and Y are state spaces, ΣX and ΣY are sigma algebras, and µ and ν are probability measures. Let ρ be the joint probability measure of (X, Y ) from the product space X Y. We assume that ρ is absolutely continuous w.r.t. to the product measure of its marginals, that is ρ µ ν, and denote the corresponding density by p = dρ/d(µ ν), also called point-wise dependency in Tsai et al. (2020), so that ρ(dx, dy) = p(x, y)µ(dx)ν(dy). The principal goal of this paper is, given a dataset Dn := (xi, yi)i [n] of observations of (X, Y ), to estimate the conditional probability measure p(B | x) := P[Y B | X = x], x X, B ΣY. (1) Our approach is based on the simple fact that p(B | x) = E[1B(Y ) | X = x], where 1B denotes the characteristic function of set B. More broadly we address the above problem by studying the conditional expectation operator EY |X : L2 ν(Y) L2 µ(X), which is defined, for every f L2 ν(Y) and x X, as [EY |Xf](x) := E[f(Y ) | X = x] = Z Y f(y)p(dy | x) = Z Y f(y)p(x, y)ν(dy), where L2 µ(X) and L2 ν(Y) denotes the Hilbert spaces of functions that are square integrable w.r.t. to µ and ν, respectively. One readily verifies that EY |X = 1 and EY |X1Y = 1X . A prominent feature of the above operator is that its rank can reveal the independence of the random variables. That is, X and Y are independent random variables if and only if EY |X is a rank one operator, in which case we have that EY |X = 1X 1Y. It is thus useful to consider the deflated operator DY |X = EY |X 1X 1Y : L2 ν(Y) L2 µ(X), for which we have that [EY |Xf](x) = E[f(Y )] + [DY |Xf](x), f L2 ν(Y). (2) For dependent random variables, the deflated operator is nonzero. In many important situations, such as when the conditional probability distribution is a.e. absolutely continuous w.r.t. to the target measure, that is p( | x) ν for µ-a.e. x X, the operator EY |X is compact, and, hence, we can write the SVD of EY |X and DY |X respectively as i=0 σ i u i v i , and DY |X = i=1 σ i u i v i , (3) where the left (u i )i N and right (v i )i N singular functions form complete orthonormal systems of L2 µ(X) and L2 ν(Y), respectively. Notice that the only difference in the SVD of EY |X and DY |X is the extra leading singular triplet (σ 0, u 0, v 0) = (1, 1µ, 1ν) of EY |X. In terms of densities, the SVD of EY |X leads to the characterization p(x, y) = P i=0σ i u i (x) v i (y) = 1 + P i=1σ i u i (x) v i (y). The mild assumption that EY |X is a compact operator allows one to approximate it arbitrarily well with a (large enough) finite rank (empirical) operator. Choosing the operator norm as the measure of approximation error and appealing to the Eckart-Young-Mirsky Theorem (see Theorem 3 in Appendix B.1) one concludes that the best approximation is given by the truncated SVD, that is for every d N, DY |X [[DY |X]]d := Pd i=1σ i u i v i , and [[DY |X]]d arg minrank(A) d DY |X A , (4) where the minimum is given by σ d, and the minimizer is unique whenever σ d+1 < σ d. This leads to the approximation of the joint density w.r.t. 
marginals p(x, y) 1 + Pd i=1 σ i u i (x) v i (y), so that E[f(Y ) | X = x] E[f(Y )] + Pd i=1 σ i u i (x)E[f(Y ) v i (Y )], (5) which in particular, choosing f = 1B, gives P[Y B | X = x] P[Y B] + Pd i=1 σ i u i (x) E[v i (Y ) 1B(Y )]. Moreover, we have that P[Y B | X A] = 1A, EY |X1B P[X A] P[Y B] + i=1 σ i E[u i (X)1A(X)] P[X A] E[v i (Y ) 1B(Y )], for which the approximation error is bounded in the following lemma. Lemma 1 (Approximation bound). For any A ΣX such that P[X A] > 0 and any B ΣY, P[Y B | X A] P[Y B] 1A, [[DY |X]]d1B P[Y B] P[X A]. (6) Neural network model Inspired by the above observations, to build the NCP model, we will parameterize the truncated SVD of the conditional expectation operator and then learn it. Specifically, we introduce two parameterized embeddings uθ : X Rd and vθ : Y Rd, and the singular values parameterized by wθ Rd, respectively given by uθ(x):=[uθ 1(x) . . . uθ d(x)] , vθ(y):=[vθ 1(y) . . . vθ d(y)] , and σθ:=[e (wθ 1)2, . . . , e (wθ d)2] , where the parameter θ takes values in a prescribed set Θ. We then aim to learn the joint density function p(x, y) in the (separable) form pθ(x, y) := 1 + P i [d]σθ i uθ i (x) vθ i (y) = 1 + σθ uθ(x), vθ(y) , where denotes element-wise product. One of the prominent losses considered for the task of learning p L2 µ ν is the least squares density ratio loss Eµ ν(p pθ)2 Eρp = Eµ νp2 θ 2Eρpθ, c.f. Tsai et al. (2020), also considered by Hao Chen et al. (2022) in the specific context of augmentation graph in self-supervised deep learning, linked to kernel embeddings (Wang et al., 2022), and rediscovered and tested on Deep CCA tasks by Wells et al. (2024). Here, following the operator perspective, we use the characterization (4) of the optimal finite rank model to propose a new loss that: (1) excludes the known feature from the learning process, and (2) introduces a penalty term to enforce orthonormality of the basis functions. More precisely, our loss Lγ(θ) := L(θ) + γR(θ) is composed of two terms. The first one L(θ) := Eµ νp2 θ 2Eρpθ + [Eµ νpθ]2 Eµ[Eνpθ]2 Eν[Eµpθ]2 + 2Eµ νpθ (7) is equivalent to solving (4) with A = Pd i=1 σθ i [uθ i Eµuθ i ] [vθ i Eνvθ i ] and can be written in terms of correlations between features. Namely, denoting the covariance and variance matrices by Cov[z, z ] := E[(z E[z])(z E[z ]) ] and Var[z] := E[(z E[z])(z E[z]) ], (8) and abbreviating uθ := uθ(X) and vθ := vθ(Y ) for simplicity, we can write L(θ) := tr Var[ σθ uθ] Var[ σθ vθ] 2 Cov[ σθ vθ] . (9) If p=pθ for some θ Θ, then the optimal loss is the χ2-divergence L(θ)=Dχ2(ρ | µ ν)= P and, as we show below, L(θ) measures how well pθ(x, y) 1 approximates P i [d]σ i u i (x) v i (y). However, in order to obtain a useful probability model, it is of paramount importance to align the metric in the latent spaces with the metrics in the data-spaces L2 µ(X) and L2 ν(Y). For different reasons, a similar phenomenon has been observed in Kostic et al. (2024) where dynamical systems are learned via transfer operators. In our setting, this leads to the second term of the loss that measures how well features uθ and vθ span relevant subspaces in L2 µ(X) and L2 ν(Y), respectively. Namely, aiming E[u i (X)u j(X)] = E[v i (Y )v j (Y )] = 1{i=j}, i, j {0, 1, . . . , d} leads to R(θ):= E[uθ(X)uθ(X) ] I 2 F + E[vθ(Y )vθ(Y ) ] I 2 F +2 E[uθ(X)] 2+2 E[vθ(Y ) 2. (10) We now state our main result on the properties of the loss Lγ, which extends the result in Wells et al. (2024) to infinite-dimensional operators and guarantees the uniqueness of the optimum due to R. 
Theorem 1. Let EY |X : L2 ν(Y) L2 µ(X) be a compact operator and DY |X = P i=1 σ i u i v i be the SVD of its deflated version. If uθ i L2 µ(X) and vθ i L2 ν(Y), for all θ Θ and i [d], then for every θ Θ, Lγ(θ) P i [d] σ 2 i . Moreover, if γ > 0 and σ d > σ d+1, then the equality holds if and only if (σθ i , uθ i , vθ i ) equals (σ i , u i , v i ) ρ-a.e., up to unitary transform of singular spaces. We provide the proof in Appendix B.3. In the following section, we show how to learn these canonical features from data and construct approximations of the conditional probability measure. Comparison to previous methods NCP does not fall into any of the four categories defined by Gao and Hastie (2022), as it does not aim to learn conditional density of Y |X directly. Instead, NCP focuses on learning the operator mapping L2 ν(Y) L2 µ(X), from which all relevant task-specific statistics can be derived without requiring retraining. This approach effectively integrates with deep representation learning to create a latent space adapted to p(y|x). As a result, NCP efficiently captures the intrinsic dimension of the data, which is supported by our theoretical guarantees that depend solely on the latent space dimension (Theorem 2). In contrast, strategies designed for learning density often encounter significant limitations, such as the curse of dimensionality, potential substantial misrepresentation errors when the pre-specified function dictionary misaligns with the true distribution p(y|x), and high computational complexity due to the need for retraining. Experiments confirm NCP s capability to learn representations tailored to a wide range of data types including manifolds, graphs, and high-dimensional distributions without relying on predefined dictionaries. This flexibility allows NCP to outperform popular aforementioned methods. 4 Training the NCP inference method In this section, we discuss how to train the model. Given a training dataset Dn = (xi, yi)i [n] and networks (uθ, vθ, σθ), we consider the empirical loss b Lγ(θ) := b L(θ)+γ b R(θ), where we replaced (9) and (10) by their empirical versions. In order to guarantee the unbiased estimation, as we show within the proof of Theorem 1, two terms of our loss can be written using two independent samples (X, Y ) and (X , Y ) from ρ as L(θ)=E[L(uθ(X) Euθ(X), uθ(X ) Euθ(X ), vθ(Y ) Evθ(Y ), vθ(Y ) Evθ(Y ), σθ)] and R(θ)=E[R(uθ(X), uθ(X ), vθ(Y ), vθ(Y ))], where the loss functionals L and R are defined for u, u , v, v Rd and s [0, 1]d as L(u, u , v, v , s):= 1 2 u diag(s)v 2+ 1 2 v diag(s)u 2 u diag(s)v v diag(s)u , (11) R(u, u , v, v ):=(u u )2 (u u ) (u u )+(v v )2 (v v ) (v v )+2d. (12) Therefore, at every epoch we take two independent batches D1 n and D2 n of equal size from Dn, leading to Algorithm 1. See Appendix A.1 for the full discussion, and Appendix A.2, where we also provide in Figure 4 an example of learning dynamics. 
Algorithm 1 Separable density learning procedure Require: training data (Xtrain,Ytrain) train uθ, σθ and vθ using the NCP loss Center and scale Xtrain and Ytrain for each epoch do From (Xtrain,Ytrain) pick two random batches (Xtrain,Ytrain) and (X train,Y train) Evaluate: U uθ(Xtrain), U uθ(X train), V vθ(Ytrain), V vθ(Y train) Compute b L(θ) as an unbiased estimate using (9) or (11) Compute b R(θ) as an unbiased estimate using (10) or (12) Compute NCP loss b Lγ(θ) := b L(θ) + γ b R(θ) and back-propagate end for Practical guidelines for training In the following, we briefly report a few aspects to be kept in mind when using the NCP in practice, referring the reader to Appendix A for further details. The computational complexity of loss estimation presents three distinct methodological approaches. The first method utilizes unbiased estimation via covariance calculations in (9) and (10), achieving a computational complexity of O(nd2) for a batch size n. An alternative approach employing Ustatistics with (11) and (12) requires O(n2d) operations per iteration, offering the estimation of the same precision. A third method involves batch averaging of (11) and (12), reducing computational complexity to O(nd), which enables seamless integration with contemporary deep learning frameworks, albeit potentially compromising training robustness through less accurate 4th-order moment estimations. Method selection remains contingent upon the specific problem s computational and statistical constraints. Further, the size of latent dimension d, as indicated by Theorem 1 relates to the problem s difficulty in the sense of smoothness of joint density w.r.t. its marginals. Lastly, after the training, an additional post-processing may be applied to ensure the orthogonality of features uθ and vθ and improve statistical accuracy of the learned model. Performing inference with the trained NCP model We now explain how to extract important statistical objects from the trained model (buθ, bvθ, σθ). To this end, define the empirical operator b Dθ Y |X : L2 ν(Y) L2 µ(X) [b Dθ Y |Xf](x):=P i [d] σθ i buθ i (x) b Ey[bvθ i f], f L2 ν(Y), x X, (13) where b Ey[bvθ i f] := 1 n P j [n] bvθ i (yj)f(yj). Then, without any retraining nor simulation, we can compute the following statistics: Conditional Expectation: [b Eθ Y |Xf](x):=b Eyf +[b Dθ Y |Xf](x), f L2 ν(Y), x X. Conditional moments of order α 1: apply previous formula to f(u) = uα. Conditional covariance: d Cov θ(Y |X) := b Eθ Y |X[Y Y ] b Eθ Y |X[Y ]b Eθ Y |X[Y ]. Conditional probabilities: apply the above conditional expectation formula with f(y) = 1B(y), that is, bpy(B) = b Ey[1B] and bpθ(B | x) = bpy(B)+P i [d] σθ i buθ i (x) b Ey[bvθ i 1B], B ΣY, x X. Then, integrating over an arbitrary set A ΣX we get bpθ(B | A) := bpy(B) + P i [d] σθ i b Ex[buθ i 1A] b Ex[1A] b Ey[bvθ i 1B]. (14) Conditional quantiles: for scalar output Y , the conditional CDF b FY |X A(t) is obtained by taking B = ( , t], and in Algorithm 3 in Appendix C we show how to extract quantiles from it. 5 Statistical guarantees We introduce some standard assumptions needed to state our theoretical learning guarantees. To that end, for any A ΣX and B ΣY we define important constants, followed by the main assumption, P[X A] and φY (B):=1 Assumption 1. There exists finite absolute constants cu, cv > 1 such that for any θ Θ ess sup x µ uθ(x) l cu, ess sup y ν vθ(y) l cv. 
Next, we set σ2 θ(X):=Var( uθ(X) E[uθ(X)] l2), σ2 θ(Y ):=Var( vθ(Y ) E[vθ(Y )] l2) and ϵn(δ):=C (cu cv) d log(eδ 1) n +(σθ(X) σθ(Y )) for some large enough absolute constant C > 0. Remark 1. It follows easily from Assumption 1 that σ2 θ(X) c2 ud and σ2 θ(Y ) c2 vd. Consequently, assuming that n (cu cv)d, then ϵn(δ) (cu cv) log(eδ 1)/n (log(eδ 1)/n)]. Finally, for a given parameter θ Θ and δ (0, 1), let us denote Eθ :=max{ [[DY |X]]d UθSθV θ , U θ Uθ I , U θ 1X , V θ Vθ I , V θ 1Y }, and (16) ψn(δ) := σ d+1 + Eθ + 2 p 1 + Eθ(Eθ + εn(δ)) + [εn(δ)]2. (17) In the following result, we prove that NCP model approximates well the conditional probability distribution w.h.p. whenever the empirical loss b Lγ(θ) is well minimized. Theorem 2. Let Assumption 1 be satisfied, and in addition assume that P(X A) P(Y B) ϵn(δ/3) and n (cu cv)2d _ 8 log(6δ 1) [φX(A) φY (B)] . (18) Then for every A ΣX \ {X} and B ΣY \ {Y} P[Y B | X A] P[Y B] bpθ(B | A) 4ψn(δ/3) + [1+ψn(δ/3)] [2φX(A)+4φY (B)] ϵn(δ/3) p P[X A]P[Y B] , (19) and P[Y B | X A] bpθ(B | A) φY (B)ϵn(δ/3)+2(1+ψn(δ/3))φX(A)ϵn(δ/3)+ψn(δ/3) p P[X A]P[Y B] (20) hold with probability at least 1 δ w.r.t. iid draw of the dataset Dn = (xj, yj)j [n] from ρ. Remark 2. In Appendix B.5, we prove a similar result under a less restrictive sub-Gaussian assumption on the singular functions uθ(X) and vθ(Y ). Discussion The rate ψn(δ) in (17) is pivotal for the efficacy of our method. If we appropriately choose the latent space dimension d to ensure accurate approximation (σ d+1 1), achieve successful training (Eθ σ d+1), and secure a large enough sample size (εn(δ) 1), Theorem 2 provides assurance of accurate prediction of conditional probabilities. Indeed, (20) guarantees (up to a logarithmic factor) P[Y B | X A] bpθ(B | A)=OP P[Y B] P[X A] σ d+1+Eθ+ p d/n+φX(A)/ n ! Note the inclusion of the term p P[X A] in the denominator of the last term on the right-hand side, along with φX(A). This indicates a decrease in the accuracy of conditional probability estimates for rarely encountered event A, aligning with intuition and with a known finite-sample impossibility result Lei and Wasserman (2014, Lemma 1) for conditional confidence regions when A is reduced to any nonatomic point of the distribution (i.e. A = {x} with P[X = x] = 0). For rare events, a larger sample size n and a higher-dimensional latent space characterized by d are necessary for accurate estimation of conditional probabilities. We propose next a non-asymptotic estimation guarantee for the conditional CDF of Y |X when Y is a scalar output. This result ensures in particular that accurate estimation of the true quantiles is possible with our method. Fix t R and consider the set Bt = ( , t] meaning that P[Y Bt|X A]=FY |X A(t) and P[Y Bt]=FY (t). We define similarly for the NCP estimator of the conditional CDF b FY |X A(t)=bpθ(Bt | A). The result follows from applying (20) to the set Bt. Table 1: Mean and standard deviation of Kolmogorov-Smirnov distance of estimated CDF from the truth averaged over 10 repetitions with n = 105 (best method in red, second best in bold black). 
Model Linear Gaussian Econ Density Arma Jump Skew Normal Gaussian Mixture LGGMD NCP - W 0.010 0.000 0.005 0.001 0.010 0.002 0.008 0.001 0.015 0.004 0.047 0.005 DDPM 0.410 0.340 0.236 0.217 0.338 0.317 0.250 0.224 0.404 0.242 0.405 0.218 NF 0.008 0.006 0.006 0.003 0.143 0.010 0.032 0.002 0.107 0.003 0.254 0.004 KMN 0.601 0.004 0.362 0.017 0.487 0.004 0.381 0.009 0.309 0.001 0.224 0.005 MDN 0.225 0.013 0.048 0.001 0.163 0.018 0.087 0.001 0.129 0.007 0.176 0.013 LSCDE 0.420 0.001 0.118 0.002 0.247 0.001 0.107 0.001 0.202 0.001 0.268 0.024 CKDE 0.120 0.000 0.010 0.001 0.072 0.001 0.023 0.001 0.048 0.001 0.230 0.014 NNKCDE 0.047 0.003 0.036 0.003 0.030 0.004 0.030 0.002 0.035 0.002 0.183 0.006 RFCDE 0.128 0.007 0.141 0.009 0.133 0.015 0.142 0.012 0.130 0.012 0.121 0.006 FC 0.095 0.005 0.011 0.001 0.033 0.002 0.035 0.007 0.016 0.001 0.047 0.003 LCDE 0.108 0.001 0.026 0.001 0.113 0.002 0.075 0.006 0.035 0.001 0.124 0.002 Corollary 1. Let the Assumptions of Theorem 2 be satisfied. Then for any t R and δ (0, 1), it holds with probability at least 1 δ that | b FY |X A(t) FY |X A(t)| p FY (t)(1 FY (t))ϵn(δ/3) FY (t) P[X A] 2 + 1)ϵn(δ/3) + 4φX(A)ϵn(δ/3) . (21) An important application of Corollary 1 lies in uncertainty quantification when output Y is a scalar. Indeed, for any α (0, 1/2), we can scan the empirical conditional CDF b FY |X A for values tα < t α such that b FY |X A(t α) b FY |X A(tα) = 1 α and t α tα is minimal. That way we define a non-asymptotic conditional confidence interval b Bα := (tα, t α] with approximate coverage 1 α. More precisely we deduce from Corollary 1 that |P[Y b Bα | X A] (1 α)| 1 2 + 1)ϵn(δ/6) + 4φX(A)ϵn(δ/6) . (22) In App B.6, we derive statistical guarantees for the conditional expectation and covariance of Y . 6 Experiments Conditional density estimation We applied our NCP method to a benchmark of several conditional density models including those of Rothfuss et al. (2019); Gao and Hastie (2022). See Appendix C.1 for the complete description of the data models and the complete list of compared methods in Tab. 2 with references. We also plotted several conditional CDF along with our NCP estimators in Fig. 6. To assess the performance of each method, we use Kolmogorov-Smirnov (KS) distance between the estimated and the true conditional CDFs. We test each method on nineteen different conditional values uniformly sampled between the 5%- and 95%-percentile of p(x) and computed the averaged performance over all the used conditioning values. In Tab. 1, we report mean performance (KS distance std) computed over 10 repetitions, each with a different seed. NCP with whitening (NCP W) outperforms all other methods on 4 datasets, ties with Flex Code (FC) on 1 dataset, and ranks a close second on another one behind NF. These experiments underscore NCP s consistent performance. We also refer to Tab. 3 in App C.1 for an ablation study on post-treatments for NCP. Confidence regions Our goal is to estimate conditional confidence intervals for two different data models (Laplace and Cauchy). We investigate the performance of our method in (22) and compare it to the popular conditional conformal prediction approach. We refer to App C.2 for a quick description of the principle underlying CCP. We trained an NCP model combined with an MLP architecture followed by whitening post-processing. See App C.2 for the full description. 
We obtained that 0 1 2 3 4 5 50 NCP (mse = 0.0083) data points predicted expectation true confidence interval estimated confidence interval at level 90% 0 1 2 3 4 5 50 NF (mse = 0.035) 0 1 2 3 4 5 50 Conditional Conformal (mse = 0.0036) 0 1 2 3 4 5 data points true confidence interval estimated confidence interval at level 90% 0 1 2 3 4 5 0 1 2 3 4 5 Conditional Conformal Figure 1: Conditional mean (top only) and 90% confidence interval for NCP, NFs and CCP. Top: Laplace distribution; Bottom: Cauchy distribution. way the NCP conditional CDE model that we used according to (22) to build the conditional 90% confidence intervals. We proceeded similarly to build another set of conditional confidence intervals based on NFs. Finally, we also implemented the CCP method of Gibbs et al. (2023). In Fig. 1, the marginal is X Unif([0, 5]) and Y |X = x follows either a Laplace distribution (top) with location and scale parameters (µ(x), b(x)) = (x2, x) or a Cauchy distribution (bottom) with location and scale parameters (x2, 1 + x). In this experiment, we considered a favorable situation for the CCP method of Gibbs et al. (2023) by assuming prior knowledge that the true conditional location is a polynomial function (the truth is actually the square function). Every other parameter of the method was set as prescribed in their paper. In Fig. 1, observe first that the CCP regression achieves the best estimation of the conditional mean mse = 3.6 10 3 against mse = 3.8 10 2 for NFs and mse = 8.3 10 3 for NCP, as expected since the CCP regression model is well-specified in this example. However, the CCP confidence intervals are unreliable for most of the considered conditioning. We also notice instability for NF and CCP when conditioning in the neighborhood of x = 0, with the NF confidence region exploding at x = 0. We suspect this is due to the fact that the conditional distribution at x = 0 is degenerate, hence violating the condition of existence of a diffeomorphism with the generating prior, a fundamental requirement for NFs models to work at all. Comparatively, NCP does not exhibit such instability around x = 0; it only tends to overestimate the confidence region for conditioning close to x = 0. The Cauchy distribution is known to be more challenging due to its heavy tail and undefined moments. In Fig 1 (bottom), we notice that NF and CCP completely collapse. This is not a surprising outcome since CCP relies on estimation of the mean which is undefined in this case, creating instability in the constructed confidence regions, while NF attempts to build a diffeomorphism between a Gaussian prior and the final Cauchy distribution. We suspect the conservative confidence region produced by NF might originate from the successive Jacobians involved in the NF mapping taking large values. In comparison, our NCP method still returns some reasonable results. Although the NCP coverage might appear underestimated for larger x, actual mean coverages computed on a test set of 200 samples are 88% for NCP, 99% for NF and 79% for CCP. Tab. 5 in Appendix C.2 provides a comparison study on real data for learning a confidence region with NCP, NF and a split conformal predictor featuring a Random Forest regressor (RFSCP). High-dimensional synthetic experiment We simulated the following d-distribution for different values of d {100, 500, 1000}. Let x = ( x1, x2, 0, . . . , 0) Rd where x = ( x1, x2) admits uniform distribution on the 2-dimensional unit sphere. 
We pick a random mapping A Od and we set X = A x and the angle θ(X) = arcsin( x2). Next we consider two conditional distribution models for Y |X (Gaussian and discrete) described in Figure 3. NCP performs similarly to NF in the Gaussian case and outperforms NF for discrete distribution. Figure 7 in Appendix C.3 demonstrates that NCP scales effectively with increasing dimensionality d. As the dimension rises from d = 100 Figure 2: Protein folding dynamics. Pairwise Euclidean distances between Chignolin atoms exhibit increased variance during folded metastable states (between 87-88µs and around 89.5µs). Ground truth is depicted in blue, predicted mean in orange, and the grey lines indicate the estimated 10% lower and upper quantiles. to d = 1000, the computation time increases by only 20%, while maintaining strong statistical performance throughout. 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 theta = 3.14 theta = 5 theta = 4 theta = 1.57 theta = 2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 theta = 3.14 theta = 5 theta = 4 theta = 1.57 theta = 2 Figure 3: High-dimensional synthetic experiment. We consider two models for Y |X with d = 100. Left: Y |X N(θ(X), sin(θ(X)/2). Right: Y {1, 2, 3, 4, 5} admits discrete distribution depending on θ(X): Y |X P1 if θ(X) [0, π/2), P2 if θ(X) [π/2, π), P3 if θ(X) [π, 3π/2), P4 if θ(X) [3π/2, 2π). We take P1 = (1/5, 1/5, 1/5, 1/5, 1/5), P2 = (1/2, 1/2, 0, 0, 0), P3 = (0, 0, 1, 0, 0), P4 = (0, 0, 0, 1/2, 1/2). High-dimensional experiment in molecular dynamics We investigate protein folding dynamics and predict conditional transition probabilities between metastable states. Figure 2 shows how, by integrating our NCP approach with a graph neural network (GNN), we achieve accurate state forecasting and strong uncertainty quantification, enabling efficient tracking of transitions. For further context and a full model description, see App C.3. 7 Conclusion We introduced NCP, a novel neural operator approach to learn the conditional probability distribution from complex and highly nonlinear data. NCP offers a number of benefits. Notably, it streamlines the training process by requiring just one unconditional training phase to learn the joint distribution p(x, y). Subsequently, it allows us to efficiently derive conditional probabilities and other relevant statistics from the trained model analytically, without any additional conditional training steps or Monte Carlo sampling. Additionally, our method is backed by theoretical non-asymptotic guarantees ensuring the soundness of our training method and the accuracy of the obtained conditional statistics. Our experiments on learning conditional densities and confidence regions demonstrate our approach s superiority or equivalence to leading methods, even using a simple Multi-Layer Perceptron (MLP) with two hidden layers and GELU activations. This highlights the effectiveness of a minimalistic architecture coupled with a theoretically grounded loss function. While complex architectures often dominate advanced machine learning, our results show that simplicity can achieve competitive results without compromising performance. Our numerical experiments suggest that, while our approach works well across different datasets and models, the price we pay for this generality appears to be the need for a relatively large sample size (n 104) to start outperforming other methods. 
Hence, a future direction is to study how to incorporate prior knowledge into our method to make it more data-efficient. Future works will also investigate the performance of NCP for multi-dimensional time series, causality and more general sensitivity analysis in uncertainty quantification. Acknowledgements We acknowledge financial support from EU Project ELIAS under grant agreement No. 101120237, by Next Generation EU and MUR PNRR project PE0000013 CUP J53C22003010006 Future Artificial Intelligence Research (FAIR) and by Next Generation EU and MUR PNRR project RAISE Robotics and AI for Socio-economic Empowerment (ECS00000035). Ambrogioni, L., G uc l u, U., van Gerven, M. A. J., and Maris, E. (2017). The kernel mixture network: A nonparametric method for conditional density estimation of continuous random variables. Andrew, G., Arora, R., Bilmes, J., and Livescu, K. (2013). Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247 1255. Bercu, B., Delyon, B., and Rio, E. (2015). Concentration Inequalities for Sums and Martingales. Springer Briefs in Mathematics. Springer. Bertin, K., Lacour, C., and Rivoirard, V. (2014). Adaptive pointwise estimation of conditional density function. ar Xiv:1312.7402. Bishop, C. M. (1994). Mixture density networks. Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331 368. Chanussot, L., Das, A., Goyal, S., Lavril, T., Shuaibi, M., Riviere, M., Tran, K., Heras-Domingo, J., Ho, C., Hu, W., Palizhati, A., Sriram, A., Wood, B., Yoon, J., Parikh, D., Zitnick, C. L., and Ulissi, Z. (2021). Open catalyst 2020 (OC20) dataset and community challenges. ACS Catalysis, 11(10):6059 6072. Chernozhukov, V., W uthrich, K., and Zhu, Y. (2021). Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118. Devroye, L., Gy orfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer New York. Dhariwal, P. and Nichol, A. (2021). Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pages 8780 8794. Dinh, L., Krueger, D., and Bengio, Y. (2014). Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516. Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural spline flows. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. Freeman, P. E., Izbicki, R., and Lee, A. B. (2017). A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting. Monthly Notices of the Royal Astronomical Society, 468(4):4556 4565. Gao, Z. and Hastie, T. (2022). Lincde: conditional density estimation via lindsey s method. Journal of Machine Learning Research, 23(52):1 55. Gibbs, I., Cherian, J. J., and Cand es, E. J. (2023). Conformal prediction with conditional guarantees. ar Xiv:2305.12616. Goldenshluger, A. and Lepski, O. (2011). Bandwidth selection in kernel density estimation: Oracle inequalities and adaptive minimax optimality. Annals of Statistics, 39(3):1608 1632. Hall, P., Wolff, R. C., and Yao, Q. (1999). Methods for estimating a conditional distribution function. Journal of the American Statistical Association, 94(445):154 163. Hao Chen, J. Z., Wei, C., Gaidon, A., and Ma, T. (2022). 
Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Advances in Neural Information Processing Systems, volume 34, pages 5000 5011. Harrington, L. J. (2017). Investigating differences between event-as-class and probability densitybased attribution statements with emerging climate change. Climatic Change, 141:641 654. Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840 6851. Izbicki, R. and Lee, A. B. (2017). Converting high-dimensional regression to high-dimensional conditional density estimation. Electronic Journal of Statistics, 11(2):2800 2831. Izbicki, R., Lee, A. B., and Freeman, P. E. (2017). Photo-z estimation: An example of nonparametric conditional density estimation under selection bias. The Annals of Applied Statistics, 11(2):698 724. Koltchinskii, V. and Lounici, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, pages 110 133. Kostic, V., Novelli, P., Grazzi, R., Lounici, K., and Pontil, M. (2024). Learning invariant representations of time-homogeneous stochastic dynamical systems. In International Conference on Learning Representations (ICLR). Langren e, N. and Warin, X. (2020). Fast multivariate empirical cumulative distribution function with connection to kernel density estimation. ar Xiv:2005.03246. Lei, J. and Wasserman, L. (2014). Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71 96. Li, Q. and Racine, J. S. (2006). Nonparametric Econometrics: Theory and Practice, volume 1 of Economics Books. Princeton University Press. Li, Y., Liu, Y., and Zhu, J. (2007). Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255 268. Lu, Y. and Huang, B. (2020). Structured output learning with conditional generative flows. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5005 5012. Markowitz, H. M. (1958). Portfolio selection: Efficient diversification of investments. Yale University Press, 23. Mendil, M., Mossina, L., and Vigouroux, D. (2023). Puncc: a python library for predictive uncertainty calibration and conformalization. In Conformal and Probabilistic Prediction with Applications, pages 582 601. PMLR. Minsker, S. (2017). On some extensions of bernstein s inequality for self-adjoint operators. Statistics & Probability Letters, 127:111 119. Nagler, T. and Czado, C. (2016). Evading the curse of dimensionality in nonparametric density estimation with simplified vine copulas. Journal of Multivariate Analysis, 151:69 89. Papamakarios, G., Pavlakou, T., and Murray, I. (2017). Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30. Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065 1076. Pospisil, T. and Lee, A. B. (2018). Rfcde: Random forests for conditional density estimation. Ray, E. L., Sakrejda, K., Lauer, S. A., Johansson, M. A., and Reich, N. G. (2017). Infectious disease prediction with kernel conditional density estimation. Statistical Medicine, 36(30):4908 4929. Rezende, D. and Mohamed, S. (2015a). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1530 1538. Rezende, D. and Mohamed, S. (2015b). 
Variational inference with normalizing flows. In International conference on machine learning, pages 1530 1538. PMLR. Rigollet, P. and Tsybakov, A. (2007). Linear and convex aggregation of density estimators. Mathematical Methods od Statistics, 16:260 280. Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234 241. Springer. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832 837. Rothfuss, J., Ferreira, F., Walther, S., and Ulrich, M. (2019). Conditional density estimation with neural networks: Best practices and benchmarks. ar Xiv preprint ar Xiv:1903.00954. Sch utt, K. T., Hessmann, S. S. P., Gebauer, N. W. A., Lederer, J., and Gastegger, M. (2023). Sch Net Pack 2.0: A neural network toolbox for atomistic machine learning. The Journal of Chemical Physics, 158(14):144801. Sch utt, K. T., Kessel, P., Gastegger, M., Nicoli, K. A., Tkatchenko, A., and M uller, K.-R. (2019). Sch Net Pack: A Deep Learning Toolbox For Atomistic Systems. Journal of Chemical Theory and Computation, 15(1):448 455. Scott, D. W. (1991). Feasibility of multivariate density estimates. Biometrika, 78(1):197 205. Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26 of Monographs on Statistics & Applied Probability. Chapman and Hall. Silverman, B. W. (2017). Density Estimation for Statistics and Data Analysis. Routledge, New York. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256 2265. PMLR. Song, J., Vahdat, A., Mardani, M., and Kautz, J. (2023). Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations. Stimper, V., Liu, D., Campbell, A., Berenz, V., Ryll, L., Sch olkopf, B., and Hern andez-Lobato, J. M. (2023). normflows: A pytorch package for normalizing flows. Journal of Open Source Software, 8(86):5361. Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., and Okanohara, D. (2010). Conditional density estimation via least-squares density ratio estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 781 788. Tabak, E. G. and Vanden-Eijnden, E. (2010). Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217 233. Tsai, Y.-H. H., Zhao, H., Yamada, M., Morency, L.-P., and Salakhutdinov, R. R. (2020). Neural methods for point-wise dependency estimation. Advances in Neural Information Processing Systems, 33:62 72. Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Vershynin, R. (2011). Introduction to the non-asymptotic analysis of random matrices. ar Xiv:1011.3027. Vovk, V., Gammerman, A., and Saunders, C. (1999). Machine-learning applications of algorithmic randomness. 
In International Conference on Machine Learning, pages 444 453. Wang, Z., Luo, Y., Li, Y., Zhu, J., and Sch olkopf, B. (2022). Spectral representation learning for conditional moment models. ar Xiv preprint ar Xiv:2210.16525. Wells, L., Thurimella, K., and Bacallado, S. (2024). Regularised canonical correlation analysis: graphical lasso, biplots and beyond. ar Xiv preprint ar Xiv:2403.02979. Winkler, C., Worrall, D., Hoogeboom, E., and Welling, M. (2020). Learning likelihoods with conditional normalizing flows. ar Xiv:1912.00042. Yu, K. and Jones, M. (1998). Local linear quantile regression. Journal of the American Statistical Association, 93(441):228 237. Zhang, G., Ji, J., Zhang, Y., Yu, M., Jaakkola, T. S., and Chang, S. (2023). Towards coherent image inpainting using denoising diffusion implicit models. ar Xiv:2304.03322. Supplemental material The appendix is organized as follows: Appendix A provides additional details on the post-processing for NCP. Appendix B contains the proofs of the theoretical results and additional statistical results. In Appendix C, comprehensive details are presented regarding the experiment benchmark utilized to evaluate the performances of NCP. A Details on training and algorithms A.1 Practical guidelines for training NCP It is better to choose a larger d rather than a smaller one. Typically for the problems we considered in Section 6, we used d {100, 500}. The regularization parameter γ was found to yield the best results for γ {10 2, 10 3}. To ensure the positivity of the singular values, we transform the vector wθ with the Gaussian function x 7 exp( x2) to recover σθ during any call of the forward method. The vector wθ is initialized at random with parameters following a normal distribution of mean 0 and standard deviation 1/d. With the Re LU function, we observe instabilities in the loss function during training, whereas Tanh struggles to converge. In contrast, the use of GELU solves both problems. We can compute some statistical objects as a sanity check for the convergence of NCP training. For instance, we can ensure that the computed conditional CDFsatisfies all the conditions to be a valid CDF. After training, an additional post-processing may be applied to ensure the orthogonality of operators uθ and vθ. This whitening step is described in Alg 2 in App A.3. It leads to an improvement of statistical accuracy of the trained NCP model. See the ablation study in Tab. 3. A.2 Learning dynamics with NCP 0 500 1000 1500 2000 2500 3000 Epochs Training Validation Figure 4: Learning dynamic for the Laplace experiment in Section 6. A.3 Whitening post-processing We describe in Algorithm 2 the whitening post-processing procedure that we apply after training. Algorithm 2 Whitening procedure Require: new data (Xnew,Ynew); trained uθ, σθ and vθ Evaluate u X = uθ(Xtrain) and v Y = vθ(Ytrain) Centering: u X u X ˆE(uθ(Xtrain)) and v Y v Y ˆE(vθ(Ytrain)) u X u Xdiag(σθ) 1 2 and v Y v Y diag(σθ) 1 2 Compute covariance matrices : CX u x ux/n CY v Y v Y/n CXY u Xv Y/n U, V, σnew SVD C 1/2 X CXY C 1/2 Y if (Xnew,Ynew) is different than (Xtrain, Ytrain) then u X uθ(Xnew) ˆE(uθ(Xtrain) diag(σθ) 1 2 v Y vθ(Xnew) ˆE(vθ(Ytrain) diag(σθ) 1 2 end if Final whitening: unew X u XC 1/2 X U vnew Y v Y C 1/2 Y V return unew X , σnew, vnew Y B Proofs of theoretical results B.1 A reminder on Hilbert spaces and compact operators Definition 1. 
Given a vector space H, we say it is a Hilbert space if there exists an inner product , such that: H is complete with respect to the norm x = p x, x for all x H. An important example of an infinite-dimensional Hilbert space is L2 µ(R), the space of squareintegrable functions w.r.t probability measure µ on R with the inner product defined as f, g = R R f(x)g(x) µ(dx). Definition 2 (Bounded Operators). Let H1 and H2 be Hilbert spaces. A linear operator T : H1 H2 is called bounded if there exists a constant C 0 such that for all x H1, the following inequality holds: Tx H2 C x H1. The smallest such constant C is called the operator norm of T, denoted by T , and is given by: T = sup x =0 Bounded operators are continuous and play a key role in functional analysis. Definition 3 (Compact Operators). Let H1 and H2 be Hilbert spaces. A bounded linear operator T : H1 H2 is called compact if for any bounded sequence {xn} H1, there exists a subsequence {xnk} such that Txnk converges in H2. Compact operators can be viewed as infinite-dimensional analogues of matrices with finite rank in finite-dimensional spaces. A key result in the theory of compact operators is the existence of a singular value decomposition (SVD) for compact operators. The following is the statement of the Eckart-Young-Mirsky theorem: Theorem 3 (Eckart-Young-Mirsky). Let T : H1 H2 be a compact operator between Hilbert spaces. Then T can be decomposed as: i=1 σi , ui vi, where {ui} H1 and {vi} H2 are orthonormal sets, and σi are the singular values of T, which satisfy σ1 σ2 0. Moreover, for any rank-k operator Tk = Pk i=1 σi , ui vi, we have: T Tk = min rank(S) k T S , where is the operator norm induced by the Hilbert spaces. B.2 Proof of Lemma 1 Proof of Lemma 1. It follows from (3) and (5) that P[Y B | X A] P[Y B] 1A, [[DY |X]]d1B P[X A] = 1A, (DY |X [[DY |X]]d)1B Next, by definition of the operator norm, we have | 1A, (DY |X [[DY |X]]d)1B | DY |X [[DY |X]]d L2ν(Y) L2µ(X) 1A L2µ(X) 1B L2ν(Y) = DY |X [[DY |X]]d L2ν(Y) L2µ(X) p where the operator norm DY |X [[DY |X]]d L2ν(Y) L2µ(X) is upper bounded by σ d+1 by definition of the SVD of DY |X. B.3 Proof of Theorem 1 Proof of Theorem 1. In the following, to simplify notation, whenever dependency on the parameters is not crucial, recalling that (X, Y ) and (X , Y ) are two iid samples from the joint distribution ρ, we will denote the vector-valued random variables in the latent (embedding) space as u := uθ(X), u := uθ(X ), v := vθ(Y ) and v := vθ(Y ), as well as s = σθ and S := Sθ. Then, we can write the training loss simply as E [Lγ(u Eu, u Eu , v Ev, v Ev , S)]. Let us further denote centered features as u = u Eu and v = v Ev, and the operators based on them as U θ: Rd L2 µ(X) and V θ: Rd L2 ν(Y) by U θz := z (uθ E[uθ(X)])1X and V θz := z (vθ E[vθ(Y )])1Y, for z Rd. and prove that L0(θ) = U θSθV θ 2 HS 2 tr(SθU θDY |XV θ). Since we have that U θ = JµUθ and V θ = JνVθ, where Jµ = I 1X 1X and Jν = I 1Y 1Y are orthogonal projectors in L2 µ(X) and L2 ν(Y), respectively, as well as U θDY |XV θ = U θ JµDY |XJνVθ = U θ JµEY |XJνVθ, consequently U θDY |XV θ = U θ EY |XVθ U θ 1X (V θ 1Y) = E[uθ(X)E[vθ(Y ) | X]] (E[uθ(X)])(E[vθ(Y )]) , that is U θDY |XV θ = E[uv ] E[u]]E[v] is simply centered cross-covariance in the embedding space. Recalling that U θ Uθ = E[uu ] and V θ Vθ = E[vv ] are covariance matrices in the embedding space, similarly we get that U θU θ = E[(u Eu)(u Eu) ] and V θV θ = E[(v Ev)(v Ev) ] are centered covariances. 
Thus, we obtain L0(θ) = 2 tr E[(S1/2u)(S1/2v) ] + tr(E[(S1/2u)(S1/2u) ]E[(S1/2v)(S1/2v) ]) = 2E[u Sv] + tr(E[(S1/2u)(S1/2u) ]E[(S1/2v)(S1/2v) ]) which, by taking (X, Y ) and (X , Y ) to be iid random variables drawn from ρ, gives that L0(θ) can be written as E h u Sv u Sv + u Sv + u Sv + 1 2 tr S1/2uu Sv v S1/2 + S1/2u u Svv S1/2 i u Sv 2 (u u )S(v v ) i = E [L0(u, u , v, v , s)] = E [L0(uθ(X) Euθ(X), uθ(X ) Euθ(X ), vθ(Y ) Euθ(Y ), vθ(Y ) Euθ(Y ), σθ)]. which implies that L0(θ) = U θSθV θ 2 HS 2 tr(SθU θDY |XV θ). Moreover, to show that Lγ(θ) = L0(θ) + γ R(θ). It suffices to note that U θ Uθ I 2 F = tr((U θ Uθ I)2) = tr((U θ Uθ)2 2U θ Uθ + I) = = tr(E[uu ]E[u u ] E[uu ] E[u u ]+I) = E[tr(uu u u uu u u +I)] = E (u u )2 u 2 u 2 + d, as well as that U θ 1µ 2 = Eu 2 = (Eu) (Eu) = Eu u , and apply the analogous reasoning for random variable Y ν. Now, given r > d + 1, let us denote Dr := P i [r] σ i u i v i and Lr 0(θ) := Dr U θSθV θ 2 HS Dr 2 HS . (23) Then, applying the Eckhart-Young-Mirsky theorem, we obtain that Lr 0(θ) Pr i=d+1σ 2 i P i [r]σ 2 i = P i [d]σ 2 i , with equality holding whenever (σθ i , uθ i , vθ i ) = (σ i , u i , v i ), ρ-almost everywhere. To prove that the same holds for L0(θ), observe that after expanding the HS norm via trace in (23), we have that Lr 0(θ) = 2 tr S1/2 θ U θDr V θS1/2 θ + UθSθV θ 2 HS , and, consequently, Lr 0(θ) = U θSθV θ 2 HS 2 tr(S1/2 θ U θDr V θS1/2 θ ) = L0(θ) + 2 tr(SθU θ(DY |X Dr)V θ). Thus, using Cauchy-Schwartz inequality, we obtain |Lr 0(θ) L0(θ)| |tr(SθU θ(DY |X Dr)V θ)| Sθ U θ HS DY |X [[DY |X]]r V θ HS, and, therefore, |Lr 0(θ) L0(θ)| σ r+1 p tr(U θ Uθ) tr(V θ Vθ) Mdσ r+1, where the constant is given by M := maxi [d]{ uθ i Euθ i L2µ(X), vθ i Evθ i L2ν(Y)} < . So, Lr 0(θ) Mdσ r+1 L0(θ) Lr 0(θ) + Mdσ r+1, and, since r > d + 1 was arbitrary, we can take r arbitrary large to obtain σ r 0 and conclude that L0(θ) P i [d] σ 2 i , with equality holding when (σθ i , uθ i , vθ i ) = (σ i , u i , v i ), ρ-almost everywhere, since then U θDY |XV θ = U θ DY |XVθ = U θ UθSθ = Sθ = diag(σ1, . . . , σd). Finally, we prove that γ > 0 and σ d > σ d+1 assure uniqueness of the global optimum. First, if the global minimum is achieved σ d > σ d+1 allows one to use uniqueness result in the Eckhart-Young Mirsky theorem that states that P i [d] σ i u i bvθ i = P i [d] σθ i buθ i bvθ i . But since, γ > 0 implies that R(θ) = 0, i.e. (uθ i )i [d] L2 µ(X) and (vθ i )i [d] L2 ν(Y) are two orthonormal systems in the corresponding orthogonal complements of constant functions, using the uniqueness of SVD, the proof is completed. B.4 Proof of Theorem 2 Proof of Theorem 2. Let us denote the operators arising from centered and empirically centered features as U θ, b Uθ : Rd L2 µ(X) and b Vθ, V θ : Rd L2 ν(Y) by U θz := z (uθ E[uθ(X)])1X , V θz := z (vθ E[vθ(Y )])1Y and b Uθz := z buθ, b Vθz := z bvθ, respectively, for z Rd. We first bound the error of the conditional expectation model as DY |X b Dθ Y |X as follows. DY |X b Dθ Y |X = DY |X [[DY |X]]d U θ SθVθ U θSθV θ b U θ Sθ b Vθ σ d+1 + Eθ + U θ SθVθ U θSθV θ + U θSθV θ b U θ Sθ b Vθ . Next, using that Sθ 1 and that centered covariances are bounded by uncentered ones, i.e. 
U θU θ U θ Uθ, we have U θ SθVθ U θSθV θ = U θ SθVθ U θSθVθ U θSθV θ Uθ U θ Vθ + U θ Vθ V θ U θ 1X V θ Vθ 1/2 + V θ 1Y U θ Uθ 1/2 2Eθ p In a similar way, we obtain U θSθV θ b U θ Sθ b Vθ = U θSθV θ b U θ SθV θ b U θ Sθ b Vθ U θ b Uθ V θ + b Uθ V θ b Vθ U θ b Uθ V θ + V θ b Vθ U θ + U θ b Uθ V θ b Vθ 1 + Eθ b Ex[uθ] E[uθ(X)] + b Ey[vθ] E[vθ(Y )] + b Ex[uθ] E[uθ(X)] b Ey[vθ] E[vθ(Y )] 1 + Eθεn(δ) + [εn(δ)]2. where b Ex[uθ] E[uθ(X)] εn(δ) and b Ey[vθ] E[vθ(Y )] εn(δ) hold w.p.a.l. 1 δ in view of Lemma 2. To summarize, it holds w.p.a.l. 1 δ DY |X b Dθ Y |X σ d+1 + Eθ + 2 p 1 + Eθ(Eθ + εn(δ)) + [εn(δ)]2 =: ψn(δ). (24) By definition in (5) and (14), we have P[Y B | X A] bpθ(B | A) = E[1B(Y )] b Ey[1B] + 1A, DY |X1B E[1A(X)] 1A, b Dθ Y |X1B b Ex[1A] , and 1A, DY |X1B E[1A(X)] = 1A, (DY |X b Dθ Y |X)1B E[1A(X)] + 1A, b Dθ Y |X1B b Ex[1A] b Ex[1A] E[1A(X)]. Note also that 1A(X) L2µ(X) = p E[1A(X)] = p P[X A], 1B(Y ) L2ν(Y) = p E[1B(Y )] = p P[Y B], for any A ΣX and B ΣY and | 1A, (DY |X b Dθ Y |X)1B | DY |X b Dθ Y |X 1A(X) L2 µ(X) 1B(Y ) L2 ν(Y). Combining the previous observations, we get |P[Y B | X A] bpθ(B | A)| |b Ey[1B] E[1B(Y )]| E[1B(Y )] + DY |X b Dθ Y |X p E[1A(X)]E[1B(Y )] + | 1A, b Dθ Y |X1B | b Ex[1A] |b Ex[1A] E[1A(X)]| E[1A(X)] , (25) | 1A, b Dθ Y |X1B | b Ex[1A] E[1A(X)] | 1A, DY |X1B | E[1A(X)] + | 1A, (DY |X b Dθ Y |X)1B | DY |X + DY |X b Dθ Y |X s E[1B(Y )] E[1A(X)] E[1B(Y )] E[1A(X)] 1 + DY |X b Dθ Y |X , (26) where we have used that DY |X 1. Similarly, we have P[Y B | X A] P[Y B] bpθ(B | A) = 1A, DY |X1B E[1A(X)]E[1B(Y )] 1A, b Dθ Y |X1B b Ex[1A]b Ey[1B] = 1A, (DY |X b Dθ Y |X)1B b Ex[1A]b Ey[1B] + 1A, b Dθ Y |X1B 1 E[1A(X)]E[1B(Y )] 1 b Ex[1A]b Ey[1B] = 1A, (DY |X b Dθ Y |X)1B b Ex[1A]b Ey[1B] + 1A, b Dθ Y |X1B E[1A(X)]E[1B(Y )] (b Ex[1A] E[1A(X)])b Ey[1B] + E[1A(X)](b Ey[1B] E[1B(Y )]) b Ex[1A]b Ey[1B] = 1A, (DY |X b Dθ Y |X)1B b Ex[1A]b Ey[1B] + 1A, b Dθ Y |X1B E[1A(X)]E[1B(Y )] b Ex[1A] E[1A(X)] b Ex[1A] + E[1A(X)](b Ey[1B] E[1B(Y )]) b Ex[1A]b Ey[1B] Next Lemmas 3 and 4 combined with (18) and elementary algebra give w.p.a.l. 1 2δ that b Ex[1A] E[1A(X)] 2φX(A)ϵn(δ), b Ey[1B] E[1B(Y )] 2φY (B)ϵn(δ), b Ex[1A] E[1B(Y )] b Ey[1B] 2, E[1A(X)](b Ey[1B] E[1B(Y )]) b Ex[1A]b Ey[1B] 4φY (B)ϵn(δ). It also holds on the same probability event as above that 1A, (DY |X b Dθ Y |X)1B b Ex[1A]b Ey[1B] DY |X b Dθ Y |X p E[1A(X)]E[1B(Y )] b Ey[1B] 4 DY |X b Dθ Y |X p E[1A(X)]E[1B(Y )] . Combining Lemma 2 and (24), we get with probability at least 1 δ that DY |X b Dθ Y |X ψn(δ). By a union bound combining the last two displays with (24), (27), (25) and (26), we get with probability at least 1 3δ P[Y B | X A] P[Y B] bpθ(B | A) 4ψn(δ) + [1 + ψn(δ)] [2φX(A) + 4φY (B)] ϵn(δ) p E[1A(X)]E[1B(Y )] , (28) and P[Y B | X A] bpθ(B | A) φY (B)ϵn(δ) + 2(1 + ψn(δ))φX(A)ϵn(δ) + ψn(δ) p E[1A(X)]E[1B(Y )] . (29) Replacing δ by δ/3, we get the result w.p.a.l. 1 δ. The following result will be useful to investigate the theoretical properties of the NCP method in the iid setting. Lemma 2. Let Assumption 1 be satisfied. Assume in addition that n c2 ud. Then there exists an absolute constant C > 0 such that, for any δ (0, 1), it holds w.p.a.l. 1 δ b Ex[uθ] E[uθ(X)] C cu Similarly, if n c2 vd, w.p.a.l. 1 δ b Ey[vθ] E[vθ(Y )] Ccv Proof of Lemma 2. We note that b Ex[uθ] E[uθ(X)] = 1 i=1 Zi with Zi = uθ(Xi) Euθ(Xi), i [n]. We note that Zi 2cu d =: U and Var(Zi) = Var( uθ(Xi) E[uθ(Xi)] ) = σ2 θ(X) for any i [n]. 
We apply Minsker (2017, Corollary 4.1) to get for any t 1 U 2 + 36nσ2 θ(X)), 28 exp t2/2 nσ2 θ(X) + t U/3 Replacing t by nt and some elementary algebra give for any t 1 n2 + 36 σ2 θ(X) w.p.a.l. 1 28 exp ( t), 3 t n + 2σθ(X) Replacing t by t + c, we get for any t 0, w.p.a.l. 1 28 exp ( t + c), Up to a rescaling of the constants, there exists a numerical constant C > 0 such that for any δ (0, 1), w.p.a.l. 1 δ n c + σθ(X) Elementary computations give the following bound, that is, there exists a numerical constant C > 0 such that for any t > 0, w.p.a.l. 1 exp( t) d n c2 u d n2 σ3/2 θ (X) n3/4 σ2 θ(X) d t n + σθ(X) Under Assumption 1 and the condition c2 u d n 1, it also holds that σ2 θ(X) n 1 since σ2 θ(X) c2 ud. Consequently, the bound simplifies and we obtain for any t > 1, w.p.a.l. 1 exp( t) i=1 Zi C cu where C > 0 is possibly a different absolute constant from the previous bound. Taking t = log eδ 1 for any δ (0, 1) gives the first result. We proceed similarly to get the second result. Control on empirical probabilities We derive now a concentration result for empirical probabilities. Lemma 3. For any A ΣX and any δ (0, 1), it holds w.p.a.l. 1 δ |b Ex[1A] E[1A(X)]| 2log 2δ 1 P[X A](1 P[X A]) Assume in addition that P(X A) 2 q n . Then it holds w.p.a.l. 1 δ |b Ex[1A] E[1A(X)]| Proof. We note that b Ex[1A(X)] E[1A(X)] = 1 i=1 Zi with Zi = 1A(Xi) E[1A(Xi)], i [n]. We note that |Zi| 2 and Var(Zi) = P[X A](1 P[X A]). Then Bercu et al. (2015, Theorem 2.9) gives w.p.a.l. 1 2δ |b Ex[1A] E[1A(X)]| 2log δ 1 P[X A](1 P[X A]) Dividing by E[1A(X)] gives w.p.a.l. 1 2δ |b Ex[1A] E[1A(X)]| [2 log(δ 1)/n] (1 P[X A]) Replacing δ by δ/2 gives the result for X. The result for Y follows from a similar reasoning. The same proof argument gives an identical result for Y . Lemma 4. For any B ΣY and any δ (0, 1), it holds w.p.a.l. 1 δ |b Ey[1B] E[1B(Y )]| 2log 2δ 1 P[Y B](1 P[Y B]) Assume in addition that P(Y B) 2 q n . Then it holds w.p.a.l. 1 δ |b Ey[1B] E[1B(Y )]| B.5 Sub-Gaussian case Sub-Gaussian setting. We derive another concentration result under a less restricted sub-Gaussian condition on functions uθ and vθ. This result relies on Pinelis and Sakhanenko s inequality for random variables in a separable Hilbert space, see (Caponnetto and De Vito, 2007, Proposition 2). Let ψ2(x) = ex2 1, x 0. We define the ψ2-Orlicz norm of a random variable η as η ψ2 := inf C > 0 : E ψ2 We recall the definition of a sub-Gaussian random vector. Definition 4 (Sub-Gaussian random vector). A centered random vector X Rd will be called sub-Gaussian iff, for all u Rd, X, u ψ2 X, u L2(P). Proposition 1. Caponnetto and De Vito (2007, Proposition 2) Let Ai, i [n] be i.i.d copies of a random variable A in a separable Hilbert space with norm . If there exist constants L > 0 and σ > 0 such that for every m 2, E A m 1 2m!Lm 2σ2, then with probability at least 1 δ i [n] Ai EA Lemma 5 ((Sub-Gaussian random variable) Lemma 5.5. in Vershynin (2011)). Let Z be a random variable. Then, the following assertions are equivalent with parameters Ki > 0 differing from each other by at most an absolute constant factor. 1. Tails: P{|Z| > t} exp(1 t2/K2 1) for all t 0; 2. Moments: (E|Z|p)1/p K2 p for all p 1; 3. Super-exponential moment: E exp(Z2/K2 3) 2. A random variable Z satisfying any of the above assertions is called a sub-Gaussian random variable. We will denote by K3 the sub-Gaussian norm. Consequently, a sub-Gaussian random variable satisfies the following equivalence of moments property. 
There exists an absolute constant c > 0 such that for any m 2, E|Z|m 1/m c K3 m E|Z|2 1/2. Lemma 6. Assume that uθ(X) E[uθ(X)] and vθ(Y ) E[vθ(Y )] are sub-Gaussian with sub-Gaussian norm K. We set σ2 θ(X) := Var( uθ(X) E[uθ(X)] ), σ2 θ(Y ) := Var( vθ(Y ) E[vθ(Y )] ). Then there exists an absolute constant C > 0 such that for any δ (0, 1), it holds w.p.a.l. 1 δ b Ex[uθ] E[uθ(X)] C n σ2 θ(X) + K2 n log(2δ 1). Similarly, w.p.a.l. 1 δ b Ey[vθ] E[vθ(Y )] C n σ2 θ(Y ) + K2 n log(2δ 1) Proof. Set Z := uθ(X) Euθ(X) and we recall that σ2 θ(X) := Var( uθ(X) E[uθ(X)] ). We check that the moment condition, 2m!Lm 2σ2 θ(X)2, m 2, for some constant L > 0 to be specified. The condition is obviously satisfied for m = 2. Next for any m 3, the Cauchy-Schwarz inequality and the equivalence of moment property give EZm EZ2(m 2) 1/2 EZ4 1/2 4K2 3σ2 θ(X)2 EZ2(m 2) 1/2 . Next, by homogeneity, rescaling Z to Z/K1 we can assume that K1 = 1 in Lemma 5. We recall that if Z is in addition non-negative random variable, then for every integer p 1, we have 0 P{Z t} ptp 1 dt Z 0 e1 t2ptp 1 dt = ep With p = 2(m 2), we get that EZp e(m 2)Γ m 2 = e(m 2)! = em!/2. Using again Lemma 5, we can take L = c K for some large enough absolute constant c > 0. Then Proposition 1 gives the result. B.6 Estimation of conditional expectation and Conditional covariance We now derive guarantees for the estimation of the conditional expectation and the conditional covariance for vector-valued output Y Rdy. We start with a general result for arbitrary vector-valued functions of Y . We consider a vector-valued function h = (h1, . . . , hd) where hj L2 ν(Y) for any j [d]. We introduce the space of square integrable vector-valued functions [L2 ν(Y, Rd)] equipped with the norm j [d] hj 2 L2ν(Y). Next we can define the conditional expectation of h(Y ) = (h1(Y (1)), . . . hd(Y (dy))) conditionally on X A as follows E[h(Y ) | X A] = E[h1(Y )] + 1A, DY |Xh1 P(X A) , . . . , E[hd(Y )] + 1A, DY |Xhd = E[h(Y )] + 1A, [1d DY |X]h We define similarly its empirical version as b Eθ[h(Y ) | X A] = b Ey[h1] + 1A, b Dθ Y |Xh1 b Ex[1A] , . . . , b Ey[hd] + 1A, b Dθ Y |Xhd b Ex[1A] = b Ey[h] + 1A, [1d b Dθ Y |X]h b Ex[1A] . Assuming that h(Y ) is sub-Gaussian, we set K := h(Y ) E[h(Y )] ψ2, σ2(h(Y )) := Var( h(Y ) E[h(Y )] ). Define ψn(δ) := 1 n σ2(h(Y )) + K2 n log(3δ 1) ψn(δ/3) + 2(1 + ψn(δ/3))φX(A)ϵn(δ/3) . Theorem 4. Let the assumptions of Theorem 2 be satisfied. Assume in addition that h(Y ) is sub-Gaussian. Then we have w.p.a.l. 1 δ that b Eθ[h(Y ) | X A] E[h(Y ) | X A] ψn(δ). (32) Proof. We have E[h(Y ) | X A] b Eθ[h(Y )|X A] E[h(Y )] b Ey[h] + DY |X b Dθ Y |X p P(X A) h +| 1A, [1d b Dθ Y |X]h | 1 b Ex[1A] 1 P(X A) E[h(Y )] b Ey[h] + DY |X b Dθ Y |X p P(X A)( DY |X + DY |X b Dθ Y |X ) h P(X A) b Ex[1A] b Ex[1A]P(X A) Recall that DY |X 1, (24) and Lemma 3. Hence, a union bound we get with w.p.a.l. 1 2δ that E[h(Y ) | X A] b Eθ[h(Y )|X A] E[h(Y )] b Ey[h] + h p ψn(δ)+2(1+ψn(δ))φX(A)ϵn(δ) . (34) We now handle the first term E[h(Y )] b Ey[h] . We recall that a similar quantity was already studied in Lemma 6. We can just replace uθ(X) by h(Y ) Rd to get the result since we assumed that h(Y ) is sub-Gaussian. Hence there exists an absolute constant C > 0 such that w.p.a.l. 1 δ E[h(Y )] b Ey[h] C n σ2(h(Y )) + K2 n log(2δ 1). Actually, we can handle the conditional expectation E[Y | X A] in a more direct way. Set _ log(δ 1dy) Corollary 2. Let the Assumptions of Theorem 2 be satisfied. Assume in addition that Y is a sub-Gaussian vector. 
Then for any δ (0, 1), it holds with probability at least 1 δ that E[Y | X A] b Eθ[Y |X A] p tr(Cov(Y ))ϵn(δ/3) ψn(δ/3) + 2(1 + ψn(δ/3))φX(A)ϵn(δ/3) =: ψ(1) n (δ). (35) Proof. The proof of this result is identical to that of Theorem 4 up to (34). Now if we specify h(Y ) = Y Rdy. Then, applying Bernstein s inequality on each of the dy components of E[Y ] Y n and a union bound, we get w.p.a.l. 1 δ E[Y ] Y n p tr(Cov(Y )) n + max j [dy] Y (j) ψ2 log(δ 1dy) Using again Definition 4, we obtain maxj [dy] Y (j) ψ2 p tr(Cov(Y )). It follows from the last two displays, w.p.a.l. 1 δ E[Y ] Y n p tr(Cov(Y ))ϵn(δ). (36) A union bound combining the previous display with (34) gives the first result. We focus now on the conditional covariance estimation problem. We first define the conditional covariance as follows: Cov(Y |X A) = Cov(Y ) + 1A, [(1dy 1dy) DY |X]h h /P[X A] 1A, [1dy DY |X]h 1A, [1dy DY |X]h /(P[X A])2. (37) Note that 1A, [(1dy 1dy) DY |X]h h = ( 1A, DY |Xhjhk )j,k [dy] is a dy dy matrix. We obtain a similar decomposition for the estimator d Cov θ(Y |X A) of the conditional covariance Cov(Y |X A) by replacing DY |X by b Dθ Y |X: d Cov θ(Y |X A) := d Cov(Y ) + 1A, [(1dy 1dy) b Dθ Y |X]h h /b Ex[1A] 1A, [1dy b Dθ Y |X]h 1A, [1dy b Dθ Y |X]h /(b Ex[1A])2. (38) We define the effective of covariance matrix Cov(Y ) as follows: r(Cov(Y )) := tr(Cov(Y )) We set for any δ (0, 1) ϵ(2) n (δ) := Cov(Y ) n + r(Cov(Y )) n + log(δ 1) ψ(2) n (δ) = ϵ(2) n (δ) + [ψn(δ/4) + 2(1 + ψn(δ/4))φX(A)ϵn(δ/4)] (E[ Y 2])2 + ψ(1) n (δ/4) 2 E[Y | X A] + ψ(1) n (δ/4) . Corollary 3. Let the assumptions of Corollary 2 be satisfied. Then for any δ (0, 1), it holds with probability at least 1 δ that d Cov θ(Y |X A) Cov(Y |X A) ψ(2) n (δ). (40) Proof. We use again the function h(Y ) = Y . We note in view of (37)-(38) that d Cov θ(Y |X A) Cov(Y |X A) d Cov(Y ) Cov(Y ) DY |X P[X A] b Dθ Y |X b Ex[1A] + E[h(Y ) | X A] E[h(Y ) | X A] b Eθ[h(Y ) | X A] b Eθ[h(Y ) | X A] , (41) Next, we note that DY |X P[X A] b Dθ Y |X b Ex[1A] DY |X P[X A] b Dθ Y |X b Ex[1A] P[X A] DY |X P[X A] b Dθ Y |X b Ex[1A] X j,k [dy] Yj Yk L2ν(Y) DY |X b Dθ Y |X P[X A] + b Dθ Y |X 1 P[X A] 1 b Ex[1A] j,k [dy] Yj Yk L2ν(Y) Remind that Y is a sub-Gaussian vector. Using the equivalence of moments property of sub-Gaussian vector, we get that Yj Yk L2ν(Y) q E[Y 4 j ]E[Y 4 k ] E[Y 2 j ]E[Y 2 k ], j, k [dy]. By a union bound combining the last two displays with (24) and Lemma 3, we get w.p.a.l. 1 2δ DY |X P[X A] b Dθ Y |X b Ex[1A] [ψn(δ) + 2(1 + ψn(δ))φX(A)ϵn(δ)] (E[ Y 2])2 P[X A] . (42) Next, we set u = E[h(Y ) | X A] and ˆu = b Eθ[h(Y ) | X A]. Then we have u u ˆu ˆu u ˆu ( u + ˆu ) u ˆu (2 u + ˆu u ). We apply next Corollary 2 to get w.p.a.l. 1 δ u u ˆu ˆu ψ(1) n (δ) 2 E[Y | X A] + ψ(1) n (δ) . (43) Next Koltchinskii and Lounici (2017, Theorem 4) guarantees that w.p.a.l 1 δ d Cov(Y ) Cov(Y ) ϵ(2) n (δ), (44) where ϵ(2) n (δ) is defined in (39). A union bound combining (41), (42), (43) and (44) gives the result. C Numerical Experiments Experiments were conducted on a high-performance computing cluster equipped with an Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz Sky Lake CPU, 377GB RAM, and an NVIDIA Tesla V100 16Gb GPU. Code is available at https://github.com/CSML-IIT-UCL/NCP. C.1 Conditional Density Estimation To evaluate our method s ability to estimate conditional densities, we tested NCP on six different data models (described in the following paragraph) and compared its performance with ten other methods (detailed in Tab. 2). 
We assessed the methods' performance using the KS distance between the estimated conditional CDF and the true CDF. Additionally, we explored how the performance of each method scales with the number of training samples, ranging from 10² to 10⁵, with a validation set of 10³ samples. We tested each method on nineteen different conditioning values uniformly sampled between the 5% and 95% percentiles of p(x). Conditional CDFs were estimated on a grid of 1000 points uniformly distributed over the support of Y. The KS distance between each pair of CDFs was averaged over all the conditioning values. In Tab. 4, we present the mean performance (KS distance ± standard deviation), computed over 10 repetitions, each with a different random seed.

Synthetic data models. We included the following synthetic datasets from Rothfuss et al. (2019) and Gao and Hastie (2022) into our benchmark: Linear Gaussian, a simple univariate linear density model defined as Y = X + N(0, 0.1), where X ∼ Unif(−1, 1). Econ Density, an economically inspired heteroscedastic density model with a quadratic dependence on the conditioning variable, defined as Y = X² + ϵ_Y, ϵ_Y ∼ N(0, 1 + X), where X ∼ |N(0, 1)|. Arma Jump, a first-order autoregressive model with a jump component exhibiting negative skewness and excess kurtosis, defined as x_t = [c(1 − α) + α x_{t−1}] + (1 − z_t) ϵ_t + z_t [−3c + 2ϵ_t], where ϵ_t ∼ N(0, 0.05) and z_t ∼ B(1, p) denote a Gaussian shock and a Bernoulli-distributed jump indicator with probability p, respectively. The parameters were left at their default values. Gaussian Mixture, a bivariate Gaussian mixture model with 5 kernels, where the goal is to estimate the conditional density of one variable given the other. The mixture model is defined as p(X, Y) = Σ_{k=1}^{5} π_k N(µ_k, Σ_k), where π_k, µ_k, and Σ_k are the mixing coefficient, mean vector, and covariance matrix of the k-th distribution. All the parameters were randomly initialized. Skew Normal, a univariate skew normal distribution defined as Y = 2ϕ(X)ψ(αX), where ϕ(·) and ψ(·) are the standard normal probability density and cumulative distribution functions, and α is a parameter regulating the skewness. The parameters were left at their default values. Locally Gaussian or Gaussian mixture distribution (LGGMD) (Gao and Hastie, 2022), a regression dataset where the target y depends on the first three dimensions of x, with seventeen irrelevant features added to x. The features of x are all uniformly distributed between −1 and 1. The first dimension of x gives the mean of Y | X, the second determines whether the data is Gaussian or a mixture of two Gaussians, and the third gives its asymmetry. More specifically,

Y | X ∼ 0.5 N(0.25X⁽¹⁾ − 0.5, 0.25(0.25X⁽³⁾ + 0.5)²) + 0.5 N(0.25X⁽¹⁾ + 0.5, 0.25(0.25X⁽³⁾ − 0.5)²) if X⁽²⁾ ≤ 0.2,
Y | X ∼ N(0.25X⁽¹⁾ − 0.5, 0.3) if X⁽²⁾ > 0.2. (45)

To sample data from Econ Density, Arma Jump, Gaussian Mixture, and Skew Normal, we used the library Conditional Density Estimation (Rothfuss et al., 2019) available at https://github.com/freelunchtheorem/Conditional_Density_Estimation.

Training NCP. We trained an NCP model with uθ and vθ as multi-layer perceptrons (MLPs), each having two hidden layers of 64 units with the GELU activation function in between. The vector σθ has size d = 100, and γ is set to 10⁻³. Optimization was performed over 10⁴ epochs using the Adam optimizer with a learning rate of 10⁻³. Early stopping was applied based on the validation set with a patience of 1000 epochs. To ensure the positiveness of the singular values, we transform the vector σθ with the Gaussian function x ↦ exp(−x²) during any call of the forward method. Whitening was applied at the end of training.
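For concreteness, a minimal PyTorch sketch of the architecture just described is given below; the class and variable names are ours, and the NCP loss Lγ of Section 4, the data pipeline, the whitening post-treatment, and the early-stopping loop are omitted.

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Two hidden layers of 64 units with GELU activations, as used for u_theta and v_theta."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class NCPEmbeddings(nn.Module):
    """Embedding networks and singular-value vector of the NCP model (training loss not shown)."""
    def __init__(self, x_dim, y_dim, d=100):
        super().__init__()
        self.u = MLP(x_dim, d)
        self.v = MLP(y_dim, d)
        self.raw_sigma = nn.Parameter(torch.randn(d))   # unconstrained parameters

    def sigma(self):
        # Gaussian reparameterization x -> exp(-x^2) keeps the singular values positive.
        return torch.exp(-self.raw_sigma ** 2)

    def forward(self, x, y):
        return self.u(x), self.v(y), self.sigma()

model = NCPEmbeddings(x_dim=1, y_dim=1, d=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # as in the setup above; gamma = 1e-3 enters the loss
```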
Compared methods. We compared our NCP network with ten different CDE methods. See Tab. 2 for the exhaustive list of models, including a brief summary and key hyperparameters. In particular, the methods were set up as follows: NF was characterized by a 1D Gaussian base distribution and two Masked Affine Autoregressive flows (Papamakarios et al., 2017) followed by an LU Linear permutation flow. To match the NCP architecture, each flow was defined by two hidden layers with 64 units each. The training procedure was the same as for the NCP model. The model was implemented using the library normflows (Stimper et al., 2023). DDPM was characterized by a U-Net (Ronneberger et al., 2015), a noise schedule going from 10⁻⁴ to 0.02, and 400 diffusion steps, as implemented in https://github.com/TeaPearce/Conditional_Diffusion_MNIST. CKDE's kernel bandwidth was estimated according to Silverman's rule (Silverman, 1986). MDN's architecture was defined by two hidden layers with 64 units each and 20 Gaussian kernels. KMN's architecture was defined by two hidden layers with 64 units each and 50 Gaussian kernels, with kernel bandwidths estimated according to Silverman's rule (Silverman, 1986). LSCDE was defined by 500 components whose bandwidths were set to 0.5 and whose kernel centers were found via a k-means procedure. NNKDE's number of neighbors was set using the heuristic k = √n (Devroye et al., 1996). Kernel bandwidths were estimated according to Silverman's rule (Silverman, 1986). We used the implementation available at https://github.com/lee-group-cmu/NNKCDE. RFCDE was characterized by a Random Forest with 1000 trees and 31 cosine basis functions. The training was performed using the rfcde library available at https://github.com/lee-group-cmu/RFCDE. FC was trained using a Random Forest with 1000 trees as the regression method and had 31 cosine basis functions. The training was performed using the flexcode library available at https://github.com/lee-group-cmu/FlexCode. LinCDE was trained with 1000 LinCDE trees using the LinCDE.boost R function from https://github.com/ZijunGao/LinCDE. The CKDE, MDN, KMN, and LSCDE hyperparameters were set according to Rothfuss et al. (2019), and these methods were trained using the library Conditional Density Estimation available at https://github.com/freelunchtheorem/Conditional_Density_Estimation. All methods involving the training of a neural network were assigned the same number of epochs given to NCP. All other method parameters were set as prescribed in their respective papers.

Results. See Tab. 4 for the comparison of performances for n = 10⁴. See also Fig. 5. We also carried out an ablation study on the centering and whitening post-treatments for NCP in Tab. 3.

C.2 Confidence Regions

The objective of this next experiment is to estimate a confidence interval at coverage level 90% for two distribution models with different properties (Laplace and Cauchy) and one real dataset, in order to showcase the versatility of our NCP approach.
Table 2: Compared methods for the CDE problem.

Normalizing Flows (NF) (Rezende and Mohamed, 2015b): Generative models that transform a simple distribution into a complex one through a series of invertible and differentiable transformations. Main hyperparameters: architecture; flow type.
Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020): Generative models that learn to generate data by reversing a gradual noising process, modeling distributions through iterative refinement. Main hyperparameters: number of diffusion steps; noise schedule.
Conditional KDE (CKDE) (Li and Racine, 2006): Nonparametric approach modeling the joint and marginal probabilities via KDE and computing the conditional density as p(y|x) = p(x, y)/p(x). Main hyperparameters: KDE bandwidth.
Mixture Density Network (MDN) (Bishop, 1994): Uses a neural net which takes the conditioning x as input and governs all the weights of a GMM modeling p(y|x). Main hyperparameters: neural net architecture; number of kernels.
Kernel Mixture Network (KMN) (Ambrogioni et al., 2017): Similar to MDN, with the difference that the NN only controls the weights of the GMM. Main hyperparameters: neural net architecture; method for finding kernel centers; number of kernels.
Least-Squares CDE (LSCDE) (Sugiyama et al., 2010): Computes the conditional density as a linear combination of Gaussian kernels. Main hyperparameters: method for finding kernel centers; number of kernels; kernel bandwidth.
Nearest Neighbor Kernel CDE (NNKDE) (Izbicki et al., 2017; Freeman et al., 2017): Uses nearest neighbors of the evaluation point x to compute a KDE estimate of y. Main hyperparameters: number of neighbors; kernel bandwidth.
Random Forest CDE (RFCDE) (Pospisil and Lee, 2018): Uses a random forest to partition the feature space and constructs a weighted KDE of the output space, based on the weights of the leaves in the forest. Main hyperparameters: random forest hyperparameters; basis system; number of bases.
Flexible CDE (FC) (Izbicki and Lee, 2017): Nonparametric approach which uses a basis expansion of the univariate y to turn CDE into a series of univariate regression problems. Main hyperparameters: number of expansion coefficients; regression method hyperparameters.
LinCDE (LCDE) (Gao and Hastie, 2022): Conditional training of unconditional machine learning models to learn the density. Main hyperparameters: number of LinCDE trees.

Table 3: Ablation study on post-treatment for NCP. We report the mean and std of the KS distance of the estimated CDF from the truth, averaged over 10 repetitions with n = 10⁵ (best method in bold red). NCP C and NCP W refer to our method with centering and whitening post-treatment, respectively.

Model | Linear Gaussian | Econ Density | Arma Jump | Skew Normal | Gaussian Mixture | LGGMD
NCP   | 0.040 ± 0.007 | 0.014 ± 0.003 | 0.046 ± 0.012 | 0.023 ± 0.006 | 0.027 ± 0.008 | 0.055 ± 0.010
NCP C | 0.019 ± 0.006 | 0.010 ± 0.003 | 0.037 ± 0.011 | 0.015 ± 0.004 | 0.015 ± 0.004 | 0.048 ± 0.007
NCP W | 0.010 ± 0.000 | 0.005 ± 0.001 | 0.010 ± 0.002 | 0.008 ± 0.001 | 0.015 ± 0.004 | 0.047 ± 0.005

Table 4: Mean and standard deviation of the KS distance of the estimated CDF from the truth, averaged over 10 repetitions with a sample size of 10⁴ (best method in bold red, second best in bold black). NCP C and NCP W refer to our method with centering and whitening post-treatment, respectively.
Model   | Linear Gaussian | Econ Density | Arma Jump | Skew Normal | Gaussian Mixture | LGGMD
NCP     | 0.046 ± 0.011 | 0.021 ± 0.009 | 0.048 ± 0.009 | 0.043 ± 0.029 | 0.035 ± 0.004 | 0.188 ± 0.011
NCP C   | 0.031 ± 0.008 | 0.019 ± 0.008 | 0.038 ± 0.011 | 0.031 ± 0.013 | 0.031 ± 0.003 | 0.189 ± 0.012
NCP W   | 0.026 ± 0.002 | 0.016 ± 0.003 | 0.020 ± 0.002 | 0.024 ± 0.011 | 0.030 ± 0.002 | 0.176 ± 0.014
DDPM    | 0.414 ± 0.341 | 0.264 ± 0.240 | 0.358 ± 0.314 | 0.284 ± 0.251 | 0.416 ± 0.242 | 0.423 ± 0.223
NF      | 0.011 ± 0.002 | 0.015 ± 0.003 | 0.141 ± 0.005 | 0.039 ± 0.005 | 0.113 ± 0.006 | 0.288 ± 0.010
KMN     | 0.599 ± 0.003 | 0.349 ± 0.019 | 0.490 ± 0.007 | 0.380 ± 0.009 | 0.306 ± 0.003 | 0.225 ± 0.008
MDN     | 0.245 ± 0.011 | 0.051 ± 0.002 | 0.164 ± 0.005 | 0.089 ± 0.002 | 0.144 ± 0.009 | 0.232 ± 0.008
LSCDE   | 0.418 ± 0.003 | 0.119 ± 0.004 | 0.250 ± 0.007 | 0.109 ± 0.002 | 0.201 ± 0.005 | 0.295 ± 0.034
CKDE    | 0.187 ± 0.001 | 0.023 ± 0.003 | 0.125 ± 0.002 | 0.046 ± 0.001 | 0.085 ± 0.003 | 0.241 ± 0.021
NNKCDE  | 0.090 ± 0.002 | 0.060 ± 0.006 | 0.063 ± 0.006 | 0.052 ± 0.005 | 0.059 ± 0.004 | 0.207 ± 0.013
RFCDE   | 0.132 ± 0.009 | 0.136 ± 0.010 | 0.130 ± 0.009 | 0.139 ± 0.009 | 0.134 ± 0.012 | 0.162 ± 0.006
FC      | 0.090 ± 0.004 | 0.030 ± 0.006 | 0.042 ± 0.003 | 0.033 ± 0.002 | 0.033 ± 0.003 | 0.065 ± 0.008
LCDE    | 0.122 ± 0.002 | 0.029 ± 0.003 | 0.118 ± 0.003 | 0.064 ± 0.007 | 0.050 ± 0.002 | 0.141 ± 0.004

Figure 5: Performance for CDE on the synthetic datasets w.r.t. sample size n (10² to 10⁵). The performance metric is the Kolmogorov-Smirnov (KS) distance to the truth.

Compared methods. We compared our NCP procedure for building conditional confidence intervals to the state-of-the-art conditional conformal prediction method of Gibbs et al. (2023). We also developed another method, based on a Normalizing Flow estimate of the conditional distribution, and added it to the benchmark.

Experiment for Laplace and Cauchy distributions. We generate a dataset where the variable X follows a uniform distribution on the interval [0, 5] and Y | X = x follows either a Laplace distribution with location and scale parameters (µ(x), b(x)) = (x², x) or a Cauchy distribution with location and scale parameters (µ(x), b(x)) = (x², 1 + x). We create a train set of 50000 samples, a validation set of 1000 samples, and a test set of 1000 samples. For the Laplace distribution, we train an NCP where uθ and vθ are multi-layer perceptrons with two hidden layers of 128 cells, σθ is a vector of size d = 500, and γ = 10⁻². Between each layer, we use the GELU activation function. We optimize over 5000 epochs using the Adam optimizer with a learning rate of 10⁻³. We apply early stopping with regard to the validation set with a patience of 100 epochs. Whitening is applied at the end of training. To fit the Cauchy distribution, we increase the depth of the MLPs to 5 and the width to 258. We compare this NCP network with two state-of-the-art methods. The first is a normalizing flow with a 1D Gaussian base distribution and two Autoregressive Rational Quadratic Spline flows (Durkan et al., 2019) followed by an LU Linear permutation flow. All flows come from the library normflows (Stimper et al., 2023). The spline flows each have two blocks of 128 hidden units to match the NCP architecture. The normalizing flow is allowed the same number of epochs as ours, with the same optimizer. The second model is the conditional conformal predictor from Gibbs et al. (2023). This model needs a regressor as input. We consider a situation favorable to Gibbs et al. (2023), as we assume as prior knowledge that the true conditional expectation is a polynomial function (the truth is actually the quadratic function in this example). Therefore we chose a linear regression with polynomial features, as in Gibbs et al. (2023), since this regressor should fit the data without any problem. For all other choices of parameters, we follow the prescriptions of Gibbs et al. (2023).
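A small NumPy sketch of the synthetic data-generating process described above (the function names are ours, not part of the released code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_laplace(n):
    x = rng.uniform(0.0, 5.0, size=n)                    # X ~ Unif[0, 5]
    y = rng.laplace(loc=x**2, scale=x)                   # Y | X = x ~ Laplace(mu = x^2, b = x)
    return x, y

def sample_cauchy(n):
    x = rng.uniform(0.0, 5.0, size=n)
    y = x**2 + (1.0 + x) * rng.standard_cauchy(size=n)   # Y | X = x ~ Cauchy(x^2, 1 + x)
    return x, y

x_train, y_train = sample_laplace(50_000)                # train / validation / test splits as above
x_val, y_val = sample_laplace(1_000)
x_test, y_test = sample_laplace(1_000)
```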
For the sake of fairness, we note that the validation set used for early stopping in NF and NCP was also used as the calibration set for the CCP method. By design, the Conditional Conformal Predictor (CCP) gives the confidence interval directly. However, NCP and NF output the conditional distribution. To find the smallest confidence interval with the desired coverage, we apply the linear search algorithm described in Algorithm 3 to the discretized conditional CDFs provided by NCP and NF. The results are provided in Fig. 1. First, observe that although the linear regression achieves the best estimation of the conditional mean, as should be expected since the model is well-specified in this case, its confidence intervals are nevertheless unreliable for most of the considered conditioning points. We also notice instability for NF and CCP for conditioning in the neighborhood of x = 0, with the NF confidence region exploding at x = 0. We expect this behavior is due to the fact that the conditional distribution at x = 0 is degenerate. Comparatively, NCP does not exhibit such instability around x = 0. It only tends to overestimate the produced confidence region for conditioning close to x = 0.

Figure 6: Estimated conditional PDFs (left) and CDFs (right) for each synthetic dataset (Linear Gaussian, Econ Density, Arma Jump, Skew Normal, Gaussian Mixture, LGGMD) for 3 different conditioning points. Dotted lines represent the true distributions, while solid lines represent the estimates from NCP. The average KS distance over 5 repetitions is also reported on the right plots.

Algorithm 3 Confidence interval search given a CDF
Require: Y a vector of values, FY a vector of evaluations of the CDF at the points of Y, α ∈ [0, 1] a confidence level
Initialize tlow = 0 and thigh = 0
Initialize t*low = 0 and t*high = len(Y) − 1
Initialize s = ∞
while thigh < len(Y) do
  if FY[thigh] − FY[tlow] ≥ α then
    size = Y[thigh] − Y[tlow]
    if size < s then
      t*low = tlow, t*high = thigh, s = size
    end if
    tlow = tlow + 1
  else
    thigh = thigh + 1
  end if
end while
Return Y[t*low], Y[t*high]
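For reference, a Python rendering of the linear search in Algorithm 3, assuming the CDF values are tabulated on an increasing grid of candidate values; the function name is ours.

```python
import numpy as np

def smallest_interval(y_grid, cdf_vals, alpha=0.9):
    """Among intervals [y_grid[lo], y_grid[hi]] carrying at least alpha of the
    conditional probability mass, return the shortest one (two-pointer sweep)."""
    n = len(y_grid)
    lo, hi = 0, 0
    best_lo, best_hi, best_size = 0, n - 1, np.inf
    while hi < n:
        if cdf_vals[hi] - cdf_vals[lo] >= alpha:
            size = y_grid[hi] - y_grid[lo]
            if size < best_size:
                best_lo, best_hi, best_size = lo, hi, size
            lo += 1            # try to shrink the interval from the left
        else:
            hi += 1            # extend to the right until enough mass is covered
    return y_grid[best_lo], y_grid[best_hi]
```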
Experiment on real data. We also evaluate the performance of NCP in estimating confidence intervals using the Student Performance dataset available at https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/data. This dataset comprises 10000 records, each defined by five predictors: hours studied, previous scores, extracurricular activities, sleep hours, and sample question papers practiced, with a performance index as the target variable. In this experiment, NCP's uθ and vθ are defined by MLPs with two hidden layers, each containing 32 units and using GELU activation functions, σθ is a vector of size d = 50, and γ = 10⁻². Optimization was performed over 50000 epochs using the Adam optimizer with a learning rate of 10⁻³. We compare NCP with a normalizing flow defined as above, in which the spline flows each have two blocks of 32 hidden units to match the NCP architecture. The normalizing flow is trained for the same number of epochs as our model, using the same optimizer. We further compare NCP with a split conformal predictor featuring a Random Forest regressor (RFSCP) with 100 estimators. We used the implementation of the library puncc (Mendil et al., 2023). For NCP and the normalizing flow, early stopping is based on the validation set, while for RFSCP, the validation set serves as the calibration set. We performed 10 repetitions, randomly splitting the dataset into a training set of 8000 samples and validation and test sets of 1000 samples each. We report the results of the estimated confidence intervals at a coverage level of 90% in Tab. 5. The methods provide fairly good coverage overall, but NF did not respect the 90% coverage condition. Only NCP and RFSCP respect the coverage condition, and the confidence intervals produced by RFSCP are wider than those of NCP.

Table 5: Mean and standard deviation of 90% prediction interval (PI) coverages and interval widths, averaged over 10 repetitions for the Student Performance dataset from Kaggle. NCP C and NCP W refer to our method with centering and whitening post-treatment, respectively.

Model | Coverage 90% PI | Width 90% PI
NCP C | 89.41% ± 2.12% | 0.39 ± 0.02
NCP W | 91.02% ± 0.72% | 0.38 ± 0.01
NF    | 89.10% ± 1.07% | 0.35 ± 0.00
RFSCP | 90.03% ± 1.06% | 0.41 ± 0.01

Discussion on Conformal Prediction. Conformal prediction (CP) is a popular model-agnostic framework for uncertainty quantification (Vovk et al., 1999). CP assigns nonconformity scores to new data points. These scores reflect how well each point aligns with the model's predictions. CP then uses these scores to construct a prediction region that guarantees the true outcome will fall within it with a user-specified confidence parameter. However, CP is not without limitations. The construction of these guaranteed prediction regions can be computationally expensive, especially for large datasets, and the regions need to be recomputed from scratch for each value of the confidence level parameter. In addition, the produced CP confidence regions tend to be conservative. Another limitation of regular CP is that predictions are made based on the entire input space without considering potential dependencies between variables. Conditional conformal prediction (CCP) was later developed to handle conditional dependencies between variables, allowing in principle for more accurate and reliable predictions (Gibbs et al., 2023). CCP, however, inherits the typical limitations of regular CP, including those affecting its theoretical guarantees.

C.3 High-dimensional Experiments

Experiment on high-dimensional synthetic data. In the experiment of Fig. 3, we trained NCP for d = 100 using the same MLP architecture and the same NF with autoregressive flows as in our initial experiments, based on n = 10⁵ samples {(Xᵢ, Yᵢ)}ⁿᵢ₌₁ with values in ℝ^d × 𝒴. We plot the conditional CDF for several conditioning points w.r.t. θ(x) over 10 repetitions. NCP paired with a small MLP architecture performs comparably to the NF model for Gaussian distributions. For discrete distributions, NCP demonstrates superior performance compared to the NF model. We repeated the experiment of Fig. 3 for d ∈ {100, 200, 500, 1000} and recorded the average Kolmogorov-Smirnov (KS) distance of the NCP conditional distribution to the truth, the computation time, and their standard deviations over 10 repetitions.

Figure 7: Left: execution time (in s) for 1000 epochs as a function of the ambient dimension of the features; we observe only a 20% increase in compute time going from d = 10² to d = 10³. Right: average KS distance to the truth and standard deviation over 10 repetitions:

θ(x) | d = 100 | d = 500 | d = 1000
1.0  | 0.057 ± 0.019 | 0.079 ± 0.050 | 0.062 ± 0.016
1.57 | 0.041 ± 0.009 | 0.069 ± 0.042 | 0.049 ± 0.017
3.14 | 0.030 ± 0.016 | 0.116 ± 0.173 | 0.036 ± 0.010
5.0  | 0.067 ± 0.052 | 0.131 ± 0.175 | 0.072 ± 0.036
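As a reference for the evaluation protocol, a short sketch (assuming NumPy and SciPy) of the KS-distance computation on a grid, illustrated here with the true Laplace conditional CDF of Appendix C.2; the estimated CDF would come from the fitted model.

```python
import numpy as np
from scipy import stats

def ks_distance(true_cdf_vals, est_cdf_vals):
    # Kolmogorov-Smirnov distance between two CDFs tabulated on the same grid.
    return float(np.max(np.abs(np.asarray(true_cdf_vals) - np.asarray(est_cdf_vals))))

x0 = 2.5                                                # a conditioning point
grid = np.linspace(-10.0, 40.0, 1000)                   # grid over the support of Y
true_cdf = stats.laplace.cdf(grid, loc=x0**2, scale=x0)
# est_cdf = ...  discretized conditional CDF produced by the estimator at x0
# print(ks_distance(true_cdf, est_cdf))
```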
High-dimensional experiment in molecular dynamics: Chignolin folding. We investigated the dynamics of Chignolin folding using a molecular dynamics simulation lasting 106 µs and sampled every 200 ps, resulting in 524,743 data points. Our analysis focuses on the 39 heavy atoms (nodes) with a cutoff radius of 5 Angstroms. To predict the conditional transition probability between metastable states, we integrate our NCP approach with a graph neural network (GNN) model. GNNs, as demonstrated by Chanussot et al. (2021), represent the state of the art in modeling atomistic systems, adeptly incorporating the roto-translational and permutational symmetries inherent in physical systems. In particular, we employed a SchNet model (Schütt et al., 2019, 2023) with three interaction blocks. Each block features a 64-dimensional latent atomic environment, and the inter-atomic distances for message passing are expanded over 20 radial basis functions. After the final interaction block, each latent atomic environment is processed through a linear layer and then aggregated by averaging. The model underwent training for 100 epochs using the Adam optimizer with a learning rate of 10⁻³. We employed a batch size of 256 and set γ to 10⁻³. In Fig. 2, we show how our NCP approach enables the tracking of transitions between metastable states, demonstrating accurate forecasting and strong uncertainty quantification.

NeurIPS Paper Checklist

1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: see the abstract, the Introduction, and the Related works section. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: see the discussion under the main results in the Theoretical guarantees section and the conclusion. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate Limitations section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting.
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Yes each statement clearly state all required assumptions and all proofs are provided in Appendix B. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The paper presents the pseudo-code of our method, links to the datasets and methods used to reproduce our results Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. 
While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: While the paper is predominantly theoretical, we have presented experiments which illustrate our theory. Data and code can be made available upon request during the rebuttal and will be made readily available should the paper be accepted. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: Appendix C provides all details on the architecture used for our method in order to reproduce the experiment. Guidelines: The answer NA means that the paper does not include experiments. 
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: Whenever appropriate, we provided standard deviations for the performance of the compared methods computed over several repetitions. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer Yes if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provided the information at the beginning of Appendix C. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: This is a theoretical paper. Experiments were carried out on synthetic data or publicly available data. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 
10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Paper of theoretical nature. There are no particular concerns. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: there is no such risk. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: all existing methods used in our experimental study were properly cited in the main paper and/or Appendix C. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. 
The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: This is a theoretical work. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: see above. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: see above. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.