# Density Ratio Estimation with Conditional Probability Paths

Hanlin Yu¹, Arto Klami¹, Aapo Hyvärinen¹, Anna Korba², Omar Chehab²

## Abstract

Density ratio estimation in high dimensions can be reframed as integrating a certain quantity, the time score, over probability paths which interpolate between the two densities. In practice, the time score has to be estimated from samples of the two densities. However, existing methods for this problem remain computationally expensive and can yield inaccurate estimates. Inspired by recent advances in generative modeling, we introduce a novel framework for time score estimation, based on a conditioning variable. Choosing the conditioning variable judiciously enables a closed-form objective function. We demonstrate that, compared to previous approaches, our approach results in faster learning of the time score and competitive or better estimation accuracy of the density ratio on challenging tasks. Furthermore, we establish theoretical guarantees on the error of the estimated density ratio.

## 1. Introduction

Estimating the ratio of two densities is a fundamental task in machine learning, with diverse applications (Sugiyama et al., 2010). For instance, by assuming that one of the densities is tractable, often a standard Gaussian, we can construct an estimator for the other density by estimating their ratio (Gutmann & Hyvärinen, 2012; Gao et al., 2019; Rhodes et al., 2020; Choi et al., 2022). It is also possible to consider a scenario where neither density is tractable. As noted in previous works (Choi et al., 2022), density ratio estimation finds broad applications across machine learning, from mutual information estimation (Song & Ermon, 2020), generative modelling (Goodfellow et al., 2020), importance sampling (Sinha et al., 2020), and likelihood-free inference (Izbicki et al., 2014) to domain adaptation (Wang et al., 2023).
¹University of Helsinki, Finland. ²ENSAE, CREST, IP Paris, France. Correspondence to: Hanlin Yu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: Densities are shown in blue. Left: A bi-modal probability path transitioning from a Gaussian distribution (t = 0) to a mixture of Diracs (t = 1). This path is estimated using time scores, which are not available in closed form in general; they are depicted by arrows, with magnitudes ranging from low (gray) to high (red). Right: A useful decomposition of the probability path and time scores is obtained by conditioning on a final data point. The ensuing conditional density is Gaussian, and thus, the ensuing conditional time scores are analytically tractable. We propose to use this decomposition to estimate the time scores.

The seminal work by Gutmann & Hyvärinen (2012) proposed a learning objective for estimating the ratio of two densities by identifying from which density a sample is drawn. This can be done by binary classification. However, their estimator has high variance when the densities have little overlap, which makes it impractical for problems in high dimensions (Lee et al., 2023; Chehab et al., 2023b). To address this issue, Rhodes et al. (2020) proposed connecting the two densities with a probability path and estimating density ratios between consecutive distributions. Since two consecutive distributions are close to each other, the statistical efficiency may improve, at the cost of increased computation, as there are multiple binary classification tasks to solve. Choi et al. (2022) examined the limiting case where the intermediate distributions become infinitesimally close. In this limit, the density ratio converges to a quantity known as the time score, which is learnt by optimizing a Time Score Matching (TSM) objective.
While this limiting case leads to empirical improvements, the TSM objective is computationally inefficient to optimize, and the resulting estimator may be inaccurate. Moreover, it is unclear what theoretical guarantees are associated with the estimators. In this work, we address these limitations.

First, in Section 3 we introduce a novel learning objective for the time score, which we call Conditional Time Score Matching (CTSM). It is based on recent advancements in generative modeling (Vincent, 2011; Pooladian et al., 2023; Tong et al., 2024a), which consider probability paths that are explicitly decomposed into mixtures of simpler paths, and where the time score is obtained in closed form. We demonstrate empirically that the CTSM objective significantly accelerates optimization in high-dimensional settings and is several times faster than TSM.

Second, in Section 4 we modify our CTSM objective with a number of techniques that are popular in generative modeling (Song et al., 2021b; Choi et al., 2022; Tong et al., 2024a) to ease the learning. In particular, we derive a closed-form weighting function for the objective, as well as a vectorized version of the objective which we call Vectorized Conditional Time Score Matching (CTSM-v). Together, these modifications substantially improve the estimation of the density ratio in high dimensions, leading to stable estimators and significant speedups.

Third, in Section 5 we provide theoretical guarantees for density ratio estimation using probability paths, addressing a gap in prior works (Rhodes et al., 2020; Choi et al., 2022).

## 2. Background

Our goal is to estimate the ratio between two densities $p_0$ and $p_1$, given samples from both. We start by defining a distribution over labels $t$ and data points $x$,

$$p(x, t) = p(t)\, p(x \mid t) \tag{1}$$

constructed such that we recover $p_0$ and $p_1$ for $t = 0$ and $t = 1$, respectively.
We next show how several relevant methods can be viewed as variations on this formalism.

**Binary label.** Fundamental approaches to density-ratio estimation consider a binary label $t \in \{0, 1\}$. Among them, Noise Contrastive Estimation (NCE) is based on the observation that the density ratio is related to the binary classifier $p(t \mid x)$ (Gutmann & Hyvärinen, 2012, Eq. 5). NCE estimates that classifier by minimizing a binary classification loss based on logistic regression, computed using samples drawn from $p_0$ and $p_1$. In practice, using NCE is challenging when $p_0$ and $p_1$ are far apart. In that case, the binary classification loss becomes harder to optimize (Liu et al., 2022) and the sample efficiency of its minimizer deteriorates (Gutmann & Hyvärinen, 2012; Lee et al., 2023; Chehab et al., 2023a;b).

**Continuous label.** More recent developments relax the label so that it is continuous, $t \in [0, 1]$. Now, conditioning on $t$ defines intermediate distributions $p(x \mid t)$, equivalently noted $p_t(x)$, along a probability path that connects $p_0$ to $p_1$. Then, the following identity is used (Choi et al., 2022)

$$\log \frac{p_1(x)}{p_0(x)} = \int_0^1 \partial_t \log p_t(x)\, dt, \tag{2}$$

or its discretization in time (Rhodes et al., 2020).

**Probability path.** We next consider a popular use-case, where $p_0$ is a Gaussian and $p_1$ is the data density (Rhodes et al., 2020; Choi et al., 2022); since $p_0$ is known analytically, the ratio of the two directly provides an estimator for $p_1$. In practice, one can construct a probability path where the intermediate distributions can be sampled from but their densities cannot be evaluated. This is because the probability path is defined by interpolating samples from $p_0$ and $p_1$. There are multiple ways to define such interpolations (Rhodes et al., 2020; Albergo & Vanden-Eijnden, 2023), which we will further discuss in Section 4.
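The identity in Eq. 2 can be checked numerically on a toy problem where every quantity is available in closed form. The sketch below (not from the paper; the parameters and the linear schedule $\alpha_t = t$ are hypothetical choices) uses a 1-D Gaussian path from $p_0 = \mathcal{N}(0, 1)$ to $p_1 = \mathcal{N}(m, s^2)$, for which $p_t = \mathcal{N}(t m,\, t^2 s^2 + 1 - t^2)$, and compares the integrated time score against the direct log ratio.

```python
import numpy as np

# Sketch: verify log p1(x)/p0(x) = \int_0^1 d/dt log p_t(x) dt  (Eq. 2)
# on a 1-D Gaussian path with hypothetical parameters and alpha_t = t.
m, s = 1.0, 1.5          # p1 = N(m, s^2); p0 = N(0, 1)
x = 0.7                  # point at which the density ratio is evaluated

t = np.linspace(0.0, 1.0, 4001)
v = t**2 * s**2 + 1.0 - t**2                    # variance of p_t
log_pt = -0.5 * np.log(2 * np.pi * v) - (x - t * m) ** 2 / (2 * v)

def trap(y, g):          # simple trapezoidal rule
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(g)))

# time score d/dt log p_t(x) via finite differences, integrated over the path
log_ratio_path = trap(np.gradient(log_pt, t), t)

# direct evaluation of log p1(x) - log p0(x)
log_p1 = -0.5 * np.log(2 * np.pi * s**2) - (x - m) ** 2 / (2 * s**2)
log_p0 = -0.5 * np.log(2 * np.pi) - x**2 / 2
log_ratio_direct = log_p1 - log_p0

print(log_ratio_path, log_ratio_direct)  # the two should agree closely
```

In higher dimensions the same identity holds pointwise in $x$; the difficulty, addressed below, is that $\partial_t \log p_t(x)$ is no longer available and must be learned.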
A widely used approach is the Variance-Preserving (VP) probability path, which can be simulated by (Song et al., 2021b; Lipman et al., 2023; Choi et al., 2022)

$$x_t = \alpha_t x_1 + \sqrt{1 - \alpha_t^2}\, x_0, \tag{3}$$

where $x_0 \sim \mathcal{N}(0, I)$, $x_1 \sim p_1$ follows the data distribution, time is drawn uniformly $t \sim \mathcal{U}[0, 1]$, and $\alpha_t \in [0, 1]$ is a positive function that increases from 0 to 1. By conditioning on $t$, we obtain densities $p_t(x) = \int \mathcal{N}(x; \alpha_t x_1, (1 - \alpha_t^2) I)\, p_1(x_1)\, dx_1$ that cannot be computed in closed form, given that the density $p_1$ is unknown and that the convolution requires solving a difficult integral.

**Estimating the time score.** Importantly, the identity in Eq. 2 requires estimating the time score $\partial_t \log p_t(x)$, which is the Fisher score where the parameter is the label $t$. It can also be related to the binary classifier between two infinitesimally close distributions $p_t$ and $p_{t+dt}$ (Choi et al., 2022, Proposition 3). Formally, this time score can be approximated by minimizing the following Time Score Matching (TSM) objective

$$\mathcal{L}_{\mathrm{TSM}}(\theta) = \mathbb{E}_{p(t,x)}\left[ \lambda(t) \left( \partial_t \log p_t(x) - s_\theta(x, t) \right)^2 \right], \tag{4}$$

where $\lambda(t)$ is any positive weighting function. This objective requires evaluating the time score $\partial_t \log p_t(x)$. However, as previously explained, the formula for the time score is unavailable because the densities $p_t$, while well-defined, are not known in closed form. To make the learning objective in Eq. 4 tractable, an insight from Hyvärinen (2005) led Choi et al. (2022) and Williams et al. (2025) to rewrite it using integration by parts. This yields

$$\mathcal{L}_{\mathrm{TSM}}(\theta) = 2\,\mathbb{E}_{p_0(x)}[\lambda(0)\, s_\theta(x, 0)] - 2\,\mathbb{E}_{p_1(x)}[\lambda(1)\, s_\theta(x, 1)] + \mathbb{E}_{p(t,x)}\left[ 2 \lambda(t)\, \partial_t s_\theta(x, t) + 2\, \partial_t \lambda(t)\, s_\theta(x, t) + \lambda(t)\, s_\theta(x, t)^2 \right], \tag{5}$$

which no longer requires evaluating the time score $\partial_t \log p_t(x)$. However, this approach has one clear computational drawback: differentiating the term $\partial_t s_\theta(x, t)$ in the loss Eq. 5 involves using automatic differentiation twice, first in $t$ and then in $\theta$, which can be time-consuming (we verify this in Section 6).
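The integration-by-parts step behind Eq. 5 can be sanity-checked by quadrature on a toy problem. The sketch below (not the paper's implementation; the Gaussian path, the schedule $\alpha_t = t$, the choice $\lambda(t) = 1$, and the fixed toy function standing in for $s_\theta$ are all hypothetical) verifies that Eq. 5 equals Eq. 4 up to a constant that does not depend on $s_\theta$, namely $\mathbb{E}_{p(t,x)}[(\partial_t \log p_t(x))^2]$.

```python
import numpy as np

# Sketch: check Eq. 4 = Eq. 5 + const on a 1-D Gaussian path, lambda(t) = 1.
# p0 = N(0,1), p1 = N(m, sig^2), alpha_t = t  =>  p_t = N(t*m, v_t),
# v_t = t^2 sig^2 + 1 - t^2. Toy "model": s(x,t) = t sin(x) + 0.3 x.
m, sig = 1.0, 1.5
tg = np.linspace(0.0, 1.0, 401)
xg = np.linspace(-12.0, 12.0, 2401)
t, x = tg[:, None], xg[None, :]

v = t**2 * sig**2 + 1.0 - t**2
mu = t * m
pt = np.exp(-(x - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# analytic time score of the marginal path: d/dt log N(x; mu_t, v_t)
dv, dmu = 2 * t * (sig**2 - 1.0), m
u = -dv / (2 * v) + (x - mu) * dmu / v + (x - mu) ** 2 * dv / (2 * v**2)

s = t * np.sin(x) + 0.3 * x      # toy s_theta(x, t)
ds = np.sin(x) + 0.0 * t         # its exact time derivative

def trap(y, g):                  # trapezoidal rule along the last axis
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * np.diff(g), axis=-1)

def dbl(f):                      # integrate over x, then over t
    return trap(trap(f, xg), tg)

eq4 = dbl(pt * (u - s) ** 2)                 # regression form, Eq. 4
const = dbl(pt * u**2)                       # s-independent constant
eq5 = (2 * trap(pt[0] * s[0], xg)            # + 2 E_{p0}[s(x, 0)]
       - 2 * trap(pt[-1] * s[-1], xg)        # - 2 E_{p1}[s(x, 1)]
       + dbl(pt * (2 * ds + s**2)))          # bulk term of Eq. 5

print(abs(eq4 - (eq5 + const)))  # should be close to 0
```

Here the toy function has an analytic time derivative; for a neural $s_\theta$, the term $\partial_t s_\theta$ in Eq. 5 must itself be produced by automatic differentiation, which is precisely the cost the text points out.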
This motivates us to find better ways of learning the time score.

## 3. Novel Objectives for Time Score Estimation

In this section, we propose novel methods to estimate the time score.

### 3.1. Basic Method

**Augmenting the state space.** First, we rewrite Eq. 4 so that it is tractable. The idea is to further augment the state space to $(x, t, z)$ by introducing a conditioning variable $z$, as in related literature. Thus, we extend the model from Eq. 1 into

$$p(x, t, z) = p(t)\, p(z)\, p(x \mid t, z), \tag{6}$$

such that the intermediate distributions $p(x \mid t, z)$, now conditioned on $z$, can be sampled from and evaluated. We remark that this insight is shared by previous research in score matching (Vincent, 2011) and flow matching (Lipman et al., 2023; Pooladian et al., 2023; Tong et al., 2024a). Consider for example Eq. 3. By choosing to condition on $z = x_1$, we get a closed-form $p(x \mid t, z) = \mathcal{N}(x; \alpha_t z, (1 - \alpha_t^2) I)$. In this example, $z$ is a sample of raw data (for example, real observed data) while $x$ is a corrupted version of the data, and $t$ controls the corruption level, ranging from 0 (full corruption) to 1 (no corruption), as in Vincent (2011). In the following, we explain how to relate descriptions of the intractable marginal probability path $p_t(x)$ to descriptions of the tractable conditional probability path $p_t(x \mid z)$.

**Tractable objective for learning the time score.** As a result of Eq. 6, we relate the time scores obtained with and without conditioning on $z$ (derivations are in Appendix D.1)

$$\partial_t \log p_t(x) = \mathbb{E}_{p_t(z \mid x)} \left[ \partial_t \log p_t(x \mid z) \right] \tag{7}$$

and exploit this identity to learn the time score by plugging Eq. 7 into the original loss in Eq. 4. This way, we can reformulate the intractable objective in Eq. 4 into a tractable objective, which we call the Conditional Time Score Matching (CTSM) objective

$$\mathcal{L}_{\mathrm{CTSM}}(\theta) = \mathbb{E}_{p(x, z, t)}\left[ \lambda(t) \left( \partial_t \log p_t(x \mid z) - s_\theta(x, t) \right)^2 \right]. \tag{8}$$

Note that the regression target is given by the time score of the conditional distribution, $\partial_t \log p_t(x \mid z)$.
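For the Gaussian conditional path of Eq. 3, the CTSM regression target is available in closed form. A minimal sketch (not from the paper; it assumes the hypothetical schedule $\alpha_t = t$, so $\dot\alpha_t = 1$) computes $\partial_t \log \mathcal{N}(x; \alpha_t z, (1 - \alpha_t^2) I)$ analytically and checks it against a central finite difference in $t$:

```python
import numpy as np

# Sketch: the CTSM regression target d/dt log p_t(x|z) for
# p_t(x|z) = N(x; a z, (1 - a^2) I), with a = alpha_t and da = d alpha_t/dt.
def cond_time_score(x, z, a, da):
    D = x.shape[-1]
    r = x - a * z                      # residual
    v = 1.0 - a**2                     # per-dimension variance
    return (D * a * da / v
            + da * np.dot(z, r) / v
            - a * da * np.dot(r, r) / v**2)

def cond_logpdf(x, z, a):
    D = x.shape[-1]
    v = 1.0 - a**2
    r = x - a * z
    return -0.5 * D * np.log(2 * np.pi * v) - np.dot(r, r) / (2 * v)

rng = np.random.default_rng(0)
D, t = 5, 0.6                          # hypothetical dimension and time
z = rng.normal(size=D)                 # conditioning variable (a data point)
x = t * z + np.sqrt(1 - t**2) * rng.normal(size=D)  # sample of p_t(.|z)

analytic = cond_time_score(x, z, a=t, da=1.0)
eps = 1e-5
numeric = (cond_logpdf(x, z, t + eps) - cond_logpdf(x, z, t - eps)) / (2 * eps)
print(analytic, numeric)               # the two should agree closely
```

Training then reduces to a plain weighted regression: draw $(t, z, x)$ from Eq. 6, compute this target, and fit $s_\theta(x, t)$ by least squares, with no nested automatic differentiation.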
The reformulation is justified by the following theorem:

**Theorem 1 (Regressing the time score).** *The TSM loss (Eq. 4) and the CTSM loss (Eq. 8) are equal, up to an additive constant.*

The proof can be found in Appendix D.2. This new objective is useful, as it requires evaluating the time score of the tractable distribution $p_t(x \mid z)$ instead of the intractable distribution $p_t(x)$. By minimizing this objective, the model $s_\theta(x, t)$ learns to output $\partial_t \log p_t(x)$. A similar observation was made in De Bortoli et al. (2022, Appendix L.3); however, they did not translate this observation into the CTSM objective and use it for learning. Furthermore, their setting was more restrictive, as the conditioning variable was specifically chosen to be $x_1$.

### 3.2. Vectorized Variant

We propose a further objective for learning the time score, called Vectorized Conditional Time Score Matching (CTSM-v). The idea is that we can easily vectorize the learning task by forming a joint objective over the $D$ dimensions. The intuition is that the time score can be written as a sum of autoregressive terms, and that we learn each term of the sum instead of the final result only. We verify in Section 6 that this approach empirically leads to better performance. Formally, define the vectorization of the conditional time score as the result of stacking its components as $\mathrm{vec}(\partial_t \log p_t(x \mid z)) = [\partial_t \log p_t(x_i \mid x
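For the diagonal-Gaussian conditional path of Eq. 3, the dimensions of $x$ are independent given $z$, so the autoregressive terms reduce to per-dimension conditional time scores $\partial_t \log p_t(x_i \mid z_i)$. A minimal sketch (not the paper's code; it again assumes the hypothetical schedule $\alpha_t = t$) builds this vectorized target and checks that its entries sum to the full conditional time score:

```python
import numpy as np

# Sketch: vectorized CTSM-v style target for p_t(x|z) = N(a z, (1 - a^2) I).
# Given z, dimensions are independent, so each autoregressive term is just
# the marginal per-dimension score d/dt log N(x_i; a z_i, 1 - a^2).
def per_dim_scores(x, z, a, da):
    r, v = x - a * z, 1.0 - a**2
    return a * da / v + da * z * r / v - a * da * r**2 / v**2

rng = np.random.default_rng(1)
D, t = 8, 0.4                          # hypothetical dimension and time
z = rng.normal(size=D)
x = t * z + np.sqrt(1 - t**2) * rng.normal(size=D)

vec_target = per_dim_scores(x, z, a=t, da=1.0)   # target in R^D

# full conditional time score (same closed form, summed over dimensions)
r = x - t * z
full = (D * t / (1 - t**2)
        + np.dot(z, r) / (1 - t**2)
        - t * np.dot(r, r) / (1 - t**2) ** 2)

print(np.sum(vec_target), full)        # the two should agree
```

A vector-valued model regressed onto `vec_target` thus carries strictly more information than a scalar model regressed onto its sum, which is the intuition behind learning each term separately.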