# minde_mutual_information_neural_diffusion_estimation__1b84e4fc.pdf Published as a conference paper at ICLR 2024 MINDE: MUTUAL INFORMATION NEURAL DIFFUSION ESTIMATION Giulio Franzese1, , Mustapha Bounoua1,2, Pietro Michiardi1 1EURECOM,2Ampere Software Technology, giulio.franzese@eurecom.fr In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the Kullback-Leibler (KL) divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature, especially for challenging distributions. Furthermore, our methods pass MI self-consistency tests, including data processing and additivity under independence, which instead are a pain-point of existing methods. Code available. 1 INTRODUCTION Mutual Information (MI) is a central measure to study the non-linear dependence between random variables [Shannon, 1948; Mac Kay, 2003], and has been extensively used in machine learning for representation learning [Bell & Sejnowski, 1995; Stratos, 2019; Belghazi et al., 2018; Oord et al., 2018; Hjelm et al., 2019], and for both training [Alemi et al., 2016; Chen et al., 2016; Zhao et al., 2018] and evaluating generative models [Alemi & Fischer, 2018; Huang et al., 2020]. For many problems of interest, precise computation of MI is not an easy task [Mc Allester & Stratos, 2020; Paninski, 2003], and a wide range of techniques for MI estimation have flourished. As the application of existing parametric and non-parametric methods [Pizer et al., 1987; Moon et al., 1995; Kraskov et al., 2004; Gao et al., 2015] to realistic, high-dimensional data is extremely challenging, if not unfeasible, recent research has focused on variational approaches [Barber & Agakov, 2004; Nguyen et al., 2007; Nowozin et al., 2016; Poole et al., 2019; Wunder et al., 2021; Letizia et al., 2023; Federici et al., 2023] and neural estimators [Papamakarios et al., 2017; Belghazi et al., 2018; Oord et al., 2018; Song & Ermon, 2019; Rhodes et al., 2020; Letizia & Tonello, 2022; Brekelmans et al., 2022] for MI estimation. In particular, the work by Song & Ermon [2019] and Federici et al. [2023] classify recent MI estimation methods into discriminative and generative approaches. The first class directly learns to estimate the ratio between joint and marginal densities, whereas the second estimates and approximates them separately. In this work, we explore the problem of estimating MI using generative approaches, but with an original twist. In 2 we review diffusion processes [Song et al., 2021] and in 3 we explain how, thanks to the Girsanov Theorem [Øksendal, 2003], we can leverage score functions to compute the KL divergence between two distributions. This also enables the computation of the entropy of a random variable. In 4 we present our general recipe for computing the MI between two arbitrary distributions, which we develop according to two modeling approaches, i.e., conditional and joint diffusion processes. The conditional approach is simple and capitalizes on standard diffusion models, but it is inherently more rigid, as it requires one distribution to be selected as the conditioning signal. Joint diffusion processes, on the other hand, are more flexible, but require an extension of traditional diffusion models, which deal with dynamics that allow data distributions to evolve according to multiple arrows of time. Published as a conference paper at ICLR 2024 Recent work by Czy z et al. [2023] argue that MI estimators are mostly evaluated assuming simple, multivariate normal distributions for which MI is analytically tractable, and propose a novel benchmark that introduces several challenges for estimators, such as sparsity of interactions, long-tailed distributions, invariance, and high mutual information. Furthermore, Song & Ermon [2019] introduce measures of self-consistency (additivity under independence and the data processing inequality) for MI estimators, to discern the properties of various approaches. In 5 we evaluate several variants of our method, which we call Mutual Information Neural Diffusion Estimation (MINDE), according to such challenging benchmarks: our results show that MINDE outperforms the competitors on a majority of tasks, especially those involving challenging data distributions. Moreover, MINDE passes all self-consistency tests, a property that has remained elusive so far, for existing neural MI estimators. 2 DIFFUSION PROCESSES AND SCORE FUNCTIONS We now revisit the theoretical background on diffusion processes, which is instrumental for the derivation of the methodologies proposed in this work. Consider the real space RN and its associated Borel σ algebra, defining the measurable space RN, B(RN) . In this work, we consider Ito processes in RN with duration T < . Let Ω= D [0, T] RN , be the space of all N dimensional continuous functions in the interval [0, T], and the filtration F induced by the canonical process Xt(ω) = ωt, ω Ω. As starting point, we consider an Ito process: d Xt = ft Xtdt + gtd Wt, X0 = x (1) with given continuous functions ft 0, gt > 0 and an arbitrary (deterministic) initial condition x RN. Equivalently, we can say that initial conditions are drawn from the Dirac measure δx. This choice completely determines the path measure Pδx of the corresponding probability space Ω, F, Pδx . Starting from Pδx we construct a new path measure Pµ by considering the product between Pδx and measure µ in RN: RN Pδxdµ(x). (2) Conversely, the original measure Pδx can be recovered from Pµ by conditioning the latter on the particular initial value x, i.e., the projection Pδx = Pµ#x. The new measure Pµ can be represented by the following Stochastic Differential Equation (SDE): d Xt = ft Xtdt + gtd Wt, X0 µ (3) associated to the corresponding probability spaces (Ω, F, Pµ). We define νµ t as the pushforward of the complete path measure onto time instant t [0, T], where by definition νµ 0 = µ. It is instrumental for the scope of this work to study how the path measures and the SDEs representations change under time reversal. Let ˆXt def= ωT t be the time-reversed canonical process. If the canonical process Xt is represented as in Eq. (3) under the path measure Pµ, then the time reversed process ˆXt has SDE representation [Anderson, 1982]: ( d ˆXt = f T t ˆXt + g2 T tsµ T t( ˆXt)dt + g T td ˆWt, ˆX0 νµ T (4) with corresponding path-reversed measure ˆPµ, on the probability spaces with time-reversed filtration. Next, we define the score function of the densities associated to the forward processes. In particular, sµ t (x) def= log ( νµ t (x)), where νµ t (x) is the density associated to the measures νµ t (x), computed with respect to the Lebesgue measure,dνµ t (x) = νµ t (x)dx. In general we cannot assume exact knowledge of such true score function. Then, in practice, instead of the score function sµ t (x), we use parametric (θ) approximations thereof, sµ t (x), which we call the score network. Training the score network can be done by minimizing the following loss [Song et al., 2021; Huang et al., 2021; Kingma et al., 2021]: sµ t (Xt) log ν δX0 t (Xt) 2 dt Published as a conference paper at ICLR 2024 where ν δX0 t stands for the measure of the processes at time t, conditioned on some initial value X0. 3 KL DIVERGENCE AS DIFFERENCE OF SCORE FUNCTIONS The MI between two random variables can be computed according to several equivalent expressions, which rely on the KL divergence between measures and/or entropy of measures. We then proceed to describe i) how to derive KL divergence between measures as the expected difference of score functions, ii) how to estimate such divergences given parametric approximation of the scores (and the corresponding estimation errors) and iii) how to cast the proposed methodology to the particular case of entropy estimation. In summary, this Section introduces the basic building blocks that we use in 4 to define our MI estimators. We consider the KL divergence between two generic measures µA and µB in RN, i.e. KL µA µB , which is equal to R RN dµA log dµA dµB , if the Radon-Nikodym derivative dµA dµB exists (absolute con- tinuity is satisfied), and + otherwise. Since our state space is RN, the following disintegration properties are valid [L eonard, 2014]: d PµB (ω) = d PµA#ω0 d PµB#ω0 (ω)dµA(ω0) dµB(ω0) = dµA(ω0) dµB(ω0), dˆPµA dˆPµB (ω) = d ˆPµA#ωT d ˆPµB#ωT (ω)dνµA dνµB T (ωT ) , where we implicitly introduced the product representation ˆPµA = R T (x), similarly to Eq. (2). Thanks to such disintegration theorems, we can write the KL divergence between the overall path measures PµA and PµB of two diffusion processes associated to the measures µA and µB as KL h PµA PµBi = EPµA = EPµA log dµA = KL µA µB , (7) where the second equality holds because, as observed on the left of Eq. (6), when conditioned on the same initial value, the path measures of the two forward processes coincide. Now, since the KL divergence between the path measures is invariant to time reversal, i.e., KL h PµA PµBi = KL h ˆPµA ˆPµBi , using similar disintegration arguments, it holds that: KL h ˆPµA ˆPµBi = EˆPµA log d ˆPµA#ωT The first term on the r.h.s of Eq. (8) can be computed using the Girsanov theorem [Øksendal, 2003] as g2 t sµA t ( ˆXt) sµB t ( ˆXt) 2 dt sµA t (Xt) sµB t (Xt) 2 dt The second term on the r.h.s of Eq. (8), equals KL h νµA T i : this is a vanishing term with T, i.e. lim T KL h νµA T i = 0. To ground this claim, we borrow the results by Collet & Malrieu [2008], which hold for several forward diffusion SDEs of interest, such as the Variance Preserving (VP), or Variance Exploding (VE) SDEs Song et al. [2021]. In summary, it is necessary to adapt the classical Bakry Emery condition of diffusion semigroup to the non homogeneous case, and exploit the contraction properties of diffusion on the KL divergences. Combining the different results, we have that: KL µA µB = EPµA sµA t (Xt) sµB t (Xt) 2 dt Published as a conference paper at ICLR 2024 which constitutes the basic equality over which we construct our estimators, described in 3.1. We conclude by commenting on the possibility of computing divergences in a latent space. Indeed, in many natural cases, the density µA is supported on a lower dimensional manifold M RN [Loaiza Ganem et al., 2022]. Whenever we can find encoder and decoder functions ψ, ϕ, respectively, such that ϕ(ψ(x)) = x, µA almost surely, and ϕ(ψ(x)) = x, µB almost surely, the KL divergence can be computed in the latent space obtained by the encoder ψ. Considering the pushforward measure µA ψ 1, it is indeed possible to show (proof in A) that KL µA µB = KL µA ψ 1 µB ψ 1 . This property is particularly useful as it allows using score based models trained in a latent space to compute the KL divergences of interest, as we do in 5.2. 3.1 KL ESTIMATORS AND THEORETICAL GUARANTEES Given the parametric approximations of the score networks through minimization of Eq. (5), and the result in Eq. (10), we are ready to discuss our proposed estimator of the KL divergence. We focus on the first term on the r.h.s. of Eq. (10), which has unknown value, and define its approximated version e(µA, µB) def= EPµA sµA t (Xt) sµB t (Xt) 2 dt g2 t 2 EνµA t sµA t (Xt) sµB t (Xt) 2 dt, (11) where parametric scores, instead of true score functions, are used. By defining the score error as ϵµA t (x) def= sµA t (x) sµA t (x), it is possible to show (see A) that e(µA, µB) g2 t 2 sµA t (Xt) sµB t (Xt) 2 dt has expression ϵµA t (Xt) ϵµB t (Xt) 2 + 2 sµA t (Xt) sµB t (Xt), ϵµA t (Xt) ϵµB t (Xt) dt As for the second term on the r.h.s. of Eq. (10), KL h νµA T i , we recall that it is a quantity that vanishes with large T. Consequently, given a sufficiently large diffusion time T the function e serves as an accurate estimator of the true KL: e(µA, µB) = KL µA µB + d KL h νµA T i KL µA µB . (13) An important property of our estimator is that it is neither an upper nor a lower bound of the true KL divergence: indeed the d term of Eq. (13) can be either positive or negative. This property, frees our estimation guarantees from the pessimistic results of Mc Allester & Stratos [2020]. Note also that, counter-intuitively, larger errors norms ϵµA t (x) not necessarily imply larger estimation error of the KL divergence. Indeed, common mode errors (reminiscent of paired statistical tests) cancel out. In the special case where ϵµA t (x) = ϵµB t (x), the estimation error due to the approximate nature of the score functions is indeed zero. Accurate quantification of the estimation error is, in general, a challenging task. Indeed, techniques akin to the works [De Bortoli, 2022; Lee et al., 2022; Chen et al., 2022], where guarantees are provided w.r.t. to the distance between the real backward dynamics and the measures induced by the simulated backward dynamics, KL µA µA , are not readily available in our context. Qualitatively, we observe that our estimator is affected by two sources of error: score networks that only approximate the true score function and finiteness of T. The d term in Eq. (13), which is related to the score discrepancy, suggests selection of a small time T (indeed we can expect such mismatch to behave as a quantity that increases with T [Franzese et al., 2023]). It is important however to adopt a sufficiently large diffusion time T such that KL h νµA T i is negligible. Typical diffusion schedules satisfy these requirements. Note that, if the KL term is known (or approximately known), it can be included in the definition of the estimator function, reducing the estimation error (see also discussion in 3.2). Published as a conference paper at ICLR 2024 Montecarlo Integration The analytical computation of Eq. (11) is, in general, out of reach. However, Montecarlo integration is possible, by recognizing that samples from νµA t can be obtained through the sampling scheme X0 µA, Xt ν δX0 t . The outer integration w.r.t. to the time instant is similarly possible by sampling t U(0, T), and multiplying the result of the estimation by T (since R T 0 ( )dt = TEt U(0,T )[( )]). Alternatively, it is possible to implement importance sampling schemes to reduce the variance, along the lines of what described by Huang et al. [2021], by sampling the time instant non-uniformly and modifying accordingly the time-varying constants in Eq. (11). In both cases, the Montecarlo estimation error can be reduced to arbitrary small values by collecting enough samples, with guarantees described in [Rainforth et al., 2018]. 3.2 ENTROPY ESTIMATION We now describe how to compute the entropy associated to a given density µA, H(µA) def= R dµA(x) log µA(x). Using the ideas for estimating the KL divergence, we notice that we can compute KL µA γσ , where γσ(x) stands for the standard Gaussian distribution with mean 0 and covariance σ2I. Then, we can relate the entropy to such divergence through the following equality: H(µA) + KL µA γσ = Z dµA(x) log γσ(x) = N 2 log 2πσ2 + EµA X2 0 A simple manipulation of Eq. (14), using the results from 3.1, implies that the estimation of the entropy H(µA) involves three unknown terms: e(µA, γσ), KL h νµA T νγσ T i and EµA[X2 0] 2σ2 . Now, the score function associated to the forward process starting from γσ is analytically known and has value sγσ t (x) = χ 1 t x, where χt = k2 t σ2 + k2 t R t 0 k 2 s g2 sds I, with kt = exp n R t 0 fsds o . More- over, whenever T is large enough νµA T γ1, independently on the chosen value of σ. Consequently T νγσ T i KL γ1 γ χT , which is analytically available as N/2 (log (χT ) 1 + 1/χT). Quantification of such approximation is possible following the same lines defined by Collet & Malrieu [2008]. In summary, we consider the following estimator for the entropy: 2 log 2πσ2 + EµA X2 0 2σ2 e(µA, γσ) N log (χT ) 1 + 1 For completeness, we note that a related estimator has recently appeared in the literature [Kong et al., 2022], although the technical derivation and objectives are different than ours. 4 COMPUTATION OF MUTUAL INFORMATION In this work, we are interested in estimating the MI between two random variables A, B. Consequently, we need to define the joint, conditional and marginal measures. We consider the first random variable A in RN to have marginal measure µA. Similarly, we indicate the marginal measure of the second random variable B with µB. The joint measure of the two random variables C def= [A, B], which is defined in R2N, is indicated with µC. What remains to be specified are the conditional measures of the first variable given a particular value of the second A | B = y, shortened with Ay, that we indicate with the measure µAy, and the conditional measure of the second given a particular value of the first, B | A = x, shortened with Bx, and indicated with µBx. This choice of notation, along with Bayes theorem, implies the following set of equivalences: dµC(x, y) = dµAy(x)dµB(y) = dµBx(y)dµA(x) and µA = R µAydµB(y), µB = R µBxdµA(x). The marginal measures µA, µB are associated to diffusion of the form of Eq. (3). Similarly, the joint µC and conditional µAy measures we introduced, are associated to forward diffusion processes: ( d [Xt, Yt] = ft[Xt, Yt] dt + gt [d Wt, d W t] [X0, Y0] µC , d Xt = ft Xtdt + gtd Wt X0 µAy (16) respectively, where the SDE on the l.h.s. is valid for the real space R2N, as defined in 2. Published as a conference paper at ICLR 2024 In this work, we consider two classes of diffusion processes. In the first case, the diffusion model is asymmetric, and the random variable B is only considered as a conditioning signal. As such, we learn the score associated to the random variable A, with a conditioning signal B, which is set to some predefined null value when considering the marginal measure. This well-known approach [Ho & Salimans, 2021] effectively models the marginal and conditional scores associated to µA and µAy with a unique score network. Next, we define a new kind of diffusion model for the joint random variable C, which allows modelling the joint and the conditional measures. Inspired by recent trends in multi-modal generative modeling [Bao et al., 2023; Bounoua et al., 2023], we define a joint diffusion process that allows amortized training of a single score network, instead of considering separate diffusion processes and their respective score networks, for each random variable. To do so, we define the following SDE: ( d [Xt, Yt] = ft[αXt, βYt] dt + gt [αd Wt, βd W t] , [X0, Y0] µC, (17) with extra parameters α, β {0, 1}. This SDE extends the l.h.s. of Eq. (16), and describes the joint evolution of the variables Xt, Yt, starting from the joint measure µC, with overall path measure PµC. The two extra coefficients α, β are used to modulate the speed at which the two portions Xt, Yt of the process diffuse towards their steady state. More precisely, α = β = 1 corresponds to a classical simultaneous diffusion (l.h.s. of Eq. (16)). On the other hand, the configuration α = 1, β = 0 corresponds to the case in which the variable Yt remains constant throughout all the diffusion (which is used for conditional measures, r.h.s. of Eq. (16)). The specular case, α = 0, β = 1, similarly allows to study the evolution of Yt conditioned on a constant value of X0. Then, instead of learning three separate score networks (for µC, µAy and µBx), associated to standard diffusion processes, the key idea is to consider a unique parametric score, leveraging the unified formulation Eq. (17), which accepts as inputs two vectors in RN, the diffusion time t, and the two coefficients α, β. This allows to conflate in a single architecture: i) the score sµC t (x, y) associated to the joint diffusion of the variables A, B (corresponding to α = β = 1) and ii) the conditional score sµAy t (x) (corresponding to α = 1, β = 0). Additional details are presented in C. 4.1 MINDE: A FAMILY OF MI ESTIMATORS We are now ready to describe our new MI estimator, which we call MINDE. As a starting point, we recognize that the MI between two random variables A, B has several equivalent expressions, among which Eqs. (18) to (20). On the left hand side of these expressions we report well-known formulations for the MI, I(A, B), while on the right hand side we express them using the estimators we introduce in this work, where equality is assumed to be valid up to the errors described in 3. H(A) H(A | B) e(µA, γσ) + Z e(µAy, γσ)dµB(y), (18) Z KL h µAy µAi dµB(y) Z e(µAy, µA)dµB(y), (19) H(C) H(A | B) H(B | A) e(µC, γσ) + Z e(µAy, γσ)dµB(y) + Z e(µBx, γσ)dµA(x). (20) Note that it is possible to derive (details in B) another equality for the MI: I(A, B) EPµC sµC t ([Xt, Yt]) [ sµAY0 t (Xt), sµBX0 t (Yt)] 2 dt Next, we describe how the conditional and joint modeling approaches can be leveraged to compute a family of techniques to estimate MI. We evaluate all the variants in 5. Conditional Diffusion Models. We start by considering conditional models. A simple MI estimator can be obtained considering Eq. (18). The entropy of A can be estimated using Eq. (15). Similarly, Published as a conference paper at ICLR 2024 we can estimate the conditional entropy H(A | B) using the equality H(A | B) = R H(Ay)dµB(y), where the argument of the integral, H(Ay), can be again obtained using Eq. (15). Notice, that since EµB(y)EµAy X2 0 = EµA X2 0 , when substracting the estimators of H(A) and H(A | B), all the terms but the estimator functions e( ) cancels out, leading to the equality in Eq. (18). A second option is to simply use Eq. (19) and leverage Eq. (11). Joint diffusion models. Armed with the definition of a joint diffusion processes, and the corresponding score function, we now describe the basic ingredients that allow estimation of the MI, according to various formulations. Using the joint score function sµC t ([x, y]), the estimation of the joint entropy H(A, B) can be obtained with a straightforward application of Eq. (15). Similarly, the conditional entropy H(A | B) = R H(Ay)dµB(y) can be computed using sµAy t (x) to obtain the conditional score. Notice that H(B | A) is similarly obtained. Given the above formulations of the joint and conditional entropy, it is now easy to compute the MI according to Eq. (20), where we notice that, similarly to what discussed for conditional models, many of the terms in the different entropy estimations cancel out. Finally, it is possible to compute the MI according to Eq. (21). Interestingly, this formulation allows to eliminate the need for the parameter σ of the entropy estimators, similarly to the MINDE conditional variant, which shares this property as well (Eq. (18)). 5 EXPERIMENTAL VALIDATION We now evaluate the different estimators proposed in 4. In particular, we study conditional and joint models (MINDE-C and MINDE-J respectively), and variants that exploit the difference between the parametric scores inside the same norm ( Eqs. (19) and (21)) or outside it, adopting the difference of entropies representation along with Gaussian reference scores sγc (Eqs. (18) and (20)). Summarizing, we refer to the different variants as MINDE-C(σ), MINDE-C, and MINDE-J(σ), MINDE-J, for Eqs. (18) to (21) respectively. Our empirical validation involves a large range of synthetic distributions, which we present in 5.1. We also analyze the behavior of all MINDE variants according to self-consistency tests, as discussed in 5.2. For all the settings, we use a simple, stacked multi-layer perception (MLP) with skip connections adapted to the input dimensions, and adopt VP-SDE diffusion Song et al. [2021]. We apply importance sampling [Huang et al., 2021; Song et al., 2021] at both training and inference time. More details about the implementation are included in C. 5.1 MI ESTIMATION BENCHMARK We use the evaluation strategy proposed by Czy z et al. [2023], which covers a range of distributions going beyond what is typically used to benchmark MI estimators, e.g., multivariate normal distributions. In summary, we consider high-dimensional cases with (possibly) long-tailed distributions and/or sparse interactions, in the presence of several non trivial non-linear transformation. Benchmarks are constructed using samples from several base distributions, including Uniform, Normal with either dense or sparse correlation structure, and long-tailed Student distributions. Such samples are further modified by deterministic transformations, including the Half-Cube homeomorphism, which extends the distribution tails, and the Asinh Mapping, which instead shortens them, the Swiss Roll Embedding and Spiral diffeomorphis, which alter the simple linear structure of the base distributions. We compare MINDE against neural estimators, such as MINE [Belghazi et al., 2018], INFONCE [Oord et al., 2018],NWJ [Nguyen et al., 2007] and DOE [Mc Allester & Stratos, 2020]. To ensure a fair comparison between MINDE and other neural competitors, we consider architectures with a comparable number of parameters. Note that the original benchmark in [Czy z et al., 2023] uses 10k training samples, which are in many cases not sufficient to obtain stable estimates of the MI for our competitors. Here, we use a larger training size (100k samples) to avoid confounding factors in our analysis. In all our experiments, we fix σ = 1.0 for the MINDE-C(σ), MINDE-J(σ) variants, which results in the best performance (an ablation study is included in D). Results: The general benchmark consists of 40 tasks (10 unique tasks 4 parametrizations) designed by combining distributions and MI-invariant transformations discussed earlier. We average results over 10 seeds for MINDE variants and competitors, following the same protocol as in Czy z Published as a conference paper at ICLR 2024 et al. [2023]. We present the full set of MI estimation tasks in Table 1. As in the original Czy z et al. [2023], estimates for the different methods are presented with a precision of 0.1 nats, to improve visualization. For low-dimensional distributions, benchmark results show that all methods are effective in accurate MI estimation. Differences emerge for more challenging scenarios. Overall, all our MINDE variants perform well. MINDE C stands out as the best estimator with 35/40 estimated tasks with an error within the 0.1 nats quantization range. Moreover, MINDE can accurately estimate the MI for long tailed distributions (Student) and highly transformed distributions (Spiral, Normal CDF), which are instead problematic for most of the other methods. The MINE estimator achieves the second best performance, with an MI estimation within 0.1 nats from ground truth for 24/40 tasks. Similarly to the other neural estimator baselines, MINE is limited when dealing with long tail distributions (Student), and significantly transformed distributions (Spiral). High MI benchmark: Through this second benchmark, we target high MI distributions. We consider 3 3 multivariate normal distribution with sparse interactions as done in Czy z et al. [2023]. We vary the correlation parameter to obtain the desired MI, and test the estimators when applying Half-cube or Spiral transformations. Results in Figure 1 show that while on the non transformed distribution (column (a)) all neural estimators nicely follow the ground truth, on the transformed versions (columns (b) and (c)), MINDE outperforms competitors. (a) Sparse Multinormal (b) Half-cube Figure 1: High MI benchmark: original (column (a)) and transformed variants (columns (b) and (c)). 5.2 CONSISTENCY TESTS The second set of tests we perform are the self-consistency ones proposed in Song & Ermon [2019], which aim at investigating properties of MI estimators on real data. Considering as random variable A a sample from the MNIST (resolution 28 28) dataset, the first set of measurements performed is the estimation of I(A, Br), where Br is equal to A for the first r rows, and set to 0 afterwards. It is evident that I(A, Br) is a quantity that increases with r, where in particular I(A, B0) = 0. Testing whether this holds also for the estimated MI is referred to as independency test. The second test proposed in Song & Ermon [2019] is the data-processing test, where given that I(A; [Br+k, Br]) = I(A; Br+k), k > 0, the task is to verify it through estimators for different values of k. Finally, the additivity tests aim at assessing whether for two independent images A1, A2 extracted from the dataset, the property I([A1, A2]; [B1 r, B2 r]) = 2I(A1; B1 r) is satisfied also by the numerical estimations. For these tests, we consider diffusion models in a latent space, exploiting the invariance of KL divergences to perfect auto-encoding (see 3). First, we train for all tests deterministic auto-encoders for the considered images. Then, through concatenation of the latent variables, as done in [Bao et al., 2023; Bounoua et al., 2023], we compute the MI with the different schemes proposed in 4. Results of the three tests (averaged over 5 seeds) are reported in Figure 2. In general, all MINDE variants show excellent performance, whereas none of the other neural MI estimators succeed at passing simultaneously all tests, as can be observed from Figures 4,5,6 in the original Song & Ermon [2019]). Published as a conference paper at ICLR 2024 (a) Baseline test (b) Data processing test (c) Additivity test Figure 2: Consistency tests results on the MNIST dataset. Baseline test Figure 2a: Evaluation of I(A,Br) I(A,B0). A is an image and Br is an image containing the top t rows of A. Data processing test Figure 2b: Evaluation of I(A,[Br+k,Br)]) I(A,Br+k) (ideal value is 1). Additivity test Figure 2c: Evaluation of I([A1,A2],[B1 r,B2 r]) I(A1,B1r) (ideal value is 2). GT 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.2 0.4 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 MINDE J (σ = 1) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.1 1.0 1.0 1.0 0.3 0.9 1.2 1.0 0.4 1.0 0.6 1.7 0.4 1.0 1.0 1.0 0.9 0.9 0.9 1.0 0.9 1.0 0.2 0.4 0.2 0.3 0.2 0.5 0.3 0.5 1.6 0.3 0.4 MINDE J 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.2 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.7 0.4 1.1 1.0 1.0 1.0 0.9 0.9 1.1 1.0 1.0 0.1 0.2 0.2 0.3 0.2 0.5 0.3 0.4 1.7 0.3 0.4 MINDE C (σ = 1) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.9 0.9 0.9 0.9 1.0 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINDE C 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 1.0 1.0 1.0 0.9 0.9 0.9 1.0 1.0 1.0 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 MINE 0.2 0.4 0.2 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 0.9 0.9 0.8 0.7 0.6 0.9 0.9 0.9 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.4 1.7 0.3 0.4 Info NCE 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.2 0.3 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 D-V 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.0 0.0 0.1 0.1 0.2 0.2 0.2 0.4 1.7 0.3 0.4 NWJ 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.0 0.0 0.0 -0.6 0.1 0.1 0.2 0.4 1.7 0.3 0.4 Do E(Gaussian) 0.2 0.5 0.3 0.6 0.4 0.4 0.4 0.7 1.0 1.0 1.0 0.4 0.7 7.8 1.0 0.6 0.9 1.3 0.4 0.7 1.0 1.0 0.5 0.6 0.6 0.6 0.7 0.8 6.7 7.9 1.8 2.5 0.6 4.2 1.2 1.6 0.1 0.4 Do E(Logistic) 0.1 0.4 0.2 0.4 0.4 0.4 0.4 0.6 0.9 0.9 1.0 0.3 0.7 7.8 1.0 0.6 0.9 1.3 0.4 0.8 1.1 1.0 0.5 0.6 0.6 0.7 0.8 0.8 2.0 0.5 0.8 0.3 1.5 0.6 1.6 0.1 0.4 Asinh @ St 1 1 (dof=1) Asinh @ St 2 2 (dof=1) Asinh @ St 3 3 (dof=2) Asinh @ St 5 5 (dof=2) Bimodal 1 1 Bivariate Nm 1 1 Hc @ Bivariate Nm 1 1 Hc @ Mn 25 25 (2-pair) Hc @ Mn 3 3 (2-pair) Hc @ Mn 5 5 (2-pair) Mn 2 2 (2-pair) Mn 2 2 (dense) Mn 25 25 (2-pair) Mn 25 25 (dense) Mn 3 3 (2-pair) Mn 3 3 (dense) Mn 5 5 (2-pair) Mn 5 5 (dense) Mn 50 50 (dense) Nm CDF @ Bivariate Nm 1 1 Nm CDF @ Mn 25 25 (2-pair) Nm CDF @ Mn 3 3 (2-pair) Nm CDF @ Mn 5 5 (2-pair) Sp @ Mn 25 25 (2-pair) Sp @ Mn 3 3 (2-pair) Sp @ Mn 5 5 (2-pair) Sp @ Nm CDF @ Mn 25 25 (2-pair) Sp @ Nm CDF @ Mn 3 3 (2-pair) Sp @ Nm CDF @ Mn 5 5 (2-pair) St 1 1 (dof=1) St 2 2 (dof=1) St 2 2 (dof=2) St 3 3 (dof=2) St 3 3 (dof=3) St 5 5 (dof=2) St 5 5 (dof=3) Swiss roll 2 1 Uniform 1 1 (additive noise=0.1) Uniform 1 1 (additive noise=0.75) Wiggly @ Bivariate Nm 1 1 Table 1: Mean MI estimates over 10 seeds using N = 10k test samples against ground truth (GT). Color indicates relative negative (red) and positive bias (blue). All methods were trained with 100k samples. List of abbreviations ( Mn: Multinormal, St: Student-t, Nm: Normal, Hc: Half-cube, Sp: Spiral) 6 CONCLUSION The estimation of MI stands as a fundamental goal in many areas of machine learning, as it enables understanding the relationships within data, driving representation learning, and evaluating generative models. Over the years, various methodologies have emerged to tackle the difficult task of MI estimation, addressing challenges posed by high-dimensional, real-world data. Our work introduced a novel method, MINDE, which provides a unique perspective on MI estimation by leveraging the theory of diffusion-based generative models. We expanded the classical toolkit for information-theoretic analysis, and showed how to compute the KL divergence and entropy of random variables using the score of data distributions. We defined several variants of MINDE, which we have extensively tested according to a recent, comprehensive benchmark that simulates real-world challenges, including sparsity, long-tailed distributions, invariance to transformations. Our results indicated that our methods outperform state-of-the-art alternatives, especially on the most challenging tests. Additionally, MINDE variants successfully passed self-consistency tests, validating the robustness and reliability of our proposed methodology. Our research opens up exciting avenues for future exploration. One compelling direction is the application of MINDE to large-scale multi-modal datasets. The conditional version of our approach enables harnessing the extensive repository of existing pre-trained diffusion models. For instance, it could find valuable application in the estimation of MI for text-conditional image generation. Conversely, our joint modeling approach offers a straightforward path to scaling MI estimation to more than two variables. A scalable approach to MI estimation is particularly valuable when dealing with complex systems involving multiple interacting variables, eliminating the need to specify a hierarchy among them. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENTS GF and PM gratefully acknowledges support from Huawei Paris and the European Commission (ADROIT6G Grant agreement ID: 101095363). Alexander A Alemi and Ian Fischer. Gilbo: One metric to measure them all. Advances in Neural Information Processing Systems, 31, 2018. Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In International Conference on Learning Representations, 2016. Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. ar Xiv preprint ar Xiv:2211.01324, 2022. Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. ar Xiv preprint ar Xiv:2303.06555, 2023. David Barber and Felix Agakov. The im algorithm: a variational approach to information maximization. Advances in neural information processing systems, 16(320):201, 2004. Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, 2018. Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129 1159, 1995. Mustapha Bounoua, Giulio Franzese, and Pietro Michiardi. Multi-modal latent diffusion. ar Xiv preprint ar Xiv:2306.04445, 2023. Rob Brekelmans, Sicong Huang, Marzyeh Ghassemi, Greg Ver Steeg, Roger Baker Grosse, and Alireza Makhzani. Improving mutual information estimation with annealed and energy-based bounds. In International Conference on Learning Representations, 2022. Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In International Conference on Learning Representations, 2022. Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016. Jean-Franc ois Collet and Florent Malrieu. Logarithmic sobolev inequalities for inhomogeneous markov semigroups. ESAIM: Probability and Statistics, 12:492 504, 2008. Paweł Czy z, Frederic Grabowski, Julia E Vogt, Niko Beerenwinkel, and Alexander Marx. Beyond normal: On the evaluation of mutual information estimators. Advances in Neural Information Processing Systems, 2023. Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. Marco Federici, David Ruhe, and Patrick Forr e. On the effectiveness of hybrid mutual information estimation. ar Xiv preprint ar Xiv:2306.00608, 2023. Published as a conference paper at ICLR 2024 Giulio Franzese, Simone Rossi, Lixuan Yang, Alessandro Finamore, Dario Rossi, Maurizio Filippone, and Pietro Michiardi. How much is enough? a study on diffusion times in score-based generative models. Entropy, 2023. Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial intelligence and statistics, pp. 277 286. PMLR, 2015. R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In Neur IPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusion-based generative models and score matching. Advances in Neural Information Processing Systems, 34: 22863 22876, 2021. Sicong Huang, Alireza Makhzani, Yanshuai Cao, and Roger Grosse. Evaluating lossy compression rates of deep generative models. In International Conference on Machine Learning. PMLR, 2020. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. Xianghao Kong, Rob Brekelmans, and Greg Ver Steeg. Information-theoretic diffusion. In International Conference on Learning Representations, 2022. Alexander Kraskov, Harald St ogbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004. Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative modeling with polynomial complexity. Advances in Neural Information Processing Systems, 35:22870 22882, 2022. Christian L eonard. Some properties of path measures. S eminaire de Probabilit es XLVI, pp. 207 230, 2014. Nunzio A Letizia and Andrea M Tonello. Copula density neural estimation. ar Xiv preprint ar Xiv:2211.15353, 2022. Nunzio A Letizia, Nicola Novello, and Andrea M Tonello. Variational f-divergence and derangements for discriminative mutual information estimation. ar Xiv preprint ar Xiv:2305.20025, 2023. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Jesse C Cresswell, and Anthony L Caterini. Diagnosing and fixing manifold overfitting in deep generative models. Transactions on Machine Learning Research, 2022. David JC Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003. David Mc Allester and Karl Stratos. Formal limitations on the measurement of mutual information. In International Conference on Artificial Intelligence and Statistics, 2020. Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995. Published as a conference paper at ICLR 2024 Xuan Long Nguyen, Martin J Wainwright, and Michael Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, 2007. Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. F-gan: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 16, pp. 271 279, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819. Bernt Øksendal. Stochastic differential equations. Springer, 2003. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Advances in neural information processing systems, 2018. Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):1191 1253, 2003. George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30, 2017. Stephen M Pizer, E Philip Amburn, John D Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3):355 368, 1987. Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019. Tom Rainforth, Rob Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting monte carlo estimators. In International Conference on Machine Learning, pp. 4267 4276. PMLR, 2018. Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical textconditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204. 06125. Benjamin Rhodes, Kai Xu, and Michael U Gutmann. Telescoping density-ratio estimation. Advances in neural information processing systems, 2020. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684 10695, June 2022. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022. Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379 423, 1948. Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. In International Conference on Learning Representations, 2019. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Published as a conference paper at ICLR 2024 Karl Stratos. Mutual information maximization for simple and accurate part-of-speech induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. Gerhard Wunder, Benedikt Groß, Rick Fritschek, and Rafael F Schaefer. A reverse jensen inequality result with application to mutual information estimation. In 2021 IEEE Information Theory Workshop (ITW), 2021. Shengjia Zhao, Jiaming Song, and Stefano Ermon. A lagrangian perspective on latent variable generative models. In Proc. 34th Conference on Uncertainty in Artificial Intelligence, 2018. Published as a conference paper at ICLR 2024 MINDE: MUTUAL INFORMATION NEURAL DIFFUSION ESTIMATION SUPPLEMENTARY MATERIAL A PROOFS OF 3 Proof of Auto-encoder invariance of KL. Whenever we can find encoder and decoder functions ϕ, ψ respectively such that ϕ(ψ(x)) = x, µA almost surely and ϕ(ψ(x)) = x, µB almost surely, the Kullback-Leibler divergence can be computed in the latent space obtained by the encoder ψ: KL µA µB = Z dµB dµA = Z dµB ϕ ψ dµA = Z ψ(M) log dµA dµB ϕ d µA ψ 1 = ψ(M) log dµA dµB ψ( 1) d µA ψ 1 = KL µA µB . (22) Proof of Eq. (12). To prove such claim, it is sufficient to start from the r.h.s. of Eq. (11), substitute to the parametric scores their definition with the errors ϵµA t (x) = sµA t (x) sµA t (x), and expand the square: g2 t 2 EνµA t sµA t (Xt) sµB t (Xt) 2 dt = g2 t 2 EνµA t sµA t (Xt) + ϵµA t (x) sµB t (Xt) ϵµB t (x) 2 dt = g2 t 2 EνµA t sµA t (Xt) sµB t (Xt) 2 dt+ g2 t 2 EνµA t ϵµA t (Xt) ϵµB t (Xt) 2 + 2 sµA t (Xt) sµB t (Xt)+, ϵµA t (Xt) ϵµB t (Xt) dt, from which the definition of d holds. B PROOF OF EQ. (21) We start with the approximation of Eq. (20): I(A, B) e(µC, γσ) + Z e(µAy, γσ)dµB(y) + Z e(µBx, γσ)dµA(x). (23) Since the approximation is valid for any σ, we select the limit of σ , where the reference score χ 1 t x converges to zero, and can thus be neglected from the estimators integral (for example, Published as a conference paper at ICLR 2024 e(µA, γ ) TR g2 t 2 EνµA t sµA t (Xt) 2 dt). This allows to obtain: Z dνµC t ([x, y]) sµC t ([x, y]) 2 dt+ Z dνµAy t (x) sµAy t (x) 2 dt Z dνµBx t (y) sµBx t (y) 2 dt As a further step in the derivation of our approximation, we consider the estimated scores to be sufficiently good, such that we substitute the parametric with the true scores. In such case, the following holds: Z dµC([x0, y0])dν δ[x0,y0] t ([x, y]) sµC t ([x, y]) 2 + sµAy0 t (x) 2 + sµBx0 t (y) 2 dt = Z dµC([x0, y0])dν δ[x0,y0] t ([x, y]) sµC t ([x, y]) 2 + [sµAy0 t (x), sµBx0 t (y)] 2 dt = Z dµC([x0, y0])dν δ[x0,y0] t ([x, y]) 2 sµC t ([x, y]) 2 + sµC t ([x, y]) [sµAy0 t (x), sµBx0 t (y)] 2 + 2 D sµC t ([x, y]), [sµAy0 t (x), sµBx0 t (y)] E dt. Recognizing that the term sµC t ([x, y]) [sµAy0 t (x), sµBx0 t (y)] 2 , averaged over the measures, is just Eq. (21) in disguise, what remain to be assessed is the following: Z dµC([x0, y0])dν δ[x0,y0] t ([x, y]) 2 sµC t ([x, y]) 2 + 2 D sµC t ([x, y]), [sµAy0 t (x), sµBx0 t (y)] E dt = 0. (24) In particular, we focus on the term: Z dµC([x0, y0])dν δ[x0,y0] t ([x, y]) D sµC t ([x, y]), [sµAy0 t (x), sµBx0 t (y)] E dt = D sµC t ([x, y]), x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµAy0 t (x), Z x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµBx0 t (y) dt. Published as a conference paper at ICLR 2024 Since dν δ[x0,y0] t ([x, y]) = dν δx0 t (x)dν δy0 t (y) and dµC([x0, y0]) = dµAy0(x0)dµB(y0), then R x0 dµC([x0, y0])dν δ[x0,y0] t ([x, y]) = dνµAy0 t (x)dν δy0 t (y)dµB(y0). Consequently: Z x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµAy0 t (x) = Z y0 dνµAy0 t (x)dν δy0 t (y)dµB(y0)sµAy0 t (x) = y0 dνµAy0 t (x)dν δy0 t (y)dµB(y0) log νµAy0 t (x) = Z y0 dνµAy0 t (x)dν δy0 t (y)dµB(y0) νµAy0 t (x) νµAy0 t (x) = y0 dν δy0 t (y)dµB(y0) νµAy0 t (x) = dx Z y0 dν δy0 t (y)dµB(y0) νµAy0 t (x) = dxdνµB t (y) Z y0 dµB | Yt=y(y0) νµAy0 t (x) = dxdνµB t (y) νµA | Yt=y t (x) , where in the last line we introduced: µB | Yt=y(y0), the measure of the random variable B conditionally on the fact that the diffused variable B after a time t is equal to y and νµA | Yt=y, the conditional measure of the diffused variable A at time t, conditionally on the diffused variable B after a time t equal to y. Finally dxdνµB t (y) νµA | Yt=y t (x) = νµA | Yt=y t (x)dxdνµB t (y) νµA | Yt=y t (x) νµA | Yt=y t (x) = dνµC t ([x, y])sµA | Yt=y t (x). Along the same lines, we can prove the equality R x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµBx0 t (y) = dνµC t ([x, y])sµB | Xt=x t (y). Then, restarting from Eq. (25) we have: D sµC t ([x, y]), x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµAy0 t (x), Z x0,y0 dµC([x0, y0])dν δ[x0,y0] t ([x, y])sµBx0 t (y) dt = D sµC t ([x, y]), h dνµC t ([x, y])sµA | Yt=y t (x), dνµC t ([x, y])sµB | Xt=x t (y) i E dt = x,y dνµC t ([x, y]) D sµC t ([x, y]), [sµA | Yt=y t (x), sµB | Xt=x t (y)] E dt = x,y dνµC t ([x, y]) sµC t ([x, y]) 2 dt, which finally allows to prove Eq. (24) and claim validity of Eq. (21). C IMPLEMENTATION DETAILS In this Section, we provide additional technical details of MINDE. We discuss the different variants of our method their implementation details, including detailed information about the MI estimators alternatives considered in 5. C.1 MINDE-C In all experiments, we consider the first variable as the main variable and the second variable as the conditioning signal. A single neural network is used to model the conditional and unconditional score. It accepts as inputs the two variables, the diffusion time t, and an additionally binary input c which enable the conditional mode. To enable the conditional mode, we set c = 1 and feed the network with both the main variable and the conditioning signal, obtaining s µA Y0 t . To obtain the marginal score sµA t , we set c = 0 and the conditioning signal is set to zero value. Published as a conference paper at ICLR 2024 Algorithm 1: MINDE C (Single Training Step) Data: [X0, Y0] µC parameter :netθ(), with θ current parameters t U[0, T] // Importance sampling can be used to reduce variance Xt kt X0 + k2 t R t 0 k 2 s g2 sds 1 2 ϵ, with ϵ γ1 // r.h.s. of Eq. (16), diffuse the variable X to timestep t c Bernoulli(d) // Sample binary variable c with probability d if c = 0 then ˆϵ (k2 t R t 0 k 2 s g2sds) 1 2 netθ([Xt, 0], t, c = 0) // Estimated unconditional score ˆϵ (k2 t R t 0 k 2 s g2sds) 1 2 netθ([Xt, Y0], t, c = 1)) // Estimated conditional score L = g2 t (k2 t R t 0 k 2 s g2sds) ϵ ˆϵ 2 // Compute Montecarlo sample associated to Eq. (5) return Update θ according to gradient of L Algorithm 2: MINDE C Data: [X0, Y0] µC parameter :σ, option t U[0, T] // Importance sampling can be used to reduce variance Xt kt X0 + k2 t R t 0 k 2 s g2 sds 1 2 ϵ, with ϵ γ1 // r.h.s. of Eq. (16), diffuse the variable X to timestep t sµA t netθ([Xt, 0], t, c = 0) // Use the unique score network to compute s µA Y0 t netθ([Xt, Y0], t, c = 1)) // marginal and conditional scores if option = 1 then ˆI T g2 t 2 sµA t sµAY0 t 2 χt k2 t σ2 + k2 t R t 0 k 2 s g2 sds ˆI T g2 t 2 2 sµAY0 t + Xt A randomized procedure is used for training. For each training step, with probability d, the main variable is diffused and the score network is fed with the diffused variable, the conditioning variable, the diffusion time signal and the conditioning signal is set to c = 1. On the contrary, with probability 1 d, to enable the network to learn the unconditional score, the network is fed only with the diffused modality, the diffusion time and c = 0. In contrast to the first case, the conditioning is not provided to the score network and replaced with a zero value vector. Pseudocode is presented in Algorithm 1. Actual estimation of the MI is then possible either by leveraging Eq. (18) or Eq. (19), referred to in the main text as difference outside or inside the score respectively (MINDE-C(σ), MINDE-C). A pseudo-code description is provided in Algorithm 2. C.2 MINDE-J The joint variant of our method, MINDE-J is based on the parametrized joint processes in Eq. (17). Also in this case, instead of training a separate score network for each possible combination of conditional modalities, we use a single architecture that accepts both variables, the diffusion time t and the coefficients α, β. This approach allows modelling the joint score network sµC t by setting Published as a conference paper at ICLR 2024 α = β = 1. Similarly, to obtain the conditional scores it is sufficient to set α = 1, β = 0 or α = 0, β = 1, corresponding to s µA Y0 t and s µA X0 t respectively. Training is carried out again through a randomized procedure. At each training step, with probability d, both variables are diffused. In this case, the score network is fed with diffusion time t, along with Xt, Yt and the two parameters α = β = 1. With probability 1 d, instead, we randomly select one variable to be diffused, while we keeping constant the other. For instance, if A is the one which is diffused, we set α = 1 and β = 0. Further details are presented in Algorithm 3. Once the score network is trained, MI estimation can be obtained following the procedure explained in Algorithm 4. Two options are possible, either by computing the difference between the parametric scores outside the same norm (Eq. (20) MINDE-J(σ) or inside (Eq. (21) MINDE-J). Similarly to the conditional case, an option parameter can be used to switch among the two. Algorithm 3: MINDE J (Single Training Step) Data: [X0, Y0] µC parameter :netθ(), with θ current parameters t U[0, T] // Importance sampling can be used to reduce variance [Xt, Yt] kt[X0, Y0] + k2 t R t 0 k 2 s g2 sds 1 2 [ϵ1, ϵ2], with ϵ1,2 γ1 // l.h.s. Eq. (16), diffuse modalities to timestep t c Bernoulli(d) // Sample binary variable c with probability d if c = 0 then [ˆϵ1,ˆϵ2] (k2 t R t 0 k 2 s g2sds) 1 2 netθ([Xt, Yt], t, [1, 1]) // Estimated unconditional score L = g2 t (k2 t R t 0 k 2 s g2sds) [ϵ1, ϵ2] [ ˆϵ1, ˆϵ2] 2 // Compute Montecarlo sample associated to Eq. (5) else if Bernoulli(0.5) then ˆϵ1 (k2 t R t 0 k 2 s g2sds) 1 2 netθ([Xt, Y0], t, [1, 0]) // Estimated Conditional score L = g2 t (k2 t R t 0 k 2 s g2 sds) ϵ1 ˆϵ1 2 ˆϵ2 (k2 t R t 0 k 2 s g2sds) 1 2 netθ([X0, Yt], t, [0, 1]) // Estimated Conditional score L = g2 t (k2 t R t 0 k 2 s g2sds) ϵ2 ˆϵ2 2 return Update θ according to gradient of L C.3 TECHNICAL SETTINGS FOR MINDE-C AND MINDE-J We follow the implementation of Bounoua et al. [2023] which uses stacked multi-layer perception (MLP) with skip connections. We adopt a simplified version of the same score network architecture: this involves three Residual MLP blocks. We use the Adam optimizer [Kingma & Ba, 2015] for training and Exponential moving average (EMA) with a momentum parameter m = 0.999. We use importance sampling at train and test-time. We returned the mean estimate on the test data set over 10 runs. The hyper-parameters are presented in Table 2 and Table 3 for MINDE-J and MINDE-C respectively. Concerning the consistency tests ( 5.2), we independently train an autoencoder for each version of the MNIST dataset with r rows available. Published as a conference paper at ICLR 2024 Algorithm 4: MINDE J Data: [X0, Y0] µC parameter :σ, option t U[0, T] // Importance sampling can be used to reduce variance [Xt, Yt] kt[X0, Y0] + k2 t R t 0 k 2 s g2 sds 1 2 [ϵ1, ϵ2], with ϵ1,2 γ1 // l.h.s. Eq. (16), diffuse modalities to timestep t sµC t netθ([Xt, Yt], t, [1, 1]) // Use the unique score network to compute joint s µA Y0 t netθ([Xt, Y0], t, [1, 0]) // and conditional scores s µA X0 t netθ([X0, Yt], t, [0, 1]) if option = 1 then ˆI T g2 t 2 sµC t [ sµAY0 t , sµBX0 t ] 2 χt k2 t σ2 + k2 t R t 0 k 2 s g2 sds ˆI T g2 t 2 sµC t + [Xt,Yt] 2 sµAY0 t + Xt 2 sµBX0 t + Yt Table 2: MINDE-J score network training hyper-parameters. Dim of the task correspond the sum of the two variables dimensions, whereas d corresponds to the randomization probability. d Width Time embed Batch size Lr Iterations Number of params Benchmark (Dim 10) 0.5 64 64 128 1e-3 234k 55490 Benchmark (Dim = 50) 0.5 128 128 256 2e-3 195k 222100 Benchmark (Dim = 100) 0.5 256 256 256 2e-3 195k 911204 Consistency tests 0.5 256 256 64 1e-3 390k 1602080 C.4 NEURAL ESTIMATORS IMPLEMENTATION We use the package benchmark-mi1 implementation to study the neural estimators. We use MLP architecture with 3 layers of the same width as in MINDE. We use the same training procedure as in Czy z et al. [2023], including early stopping strategy. We return the highest estimate on the test data. D ABLATIONS STUDY D.1 σ ABLATION STUDY We hereafter report in Table 4 the results of all the variants of MINDE, including different values of σ parameter. For completeness in our experimental campaign, we report also the results of non neural competitors, similarly to the work in Czy z et al. [2023]. In summary, the MINDE-C/J versions ( difference inside ) of our estimator prove to be more robust than the MINDE-C/J(σ) ( difference outside ) counterpart, especially for the joint variants. Nevertheless, it is interesting to notice that the difference outside variants are stable and competitive against a very wide range of values of σ (ranging from 0.5 to 10), with their best value typically achieved for σ = 1.0. 1https://github.com/cbg-ethz/bmi Published as a conference paper at ICLR 2024 Table 3: MINDE-C score network training hyper-parameters. Dim of the task correspond the sum of the two variables dimensions, and d corresponds to the randomization probability. d Width Time embed Batch size Lr Iterations Number of params Benchmark (Dim 10) 0.5 64 64 128 1e-3 390k 55425 Benchmark (Dim = 50) 0.5 128 128 256 2e-3 290k 220810 Benchmark (Dim = 100) 0.5 256 256 256 2e-3 290k 898354 Consistency tests 0.5 256 256 64 1e-3 390k 1597968 Published as a conference paper at ICLR 2024 GT 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.2 0.4 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 MINDE j 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.2 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.7 0.4 1.1 1.0 1.0 1.0 0.9 0.9 1.1 1.0 1.0 0.1 0.2 0.2 0.3 0.2 0.5 0.3 0.4 1.7 0.3 0.4 MINDE c 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 1.0 1.0 1.0 0.9 0.9 0.9 1.0 1.0 1.0 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 MINDE j (σ = 0.5) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.3 1.0 1.1 1.0 0.3 0.9 1.1 1.0 0.4 1.0 0.6 1.7 0.4 1.0 1.0 1.0 0.8 0.9 0.9 1.0 1.0 1.0 0.2 0.3 0.2 0.3 0.2 0.5 0.3 0.5 1.7 0.3 0.4 MINDE j (σ = 1) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.1 1.0 1.0 1.0 0.3 0.9 1.2 1.0 0.4 1.0 0.6 1.7 0.4 1.0 1.0 1.0 0.9 0.9 0.9 1.0 0.9 1.0 0.2 0.4 0.2 0.3 0.2 0.5 0.3 0.5 1.6 0.3 0.4 MINDE j (σ = 1.5) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 0.9 1.2 1.0 0.4 1.0 0.6 1.7 0.4 0.9 1.0 1.0 0.9 0.9 0.9 0.9 0.9 1.0 0.2 0.4 0.2 0.2 0.2 0.5 0.3 0.5 1.6 0.3 0.4 MINDE j (σ = 2) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 0.9 1.0 0.3 0.9 1.3 1.0 0.4 1.0 0.6 1.8 0.4 0.9 1.0 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.2 0.4 0.2 0.3 0.2 0.5 0.3 0.5 1.7 0.3 0.4 MINDE j (σ = 3) 0.2 0.4 0.2 0.4 0.4 0.4 0.4 0.9 1.0 0.9 1.0 0.3 0.9 1.3 1.0 0.4 0.9 0.6 1.8 0.4 0.9 1.0 0.9 0.9 0.9 0.8 0.9 0.9 0.9 0.2 0.4 0.2 0.2 0.1 0.5 0.3 0.5 1.6 0.3 0.4 MINDE j (σ = 5) 0.2 0.4 0.2 0.4 0.4 0.4 0.4 0.9 1.0 0.9 1.0 0.3 1.0 1.4 1.0 0.4 0.9 0.6 1.8 0.4 0.9 1.0 0.9 0.9 0.9 0.8 0.8 0.9 0.9 0.2 0.4 0.2 0.2 0.1 0.5 0.2 0.5 1.6 0.3 0.4 MINDE j (σ = 10) 0.2 0.4 0.2 0.4 0.4 0.4 0.4 0.8 1.0 0.9 1.0 0.3 1.0 1.4 1.0 0.4 0.9 0.6 1.9 0.4 0.9 1.0 0.9 0.9 0.9 0.8 0.8 0.9 0.9 0.2 0.4 0.2 0.2 0.1 0.5 0.2 0.5 1.6 0.3 0.4 MINDE c (σ = 0.5) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 0.9 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.7 0.4 0.9 1.0 1.0 0.9 1.0 0.9 0.9 1.0 1.0 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.6 0.3 0.4 MINDE c (σ = 1) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.9 0.9 0.9 0.9 1.0 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINDE c (σ = 1.5) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINDE c (σ = 2) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.2 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.6 0.3 0.4 MINDE c (σ = 3) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINDE c (σ = 5) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.2 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINDE c (σ = 10) 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.2 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 0.9 0.9 0.9 0.9 0.9 1.0 0.9 0.1 0.3 0.2 0.3 0.2 0.4 0.3 0.3 1.7 0.3 0.4 MINE 0.2 0.4 0.2 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 0.9 0.9 0.8 0.7 0.6 0.9 0.9 0.9 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.4 1.7 0.3 0.4 Info NCE 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.2 0.3 0.2 0.3 0.2 0.4 0.3 0.4 1.7 0.3 0.4 D-V 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.0 0.0 0.1 0.1 0.2 0.2 0.2 0.4 1.7 0.3 0.4 NWJ 0.2 0.4 0.3 0.4 0.4 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.0 1.3 1.0 0.4 1.0 0.6 1.6 0.4 0.9 1.0 1.0 0.8 0.8 0.8 0.9 1.0 1.0 0.0 0.0 0.0 -0.6 0.1 0.1 0.2 0.4 1.7 0.3 0.4 KSG 0.2 0.4 0.2 0.2 0.4 0.4 0.4 0.2 0.9 0.7 1.0 0.3 0.2 1.1 1.0 0.4 0.7 0.6 1.3 0.4 0.2 0.9 0.7 0.2 0.7 0.6 0.2 0.9 0.7 0.2 0.2 0.1 0.1 0.1 0.2 0.2 0.4 1.7 0.3 0.4 LNN 0.2 0.9 2.7 6.6 0.4 0.4 0.4 17.3 2.7 6.4 1.3 0.6 17.3 17.3 3.1 2.5 7.3 6.8 33.5 0.4 17.3 2.5 7.3 17.3 3.1 7.3 17.3 2.4 7.2 -0.2 -0.6 0.5 1.2 2.1 2.7 5.5 0.3 1.5 0.3 0.4 CCA 0.0 0.0 0.0 0.0 0.3 0.4 0.4 1.0 1.0 1.0 1.0 0.3 1.1 1.3 1.0 0.4 1.0 0.6 1.8 0.4 1.0 1.0 1.0 0.9 0.2 0.4 1.0 0.8 0.9 0.3 1.0 0.1 0.1 0.0 0.3 0.0 0.0 1.6 0.2 0.4 Asinh @ St 1 1 (dof=1) Asinh @ St 2 2 (dof=1) Asinh @ St 3 3 (dof=2) Asinh @ St 5 5 (dof=2) Bimodal 1 1 Bivariate Nm 1 1 Hc @ Bivariate Nm 1 1 Hc @ Mn 25 25 (2-pair) Hc @ Mn 3 3 (2-pair) Hc @ Mn 5 5 (2-pair) Mn 2 2 (2-pair) Mn 2 2 (dense) Mn 25 25 (2-pair) Mn 25 25 (dense) Mn 3 3 (2-pair) Mn 3 3 (dense) Mn 5 5 (2-pair) Mn 5 5 (dense) Mn 50 50 (dense) Nm CDF @ Bivariate Nm 1 1 Nm CDF @ Mn 25 25 (2-pair) Nm CDF @ Mn 3 3 (2-pair) Nm CDF @ Mn 5 5 (2-pair) Sp @ Mn 25 25 (2-pair) Sp @ Mn 3 3 (2-pair) Sp @ Mn 5 5 (2-pair) Sp @ Nm CDF @ Mn 25 25 (2-pair) Sp @ Nm CDF @ Mn 3 3 (2-pair) Sp @ Nm CDF @ Mn 5 5 (2-pair) St 1 1 (dof=1) St 2 2 (dof=1) St 2 2 (dof=2) St 3 3 (dof=2) St 3 3 (dof=3) St 5 5 (dof=2) St 5 5 (dof=3) Swiss roll 2 1 Uniform 1 1 (additive noise=0.1) Uniform 1 1 (additive noise=0.75) Wiggly @ Bivariate Nm 1 1 Table 4: MINDE j and MINDE c σ ablations study .Mean MI estimates over 10 seeds using N = 10000 samples compared each against the ground-truth. Color indicates relative negative bias (red) and positive bias (blue). Our method and all neural estimators were trained with 100k training samples. List of abbreviations ( Mn: Multinormal, St: Student-t, Nm: Normal, Hc: Half-cube, Sp: Spiral) Published as a conference paper at ICLR 2024 D.2 FULL RESULTS WITH STANDARD DEVIATION We report in Table 5 mean results without quantization for the different methods. Figures 3 and 4 contains box-plots for all the competitors and all the tasks. Published as a conference paper at ICLR 2024 GT 0.22 0.43 0.29 0.45 0.41 0.41 0.41 1.02 1.02 1.02 1.02 0.29 1.02 1.29 1.02 0.41 1.02 0.59 1.62 0.41 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 1.02 0.22 0.43 0.19 0.29 0.18 0.45 0.30 0.41 1.71 0.33 0.41 MINDE j (σ = 1) 0.21 0.40 0.26 0.40 0.41 0.41 0.41 1.10 1.01 1.00 1.01 0.29 0.91 1.18 1.00 0.42 1.00 0.59 1.72 0.40 0.96 0.96 0.99 0.87 0.90 0.90 0.95 0.95 0.98 0.17 0.35 0.18 0.25 0.17 0.47 0.29 0.49 1.65 0.31 0.41 MINDE j 0.22 0.42 0.28 0.42 0.41 0.42 0.42 1.19 1.02 1.02 1.02 0.29 0.99 1.31 1.02 0.41 1.01 0.59 1.73 0.41 1.07 0.99 0.99 0.95 0.92 0.93 1.07 0.99 0.98 0.13 0.24 0.20 0.30 0.18 0.48 0.31 0.40 1.67 0.31 0.42 MINDE c (σ = 1) 0.21 0.42 0.27 0.42 0.41 0.41 0.41 0.96 1.00 0.99 1.01 0.29 1.00 1.26 1.01 0.41 1.01 0.59 1.62 0.40 0.94 0.97 0.95 0.92 0.94 0.93 0.93 0.96 0.94 0.13 0.28 0.18 0.29 0.18 0.42 0.30 0.32 1.67 0.31 0.41 MINDE c 0.21 0.42 0.28 0.42 0.41 0.41 0.41 1.00 1.01 1.01 1.01 0.29 1.00 1.27 1.01 0.41 1.01 0.59 1.60 0.40 0.98 0.99 0.98 0.92 0.94 0.94 0.98 0.98 0.98 0.14 0.26 0.19 0.28 0.17 0.44 0.29 0.40 1.66 0.31 0.41 MINE 0.23 0.38 0.24 0.36 0.40 0.41 0.41 0.96 0.99 0.98 1.01 0.30 0.99 1.28 1.01 0.41 1.00 0.59 1.60 0.39 0.88 0.90 0.90 0.81 0.70 0.65 0.88 0.89 0.87 0.02 0.01 0.12 0.12 0.13 0.16 0.22 0.39 1.66 0.32 0.41 Info NCE 0.22 0.41 0.27 0.40 0.41 0.41 0.41 0.98 1.01 1.01 1.02 0.29 0.99 1.28 1.01 0.41 1.02 0.59 1.61 0.40 0.92 0.98 0.99 0.83 0.84 0.82 0.92 0.96 0.96 0.15 0.30 0.18 0.27 0.17 0.41 0.28 0.40 1.69 0.32 0.41 D-V 0.22 0.41 0.27 0.40 0.41 0.41 0.41 0.98 1.01 1.01 1.02 0.29 0.99 1.28 1.01 0.41 1.02 0.59 1.61 0.40 0.93 0.98 0.99 0.82 0.82 0.81 0.92 0.96 0.96 0.01 0.05 0.11 0.13 0.15 0.22 0.21 0.40 1.69 0.32 0.41 NWJ 0.22 0.41 0.27 0.40 0.41 0.41 0.41 0.98 1.01 1.01 1.02 0.29 0.99 1.28 1.01 0.41 1.02 0.59 1.60 0.40 0.93 0.98 0.98 0.82 0.82 0.80 0.92 0.95 0.96 0.03 0.02 0.04 -0.65 0.12 0.12 0.21 0.40 1.69 0.32 0.41 KSG 0.22 0.38 0.19 0.24 0.42 0.42 0.42 0.17 0.87 0.66 1.03 0.29 0.20 1.07 0.95 0.41 0.74 0.57 1.28 0.42 0.20 0.92 0.72 0.18 0.71 0.55 0.20 0.90 0.69 0.16 0.22 0.09 0.12 0.07 0.20 0.15 0.42 1.68 0.32 0.42 LNN 0.25 0.89 2.71 6.65 0.41 0.42 0.42 2.68 6.45 1.27 0.65 3.10 2.49 7.31 6.76 0.39 2.49 7.27 3.10 7.31 2.38 7.24 0.53 1.20 2.12 2.74 5.48 0.34 1.48 0.29 0.42 CCA 0.00 0.00 0.00 0.00 0.34 0.41 0.38 0.99 0.95 0.96 1.02 0.29 1.06 1.33 1.02 0.41 1.02 0.60 1.75 0.39 1.00 0.97 0.96 0.86 0.23 0.39 0.99 0.84 0.90 0.33 0.95 0.11 0.13 0.01 0.31 0.02 0.02 1.63 0.19 0.38 Do E(Gaussian) 0.16 0.48 0.27 0.57 0.37 0.37 0.39 0.67 0.97 0.96 0.95 0.35 0.67 7.83 0.95 0.64 0.94 1.27 16.11 0.37 0.70 0.99 0.95 0.48 0.58 0.56 0.57 0.74 0.77 6.74 7.93 1.78 2.54 0.61 4.24 1.17 1.57 0.11 0.38 Do E(Logistic) 0.13 0.37 0.21 0.43 0.41 0.35 0.36 0.62 0.92 0.92 0.95 0.34 0.69 7.83 0.95 0.63 0.93 1.27 16.15 0.41 0.78 1.08 1.05 0.47 0.60 0.55 0.67 0.79 0.81 -0.32 2.00 0.46 0.82 0.29 1.48 0.59 1.58 0.08 0.35 Asinh @ St 1 1 (dof=1) Asinh @ St 2 2 (dof=1) Asinh @ St 3 3 (dof=2) Asinh @ St 5 5 (dof=2) Bimodal 1 1 Bivariate Nm 1 1 Hc @ Bivariate Nm 1 1 Hc @ Mn 25 25 (2-pair) Hc @ Mn 3 3 (2-pair) Hc @ Mn 5 5 (2-pair) Mn 2 2 (2-pair) Mn 2 2 (dense) Mn 25 25 (2-pair) Mn 25 25 (dense) Mn 3 3 (2-pair) Mn 3 3 (dense) Mn 5 5 (2-pair) Mn 5 5 (dense) Mn 50 50 (dense) Nm CDF @ Bivariate Nm 1 1 Nm CDF @ Mn 25 25 (2-pair) Nm CDF @ Mn 3 3 (2-pair) Nm CDF @ Mn 5 5 (2-pair) Sp @ Mn 25 25 (2-pair) Sp @ Mn 3 3 (2-pair) Sp @ Mn 5 5 (2-pair) Sp @ Nm CDF @ Mn 25 25 (2-pair) Sp @ Nm CDF @ Mn 3 3 (2-pair) Sp @ Nm CDF @ Mn 5 5 (2-pair) St 1 1 (dof=1) St 2 2 (dof=1) St 2 2 (dof=2) St 3 3 (dof=2) St 3 3 (dof=3) St 5 5 (dof=2) St 5 5 (dof=3) Swiss roll 2 1 Uniform 1 1 (additive noise=.1) Uniform 1 1 (additive noise=.75) Wiggly @ Bivariate Nm 1 1 Table 5: Mean estimate over 10 seeds using N = 10000 samples compared each against the ground-truth. Our method and all neural estimators were trained with 100k training samples. List of abbreviations ( Mn: Multinormal, St: Student-t, Nm: Normal, Hc: Half-cube, Sp: Spiral) Published as a conference paper at ICLR 2024 Figure 3: We report MI estimate results over 10 seeds for N =10000 for our method and competitors for training size 100k sample. A method absent from the depiction implies either non convergence during training or results out of scale Published as a conference paper at ICLR 2024 Figure 4: We report MI estimate results over 10 seeds for N =10000 for our method and competitors for training size 100k sample. D.3 TRAINING SIZE ABLATION STUDY We here report, in Figures 5 to 8 the results of our ablation study on the training size, varying in the range 5k,10k,50k,100k. Published as a conference paper at ICLR 2024 Figure 5: Training Size ablation study : We report MI estimate results for our method and competitors as a function of the training size used (5k,10k,50k,100k). For readability, we discard the baselines with estimation (error > 2 * GT) or high standard deviation. All results are averaged over 5 seeds. Due the benchmark size, we split the results into 4 figures each containing 10 benchmarks. A method absent from the depiction implies either non convergence during training or results out of scale. In this first plot we report tasks 1-10. Published as a conference paper at ICLR 2024 Figure 6: Part 2 of Figure 5, tasks 11-20. Published as a conference paper at ICLR 2024 Figure 7: Part 3 of Figure 5, tasks 21-30. Published as a conference paper at ICLR 2024 Figure 8: Part 4 of Figure 5, tasks 31-40. E ANALYSIS OF CONDITIONAL DIFFUSION DYNAMICS USING MINDE Diffusion models have achieved outstanding success in generating high-quality images, text, audio, and video across various domains. Recently, the generation of diverse and realistic data modalities Published as a conference paper at ICLR 2024 (images, videos, sound) from open-ended text prompts [Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022] has projected practitioners into a whole new paradigm for content creation. A remarkable property of our MINDE method is its generalization to any score based model. Then, our method can be considered as a plug and play tool to explore information theoretic properties of score-based diffusion models: in particular, in this section we use MINDE to estimate MI in order to explain the dynamics of image conditional generation, by analyzing the influence of the prompt on the image generation through time. Prompt influence of conditional sampling. Generative diffusion models can be interpreted as iterative schemes in which starting from pure noise, at each iteration, refinements are applied until a sample from the data distribution is obtained. In recent work on text conditional image generation (image A, text prompt B) by Balaji et al. [2022], it has been observed that the role of the text prompt throughout the generative process has not constant importance . Indeed: At the early sampling stage, when the input data to the denoising network is closer to the random noise, the diffusion model mainly relies on the text prompt to guide the sampling process. As the generation continues, the model gradually shifts towards visual features to denoise images, mostly ignoring the input text prompt [Balaji et al., 2022]. Such claim has been motivated by carefully engineered metric analysis such as self and cross attention maps between images and text, as a function of the generation time, as well as visual inspection of the change in generated images when switching the prompt at different stages of the refinement. Using MINDE, we can refine heuristic-based methods and produce a similar analysis using theoretically sound information theoretic quantities. In particular, we analyze the conditional mutual information I(A, B | Xτ), being Xτ the result of the generation process at time τ (recall that the time runs backward from T to 0 during generation, and consequently A = X0 and B = Y0). Such metric quantifies, given an observation of the generation process at time τ, how much information the prompt B carries about the final generated image A. Clearly, when τ = T, the initial sample is independent from both A and B. Consequently, the conditional mutual information will coincide with I(A, B). More formally, we consider the following quantity: I(A, B | Xτ) = I(A, B) [I(Xτ, B) I(Xτ, A | B)] , (26) = I(A, B) [H(Xτ) H(Xτ|B) H(Xτ|B) + H(Xτ|A, B)] , (27) = I(A, B) I(Xτ, B), (28) where Eq. (27) is simplified due to the Markov chain A X0 Xτ, so H(Xτ|A, B) = H(Xτ|X0, B) = H(Xτ|B). Next, we use our MINDE estimator, whereby the marginal and conditional entropies can be estimated efficiently. The following approximation of the quantity in interest can be derived: I(A, B | Xτ) EPµC sµA t (Xt) sµAY0 t (Xt) 2 dt In our experiments, we also include a MINDE-(σ) version which can be obtained similarly to Eq. (29). Experimental setting. We perform our experimental analysis of the influence of a prompt on image generation using Stable Diffusion [Rombach et al., 2022], using the original code-base and pre-trained checkpoints.2 The original Stable Diffusion model was trained using the DDPM framework [Ho et al., 2020] on images latent space. This framework is equivalent to the discrete-time version of VPSDE [Song et al., 2021]. Using the text prompt samples from LAION dataset Schuhmann et al. [2022], we synthetically generate image samples. We set guidance mechanism to 1.0 to ensure that 2https://huggingface.co/stabilityai/stable-diffusion-2-1 Published as a conference paper at ICLR 2024 the images only contain text conditional content. We use 1000 samples and approximate the integral using a Simpson integrator 3 with a discretization over 1000 timesteps. Results. We report in Figure 9 values of I(A, B | Xτ) as a function of (reverse) diffusion time, where A is in the image domain and B is in the text domain. In a similar vein to what observed by Balaji et al. [2022], our results indicate that I(A, B | Xτ) is very high when τ T, which indicates that the text prompt has maximal influence during the early stage of image generation. This measurement is relatively stable at high MI values until τ 0.8. Then, the influence of the prompt gradually fades, as indicated by decreasing steadily MI values. This corroborates the idea that mutual information can be adopted as an exploratory tool for the analysis of complex, high dimensional distributions in real use cases. The intuition pointed out by our MINDE estimator is further consolidated by the qualitative samples in Figure 10, where we perform the following experiment: we test whether switching from an original prompt to a different prompt during the backward diffusion semantically impacts the final generated images. We observe that changing the prompt before τ 0.8 results almost surely with semantically coherent generated image with the second prompt. Instead, when τ < 0.8, the second prompt influence diminishes gradually. We observe that for all the qualitative samples shown in Figure 10 the second prompt has no influence on the generated image after τ < 0.7. Figure 9: I(A, B | Xτ) as a function of τ. 3https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate. simpson.html Published as a conference paper at ICLR 2024 Figure 10: To validate the explanatory results obtained via the application of our MINDE estimator, we perform the following experiment: Conditional generation is carried out with Prompt 1 until time τ, whereas after the conditioning signal is switched to Prompt 2. We use the same Stable diffusion model as in the previous experiment with guidance scale set to 9. F SCALABALITY OF MINDE In this Section, we study the generalization of our MINDE estimator to more than two random variables. We consider the information interaction between three random variables A,B and C, defined as: I(A, B, C) = I(A, B) I(A, B|C) (30) = H(A) H(A|B) (H(A|B, C) H(A, C)) (31) Estimation of such quantity is possible through a simple extension of Eq. (17) to three random variables, considering three parameters α, β, γ {0, 1}. In particular, we explore the case where the three random variables are distributed according to a multivariate Gaussian distribution: A γ1, B = A + N1 (with N1 γϵ) and C = A + N2 (with N2 γρ). By changing the values of the parameters, it is possible to change the value of the interaction information. We report in Figure 11 the estimated values versus the corresponding ground truths, showing that MINDE variants can be effectively adapted for the task of information estimation between more than two random variables. Figure 11: MI estimation results for MINDE-j on 3 variables