# diffeomorphic_information_neural_estimation__1498fd7c.pdf

Diffeomorphic Information Neural Estimation

Bao Duong, Thin Nguyen

Applied Artificial Intelligence Institute, Deakin University, Australia {duongng,thin.nguyen}@deakin.edu.au

Mutual Information (MI) and Conditional Mutual Information (CMI) are multi-purpose tools from information theory that are able to naturally measure the statistical dependencies between random variables, thus they are usually of central interest in several statistical and machine learning tasks, such as conditional independence testing and representation learning. However, estimating CMI, or even MI, is infamously challenging due the intractable formulation. In this study, we introduce DINE (Diffeomorphic Information Neural Estimator) a novel approach for estimating CMI of continuous random variables, inspired by the invariance of CMI over diffeomorphic maps. We show that the variables of interest can be replaced with appropriate surrogates that follow simpler distributions, allowing the CMI to be efficiently evaluated via analytical solutions. Additionally, we demonstrate the quality of the proposed estimator in comparison with state-of-the-arts in three important tasks, including estimating MI, CMI, as well as its application in conditional independence testing. The empirical evaluations show that DINE consistently outperforms competitors in all tasks and is able to adapt very well to complex and high-dimensional relationships.

Introduction Mutual Information (MI) and Conditional Mutual Information (CMI) are pivotal dependence measures between random variables for general non-linear relationships. In statistics and machine learning, they have been employed in a broad variety of problems, such as conditional independence testing (Runge 2018; Mukherjee, Asnani, and Kannan 2020), unsupervised representation learning (Chen et al. 2016), search engine (Magerman and Marcus 1990), and feature selection (Peng, Long, and Ding 2005). The MI of two random variables X and Y measures the expected point-wise information, where the expectation is taken over the joint distribution PXY . Due to the expectation, estimating mutual information for continuous variables remains notoriously difficult. Even if one possesses the specification of the joint distribution, i.e., a closedform of the density, which is most of the time unknown in practice, the expectation may still be intractable. Consequently, exact MI estimation is only possible for discrete

Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

random variables. Historically, MI has been estimated by non-paramtric approachs (Kwak and Choi 2002; Paninski 2003; Kraskov, Stögbauer, and Grassberger 2004), which are however not widely applicable due to their unfriendliness with sample size or dimensionality. Recently, variational approaches have been proposed to estimate the lower bound of MI (Belghazi et al. 2018; Oord, Li, and Vinyals 2018). However, a critical limitation of MI lower bound estimators has been studied by (Mc Allester and Stratos 2020), who show that any distribution-free high-confidence lower bound estimation of mutual information is limited above by O (ln n) where n is the sample size. More recent approaches includes hashing (Noshad, Zeng, and Hero 2019), classifier-based estimator (Mukherjee, Asnani, and Kannan 2020), and inductive maximum-entropy copula approach (Samo 2021). While estimating MI is hard, estimating CMI is of magnitudes harder due to the presence of the conditioning set. Therefore, CMI estimation methods have seen slower developments than its MI counterparts. Recent developments for CMI estimation include (Runge 2018; Molavipour, Bassi, and Skoglund 2021; Mukherjee, Asnani, and Kannan 2020). Present work. In this paper, we propose DINE1 (Diffeomorphic Information Neural Estimator) a unifying framework that closes the gap between the CMI and MI estimation problems. The approach is advantageous compared with novel variational methods in the way that it can estimate the exact information measure, instead of a lower-bound. Specifically, we harness the observation that CMI is invariant over conditional diffeomorphisms, i.e., differentiable and invertible maps with differentiable inverse parametrized by the conditioning variable. As a direct consequence, first, we can now build a welldesigned conditional diffeomorphic transformation that breaks the statistical dependence between the conditioning variable with the transformed variables, but keeps the information measure unchanged, reducing the CMI to an equivalent MI. Second, the approach offers a complete control over the distribution form of the newly induced MI estimation problem, thus we can easily restrict it to an amenable class of simple distributions with well-established properties and estimate the resultant MI via available analytic forms. Being

1Source code and relevant data sets are available at https://github.com/baosws/DINE

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

aided by the powerful expressivity of neural networks and normalizing flows (Papamakarios et al. 2021), we can define a rich family of diffeomorphic transformations that can handle a wide range of non-linear relationships, but are still efficient in sample size and dimensionality. Our numerical experiments show that the proposed DINE estimator can consistently outperforms the state-of-the-arts in both the MI and CMI estimation tasks. We also apply DINE to test for conditional independence (CI) an important statistical problem where the presence of the conditioning variable is a major obstacle, in which the empirical results indicate that the distinctively accurate CMI estimation of DINE allows for a high-accuracy test. Contributions. The key contributions of our study are summarized as follows: We present a reduction of any CMI estimation problem to an equivalent MI estimation problem with the unchanged information measure, which overcomes the central difficulty in CMI estimation compared with MI estimation. We introduce DINE, a CMI estimator that is flexible, efficient, and trainable via gradient-based optimizers. We also provide some theoretical properties of the method. We demonstrate the accuracy of DINE in estimating both MI and CMI in comparisons with state-of-the-arts under varying sample sufficiencies, dimensionalities, and nonlinear relationships. As a follow-up application of CMI estimation, we also use DINE to test for conditional independence (CI) an important statistical problem with a central role in causality, and show that the test performs really well, as well as being able to surpass state-of-the-art baselines by large margins.

Background In this Section we formalize the CMI estimation problem and explain the characterization of CMI that motivated our method. Regarding notational interpretations, we use capitalized letters X, Y , etc., for random variables/vectors, with lowercase letters x, y, etc., being their respective realizations; the distribution is denoted by P ( ) with the respective density p ( ).

Conditional Mutual Information The Conditional Mutual Information between continuous random variables X and Y given Z (with respective compact support sets X, Y, Z) is defined as

I (X, Y |Z) = Z

X p (x, y, z) ln p (x, y|z) p (x|z) p (y|z)dxdydz

= Ep(x,y,z)

ln p (x, y|z) p (x|z) p (y|z)

where we have assumed that the underlying distributions admit the corresponding densities p ( ). Having CMI defined, our technical research question is to estimate I (X, Y |Z) using the empirical distribution P (n) XY Z

of n i.i.d. samples, without having access to the true distribution PXY Z.

Conditional Mutual Information Re-parametrization

Let us first recall that a diffeomorphism is defined as

Definition 1. (Diffeomorphism). A map τ ( ) : X X is called a diffeomorphism if it is differentiable and invertible, and its inverse is also differentiable.

This kind of transformation is of great interest because it exhibits an important invariance property of MI, which was established by (Kraskov, Stögbauer, and Grassberger 2004):

Lemma 1. (MI Re-parametrization, (Kraskov, Stögbauer, and Grassberger 2004)). Let τX : X X and τY : Y Y be two diffeomorphisms where x = τX (x) and y = τY (y), then we have:

I (X, Y ) = I (X , Y ) (3)

Proof. See the Supplementary Material (Duong and Nguyen 2022b).

Inspired by this attractive property, CMI can be shown to be also invariant via any conditional diffeomorphism, which we define as

Definition 2. (Conditional Diffeomorphism). A differentiable map τ ( ; ) : X Z X is called a conditional diffeomorphism if τ ( ; z) : X X is a diffeomorphism for any z Z.

The following Lemma states that it is possible to reparametrize CMI via some conditional diffeomorphisms:

Lemma 2. (CMI Re-parametrization). Let τX : X Z X and τY : Y Z Y be two conditional diffeomorphisms such that PX Y |Z = PX Y , where x = τX (x; z) and y = τY (y; z), then the following holds:

I (X, Y |Z) = I (X , Y ) (4)

Proof. See the Supplementary Material (Duong and Nguyen 2022b).

The Diffeomorphic Information Neural Estimator (DINE)

Our framework can be described using two main components, namely the CMI approximator and the CMI estimator. While the approximator concerns the hypothesis class of models that are used to approximate the CMI given the access to the true data distribution, the CMI estimator defines how to estimate the CMI using models in the said approximator class, but with only a finite sample size.

CMI Approximation We start by giving the general CMI approximator based on densities (as a direct solution to Eqn. (2)): Definition 3. (Density-based CMI approximator). Given a family of density approximators with parameters θ Θ. The density-based CMI approximator IΘ (X, Y |Z) is defined as

IΘ (X, Y |Z) = Ep(x,y,z)

ln pθ (x, y|z) pθ (x|z) pθ (y|z)

where the parameter θ = (θ X, θ Y , θ XY ) Θ are Maximum Likelihood Estimators (MLE) of the true densities p (x, y|z), p (x|z), and p (y|z):

θ X = arg max θX Ep(x,z) [ln pθ (x|z)] (6)

θ Y = arg max θY Ep(y,z) [ln pθ (y|z)] (7)

θ XY = arg max θXY Ep(x,y,z) [ln pθ (x, y|z)] (8)

The innovation of DINE is fueled by the invariance property of CMI over diffeomorphic transformations as stated in Eqn. (4). To realize this end, the recently emerging Normalizing Flows (NF) technique offers us the exact tool we need to exploit the benefits we have just gained from the CMI reparametrization. Simply put, NF offers a general framework to model probability distributions (in this case PXY |Z) by expressing it in terms of a simple base distribution (here PX Y ) and a series of bijective transformations (the diffeomorphisms in our method). For more technical details regarding NFs, see (Kobyzev, Prince, and Brubaker 2020; Papamakarios et al. 2021). Based on this, our approach involves the design of a class of conditional normalizing flows (in contrast with the unconditional normalizing flows that are not parametrized by Z), referred to as the Diffeomorphic Information Neural Approximator (DINA), and formalized as Definition 4. (Diffeomorphic Information Neural Approximator (DINA)). A DINA DΘ is a density-based CMI approximator characterized by the following elements: A compact parameter domain Θ. A family of base distributions {Pθ (X , Y )}θ Θ. A family of conditional normalizing flows {τθ ( ; ) : X Z X }θ Θ. Then, the approximation is defined as

IΘ (X, Y |Z) = IΘ (X , Y ) (9) with x = τθ X (x; z) and y = τθ Y (y; z). As mentioned, MI estimation from finite data is still difficult if the underlying distribution function is unknown or the expectation is intractable. Fortunately, the use of normalizing flows allows us to have a complete control over the distribution of the surrogate variables X and Y . However, with an arbitrary base distribution, the finite sample estimation of the CMI as in Eqn. (5) still involves

averaging the log-density terms, which results in a possibly very large estimation variance. Therefore, to reduce the estimation variance, we look for distributions with simple closed-form expression of the MI. Towards this end, the Gaussian distribution is an excellent choice thanks to its well-studied information-theoretic properties, especially the availability of a closed-form MI that we can make use of. In more details, when the base distribution Pθ (X , Y ) is jointly Gaussian, we approximate the CMI as follows:

Definition 5. (DINA-Gaussian). If Pθ (X , Y ) is multivariate Gaussian, then the DINA approximator with Gaussian base is defined as

IN Θ (X , Y ) = 1

2 ln det Σp(x,z) (X ) det Σp(y,z) (Y )

det Σp(x,y,z) (X Y ) (10)

where Σp(x,z) (X ) is the covariance matrix of X evaluated on the true distribution PXZ, and so on.

For the rest of the main Sections we will assume that Pθ (X , Y ) is jointly multivariate Gaussian with standard Gaussian marginals. That being said, the framework is still flexible to adapt to arbitrary base distributions that are independent of Z, as long as it is efficient to evaluate the marginal MI between X and Y . We describe in more details the architecture of the normalizing flows employed for our framework in the next Section.

Learning Conditional Diffeomorphisms Among a diversely developed literature of normalizing flows (Kobyzev, Prince, and Brubaker 2020; Papamakarios et al. 2021), autoregressive flows remain one of the earliest and most widely adopted. The most attractive characteristic of autoregressive flows is their intrinsic expressiveness. More concretely, autoregressive flows are universal approximators of densities (Papamakarios et al. 2021), meaning they can approximate any probability density to an arbitrary accuracy. Suppose X and U are d-dimensional real-valued random vectors, where we wish to model the true p (x) with respect to the base pθ (u). Autoregressive flows transform each dimension i of x using the information of the dimensions 1..i 1 of itself, hence the name auto regressive:

ui = τθ (xi; hi) , where hi = ci (x<i) (11)

Here the diffeomorphism τ is referred to as the transformer, and the function ci is called the conditioner, which encodes the information from the dimensions < i of x and defines parameters for the transformer. Since ui only depends on x i, the Jacobian matrix Jτ of partial derivatives is lower triangular, so the modeled log density is simply inferred using the change of variables rule as

ln pθ (x) = ln pθ (u) +

i=1 ln ui xi (u) (12)

Furthermore, fitting pθ (x) to p (x) involves maximizing the expected likelihood with respect to the parameter θ:

θ = arg max θ Ep(x) [ln pθ (x)] (13)

where the expectation can be estimated with the sample mean of empirical data. Going back to our problem, for example, to build the conditional diffeomorphism x = τθ (x; z) with a standard Gaussian distribution PX , we design the base distribution, transformer, and conditioner as follows:

Base distribution. First we choose PX to be a ddimensional standard uniform distribution U (0, 1)d, since its density is constant everywhere, so no computation is required for ln pθ (x ) nor its derivatives. Next, to transform PX to the desired standard Gaussian N (0; 1d X), we simply apply the inverse of the cumulative distribution function (CDF) of the standard Gaussian (Φ 1) to x in an element-wise fashion, where Φ (u) = R u N (t; 0, 1) dt.

Transformer. With the base distribution being a standard uniform distribution, a natural choice for the transformer is the CDF of some densities, which is uniformly distributed for any strictly positive density. To make the transformation more expressive, we compose the transformer as a weighted combination of different CDFs parametrized by the conditioner. More particularly, the transformer for the i-th dimension is given by

τ (xi; hi) =

j=1 wij (hi) Φ xi; µij (hi) , σ2 ij (hi) (14)

where we have used a mixture of k CDF components, with the j-th component being a Gaussian CDF with mean µij (hi), variance σ2 ij (hi), and positive weight wij (hi) such that Pk j=1 wij ( ) = 1. This transformer is also a universal approximator for CDFs because its derivative is essentially a Gaussian mixture model (GMM), a canonical universal approximator of densities (Goodfellow, Bengio, and Courville 2016). Put simply, with a sufficient number of Gaussian components, τ (xi; hi) can express any strictly monotonic R (0, 1) map (hence invertible) with an arbitrary accuracy, which is followed by the broad expressiveness of the transformer.

Conditioner. Since we would like to model the conditional diffeomorphism τ (x; z), the conditioner function ci must encode both x<i and z, so instead of hi = ci (x<i), now we let

hi = ci (x<i, z) (15)

In contrary to the transformer, the conditioner needs not to be invertible, so we can freely model it using any family of functions with inputs x<i and z.

Neural Network Parametrization. To maximize the expressivity power of autoregressive flows explained earlier, we parametrize all functional components in Eqn. (14) and Eqn. (15) with neural networks for each dimension, hence the term Neural in DINE. More specifically, let d H be the dimension of H, we model wi : Rd H (0, 1)k as a Multiple Layer Perceptron (MLP) with Softmax outputs, while ci : Ri 1+d Z Rd H, µi : Rd H Rk and ln σ2 i : Rd H Rk are real-valued MLPs for all i = 1..d X. Since the Jacobian matrix Jτ are now differentiable with respect to the parameter θ, any gradient-based continuous optimization framework can be applied to learn θ .

CMI Estimation

Having the ingredients above ready, we can now proceed to define the DINE estimator for CMI:

Definition 6. (Diffeomorphic Information Neural Estimator (DINE)). Consider a DINA approximator DΘ with parameters in a compact domain Θ. DINE is defined as

In (X, Y |Z) = Ep(n)(x,y,z)

ln p(n) (x , y ) p(n) (x ) p(n) (y )

with x = τθ X (x; z) and y = τθ Y (y; z). where the parameter θ = (θ X, θ Y , θ XY ) Θ are Maximum Likelihood Estimators (MLE) of the empirical densities p(n) (x, y|z), p(n) (x|z), and p(n) (y|z):

θ X = arg max θX Ep(n)(x,z) [ln pθ (x|z)] (17)

θ Y = arg max θY Ep(n)(y,z) [ln pθ (y|z)] (18)

θ XY = arg max θXY Ep(n)(x,y,z) [ln pθ (x, y|z)] (19)

Note that θX and θY here denote the parameters of the normalizing flows and the marginal densities pθ (x ) and pθ (y ), while θXY is the parameter of the joint density pθ (x , y ), which is constrained to have marginals pθ X (x ) and pθ Y (y ). Specifically, under the case of multivariate Gaussian base, DINE can be written as

Definition 7. (DINE-Gaussian). If Pθ (X , Y ) is multivariate Gaussian, then the DINE estimator with Gaussian base is defined as

IN n (X, Y |Z) = 1

2 ln det Σn (X ) det Σn (Y )

det Σn (X Y ) (20)

We emphasize that in contrary with the general DINE estimator defined in Eqn. (16), the DINE-Gaussian estimator does not require explicit density evaluations but instead leverages the log determinants of sample covariance matrices that offers a lower estimation variance, and thus makes it the preferable estimator in practice.

Theoretical Properties In this Section we state some important theoretical results regarding the DINE estimator, including estimation variance, consistency, and sample complexity. Proof sketches for these results are provided in the Supplementary Material (Duong and Nguyen 2022b).

Variance Lemma 3. (Variance of DINE-Gaussian). The asymptotic variance of the DINE-Gaussian estimator is given by

Var IN n L O d

, as n (21)

with d being the dimensionality.

Consistency The quality of DINE depends on the choice of (i) a family of normalizing flows and (ii) n i.i.d. samples from the true distribution PXY Z. The following Lemma states that, given a sufficiently expressive DINA, we can approximate the information measure to arbitrary accuracy.

Lemma 4. (Approximability of DINA). For any ϵ > 0, there exists a DINA DΘ with some compact domain Θ Rc such that

|I (X, Y |Z) IΘ (X, Y |Z)| ϵ, almost surely (22)

The next Lemma declares that the estimator almost surely converges to the approximator as the sample size approaches infinity.

Lemma 5. (Estimability of DINE). For any ϵ > 0, given a DINA DΘ with parameters in some compact domain Θ Rc, there exists a N N such that

n N, |In (X, Y |Z) IΘ (X, Y |Z)| ϵ, almost surely (23)

Finally, the two Lemmas above together prove the consistency of DINE:

Theorem 1. (Consistency of DINE). DINE is consistent whenever DINA is sufficiently expressive.

Sample Complexity We make the following assumptions: the log-densities are bounded in [ M, M] and L-Lipschitz continuous with respect to the parameters θ, and the parameter domain Θ Rc is bounded with θ K.

Theorem 2. (Sample complexity of general DINE). Given any accuracy and confidence parameters ϵ, δ > 0, the following holds with probability at least 1 δ

|I (X, Y |Z) In (X, Y |Z)| < ϵ (24)

whenever the sample size n suffices at least

c ln 96KL c

I(X, Y ) in nats

n = 200, d X = d Y = 2

n = 200, d X = d Y = 20

1.0 0.5 0.0 0.5 1.0 ρ

I(X, Y ) in nats

n = 1000, d X = d Y = 2

1.0 0.5 0.0 0.5 1.0 ρ

n = 1000, d X = d Y = 20

True MI MIND MINE DINE (Ours)

Figure 1: Mutual Information estimation performance. We compare the proposed DINE estimator with MINE (Belghazi et al. 2018) and MIND (Samo 2021). Rows: sample sizes, columns: dimensionalities. The dashed line denotes the true MI and the other lines show the averaged estimations for each method over 50 independent runs. The shaded areas show the estimated 95% confidence intervals.

Empirical Evaluations

In what follows, we illustrate that DINE-Gaussian (for brevity we refer to it as just DINE from now on) is far more effective than the alternative MI and CMI estimators in both sample size and dimensionality, especially when the actual information measure is high. Implementation details and parameters selection of all methods are given in the Supplementary Material (Duong and Nguyen 2022b).

Synthetic Data

We consider a diverse set of simulated scenarios covering different degrees of non-linear dependency, sample size, and dimensionality settings. For each independent simulation, we first generate two jointly multivariate Gaussian variables X , Y with same dimensions d X = d Y = d and shared component-wise correlation, i.e.,

(X , Y ) N 0; Id ρId ρId Id

with a correlation

ρ ( 1, 1). As for Z, we randomly choose one of three distributions U ( 0.01, 0.01)d Z, N (0; 0.01Id Z), and Laplace (0; 0.01Id Z). Then, X and Y are defined as

X = f (AZ + X ) (26)

Y = g (BZ + Y ) (27)

I(X, Y |Z) in nats

n = 200, d X = d Y = d Z = 2

n = 200, d X = d Y = d Z = 20

1.0 0.5 0.0 0.5 1.0 ρ

I(X, Y |Z) in nats

n = 1000, d X = d Y = d Z = 2

1.0 0.5 0.0 0.5 1.0 ρ

n = 1000, d X = d Y = d Z = 20

True CMI CCMI KSG DINE (Ours)

Figure 2: Conditional Mutual Information estimation performance. We compare the proposed DINE estimator with CCMI (Mukherjee, Asnani, and Kannan 2020) and KSG (Kraskov, Stögbauer, and Grassberger 2004). Rows: sample sizes, columns: dimensionalities. The dashed line denotes the true CMI and the other lines show the averaged estimations for each method over 50 independent runs. The shaded areas show the estimated 95% confidence intervals.

where A, B Rd d Z have independent entries drawn from N (0; 1), and f, g are randomly chosen from a rich set of mostly non-linear bijective functions f (x) n αx, x3, e x, 1

x, ln x, 1 1+e x o .2

By construction, we have the ground truth CMI I (X, Y |Z) = I (X , Y ) = d

2 ln 1 ρ2 . Finally, n i.i.d. samples x(i), y(i), z(i) n i=1 are generated accordingly.

Comparison with MI Estimators We compare DINE with two state-of-the-arts, the variational MI lower bound estimator MINE3 (Belghazi et al. 2018) and the inductive copula-based MIND4 method (Samo 2021), which are two of the best approaches focusing solely on MI estimation. In this setting, we let Z be empty, i.e., d Z = 0, and vary the correlation ρ in [ 0.99, 0.99]. We consider both

2We scale and translate the inputs before feeding into the functions to ensure numerical stability, e.g. 1 x 1 x min(x)+1,

ln (x) ln (x min (x) + 1), and e x e

x std(x) . 3We adopt the implementation at https://github.com/karlstratos/ doe 4We use the author s implementation https://github.com/ kxytechnologies/kxy-python/

the low sample size n = 200 and the large sample size n = 1000, as well as the low-dimensional d = 2 and highdimensional d = 20 settings. For each of the setting combinations, we evaluate the methods using the same 50 independent synthetic data sets according to the described simulation scheme. The empirical results are recorded in Figure 1. We observe that for all scenarios, our DINE method produces nearly identical estimates with the ground truth, regardless of sample size or dimensionality, with clear distinctions from MINE and MIND. Under the most limited setting of low sample size and high dimensionality (top-right), DINE estimates are still remarkably close the the ground truth, while MINE and MIND visibly struggles when the ground truth mutual information is high. On the other hand, for the most favorable setting of large sample size and low dimensionality (bottom-left), DINE estimates approach the ground truth with a nearly invisible margin, whereas the error gaps of MINE and MIND estimates are clearly distinguishable.

Comparison with CMI Estimators

For this context, we compare DINE with the state-of-theart classifier based estimator CCMI5 (Mukherjee, Asnani, and Kannan 2020) and the popular k-NN based estimator KSG (Kraskov, Stögbauer, and Grassberger 2004). The experiment setup follows closely to the MI estimation experiment, except that now we let d Z = d X = d Y . Figure 2 captures the results. We can see that, compared to the MI estimation setting, the conditioning variable Z degrades the performance of DINE, however only for high ground truth values and not at a considerable magnitude. Meanwhile, the competitors CCMI and KSG do not adapt well to the high dimensional setting when d X = d Y = d Z = 20. Yet, for the low dimensional case, they still perform poorly relative to our DINE approach, especially when the underlying CMI is high.

Application in Conditional Independence Testing

Among the broad range of applications of CMI estimation, the Conditional Independence (CI) test is perhaps one of the most desired. CI testing greatly benefits the field of Causal Discovery (Spirtes et al. 2000). Therefore, in this Section we illustrate that our approach may be used to construct a CI test that strongly outperforms other competitive baselines solely designed for the same goal, as a down-stream evaluation of DINE. Resultantly, the test can be expected to improve Causal Discovery methods significantly.

Formally, CI testing concerns with the statistical hypothesis test with

5We use the implementation from the authors at https://github. com/sudiptodip15/CCMI

5 10 15 20 Dimensionality of Z

5 10 15 20 Dimensionality of Z

5 10 15 20 Dimensionality of Z

Type I error ( )

5 10 15 20 Dimensionality of Z

Type II error ( )

CCIT KCIT DINE (Ours)

Figure 3: Conditional Independence Testing performance as a function of dimensionality. The evaluation metrics are F1 score, AUC (higher is better), Type I and Type II error rates (lower is better), evaluated on 200 independent runs. We compare the proposed DINE-based CI test with KCIT (Zhang et al. 2012) and CCIT (Sen et al. 2017).

H0 : X Y | Z (28) H1 : X Y | Z (29)

Related works In the context of CI testing, kernelbased approaches (Zhang et al. 2012), are generally the most popular and powerful methods for CI testing, which adopt kernels to exploit high order statistics that capture the CI structure of the data. Recently, more modern approaches have also been proposed, such as GAN-based (Shi et al. 2021) classification-based (Sen et al. 2017), or latent representation-based (Duong and Nguyen 2022a) methods with promising results.

Description of the test We design a simple DINE-based CI test inspired by the observation that X Y | Z I (X , Y ) = 0 and use I (X , Y ) as the test statistics. Next, we employ permutation-based boostrapping to simulate the null distribution of the test statistics and estimate the p-value. Finally, given a user-defined significance level α, we reject the null hypothesis H0 if p-value < α and accept it otherwise. The implementation details and parameters of the test are given in the Supplementary Material (Duong and Nguyen 2022b).

Experiments To numerically evaluate the quality of the aforementioned DINE-based CI test, we compare it with the prominent kernel-based test KCIT6 (Zhang et al. 2012) and a more recent state-of-the-art classifier-based test CCIT7 (Sen et al. 2017). In this experiment, we fix d X = d Y = 1 and consider d Z increasing from low to high dimensionalities in [5, 20], with a constant sample size n = 1000, and compare the performance of DINE against the baselines, assessed under four different criterions, namely the F1 score, AUC (higher is better), Type I, and Type II error rates (lower is better). These metrics are evaluated using 200 independent runs (100 runs for each label) for each combination of method and dimensionality of Z. Additionally, for F1 score, Type I, and Type II

6The implementation from the causal-learn package is adopted https://github.com/cmu-phil/causal-learn 7The authors implementation can be found at https://github. com/rajatsen91/CCIT

errors, we adopt the common significance level of α = 0.05. Furthermore, for the case of conditional independence we let ρ = 0, whereas for the conditional dependence case we randomly draw ρ U ([ 0.99, 0.1] [0.1, 0.99]). The data is generated according to the CMI experiment in the previous Section. The numerical comparisons in CI testing are presented in Figure 3, which show that the DINE-based CI test obtains very good scores under all performance metrics. More particularly, it nearly never makes any Type II error, meaning when the relationship is actually conditional dependence, the CMI estimate is rarely too low to be misclassified as conditional independence; meanwhile, its Type I errors are roughly proximate to the rejection threshold α, which is expected from the definition of p-value. Moreover, the F1 and AUC scores of DINE are also highest in all cases and closely approach 100%, suggesting the superior adaptability of DINE to both non-linearity and higher dimensionalities. Regarding the baseline methods, KCIT and CCIT show completely opposite behaviors to each other. While KCIT has relatively low Type II errors, its Type I errors are quite high even at lower dimensionalities and increase rapidly in the increment dimensionality. Conversely, CCIT is quite conservative in Type I error as a trade-off for the consistently high Type II errors. However, their AUC scores are still high, indicating that the optimal threshold exists, but their p-value estimates do not reflect accurately the true p-value.

In this paper we propose DINE, a novel approach for CMI estimation. Through the use of normalizing flows, we simplify the challenging CMI estimation problem into the easier MI estimation, which can be designed to be efficiently evaluable, overcoming the inherent difficulties in existing approaches. We compare DINE with best-in-class methods for MI estimation, CMI estimation, in which DINE shows considerably better performance as compared to its counterparts, as well as being friendly in sample size and dimensionality while adapting well to several non-linear relationships. Finally, we show that DINE can also be used to define a CI test with an improved effectiveness in comparison with state-of-the-art CI tests, thanks to its accurate CMI estimability.

References Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; and Hjelm, D. 2018. Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, 531 540. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Info GAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems. Duong, B.; and Nguyen, T. 2022a. Conditional Independence Testing via Latent Representation Learning. ar Xiv preprint ar Xiv:2209.01547. Duong, B.; and Nguyen, T. 2022b. Diffeomorphic Information Neural Estimation. ar Xiv preprint ar Xiv:2211.10856. Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press. Kobyzev, I.; Prince, S. J.; and Brubaker, M. A. 2020. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3964 3979. Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E. Kwak, N.; and Choi, C.-H. 2002. Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1667 1671. Magerman, D. M.; and Marcus, M. P. 1990. Parsing a Natural Language Using Mutual Information Statistics. In AAAI, 984 989. Mc Allester, D.; and Stratos, K. 2020. Formal limitations on the measurement of mutual information. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 875 884. Molavipour, S.; Bassi, G.; and Skoglund, M. 2021. Neural Estimators for Conditional Mutual Information Using Nearest Neighbors Sampling. IEEE Transactions on Signal Processing, 766 780. Mukherjee, S.; Asnani, H.; and Kannan, S. 2020. CCMI: Classifier based conditional mutual information estimation. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1083 1093. Noshad, M.; Zeng, Y.; and Hero, A. O. 2019. Scalable mutual information estimation using dependence graphs. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2962 2966. Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748. Paninski, L. 2003. Estimation of entropy and mutual information. Neural Computation, 1191 1253. Papamakarios, G.; Nalisnick, E. T.; Rezende, D. J.; Mohamed, S.; and Lakshminarayanan, B. 2021. Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning Research, 1 64.

Peng, H.; Long, F.; and Ding, C. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1226 1238. Runge, J. 2018. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 938 947. Samo, Y.-L. K. 2021. Inductive mutual information estimation: A convex maximum-entropy copula approach. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2242 2250. Sen, R.; Suresh, A. T.; Shanmugam, K.; Dimakis, A. G.; and Shakkottai, S. 2017. Model-powered conditional independence test. In Advances in Neural Information Processing Systems. Shi, C.; Xu, T.; Bergsma, W.; and Li, L. 2021. Double generative adversarial networks for conditional independence testing. Journal of Machine Learning Research, 1 32. Spirtes, P.; Glymour, C. N.; Scheines, R.; and Heckerman, D. 2000. Causation, prediction, and search. MIT Press. Zhang, K.; Peters, J.; Janzing, D.; and Schölkopf, B. 2012. Kernel-based conditional independence test and application in causal discovery. ar Xiv preprint ar Xiv:1202.3775.