# Nonparanormal Information Estimation

Shashank Singh and Barnabás Póczos

Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Shashank Singh.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

## Abstract

We study the problem of using i.i.d. samples from an unknown multivariate probability distribution p to estimate the mutual information of p. This problem has recently received attention in two settings: (1) where p is assumed to be Gaussian and (2) where p is assumed only to lie in a large nonparametric smoothness class. Estimators proposed for the Gaussian case converge in high dimensions when the Gaussian assumption holds, but are brittle, failing dramatically when p is not Gaussian, while estimators proposed for the nonparametric case fail to converge with realistic sample sizes except in very low dimensions. Hence, there is a lack of robust mutual information estimators for many realistic data. To address this, we propose estimators for mutual information when p is assumed to be a nonparanormal (or Gaussian copula) model, a semiparametric compromise between the Gaussian and nonparametric extremes. Using theoretical bounds and experiments, we show these estimators strike a practical balance between robustness and scalability.

## 1. Introduction

This paper is concerned with the problem of estimating the entropy or mutual information of an unknown probability density p over $\mathbb{R}^D$, given n i.i.d. samples from p. Entropy and mutual information are fundamental information-theoretic quantities, and consistent estimators for these quantities have a host of applications within machine learning, statistics, and signal processing. For example, entropy estimators have been used for goodness-of-fit testing (Goria et al., 2005), parameter estimation in semi-parametric models (Wolsztynski et al., 2005), texture classification and image registration (Hero et al., 2001; 2002), change point detection (Bercher & Vignat, 2000), and anomaly detection in networks (Noble & Cook, 2003; Nychis et al., 2008; Bereziński et al., 2015). Mutual information is a popular nonparametric measure of dependence, whose estimators have been used in feature selection (Peng et al., 2005; Shishkin et al., 2016), clustering (Aghagolzadeh et al., 2007), learning graphical models (Chow & Liu, 1968), fMRI data processing (Chai et al., 2009), prediction of protein structures (Adami, 2004), boosting and facial expression recognition (Shan et al., 2005), and fitting deep nonlinear models (Hunter & Hodas, 2016). Estimators for both entropy and mutual information have been used in independent component and subspace analysis (Learned-Miller & Fisher, 2003; Szabó et al., 2007a).

Motivated by these and other applications, several very recent lines of work (discussed in Section 3) have studied information estimation (we collectively call the closely related problems of entropy and mutual information estimation "information estimation"), focusing largely on two settings:

1. Gaussian Setting: If p is known to be Gaussian, there exist information estimators with mean squared error (MSE) at most $-2\log(1 - D/n)$ and an (almost matching) minimax lower bound of $2D/n$ (Cai et al., 2015).

2. Nonparametric Setting: If p is assumed to lie in a nonparametric smoothness class, such as an s-order Hölder or Sobolev class (where s encodes the degree of smoothness, roughly corresponding to the number of continuous derivatives of p), then the minimax MSE is of asymptotic order $\max\left\{n^{-1}, n^{-\frac{8s}{4s+D}}\right\}$ (Birgé & Massart, 1995).
In the Gaussian setting, consistent estimation is tractable even in the high-dimensional case where D increases fairly quickly with n, as long as $D/n \to 0$. However, optimal estimators for the Gaussian setting rely heavily on the assumption of joint Gaussianity, and their performance can degrade quickly when the data deviate from Gaussian. Especially in high dimensions, it is unlikely that data are jointly Gaussian, making these estimators brittle in practice. In the nonparametric setting, the theoretical convergence rate decays exponentially with D, and it has been found empirically that information estimators for this setting fail to converge at realistic sample sizes in all but very low dimensions. Also, most nonparametric estimators are sensitive to tuning of bandwidth parameters, which is challenging for information estimation, since no empirical error estimate is available for cross-validation. Given these factors, though the Gaussian and nonparametric cases are fairly well understood in theory, there remains a lack of practical information estimators for the common case where data are neither exactly Gaussian nor very low dimensional.

The main goal of this paper is to fill the gap between these two extreme settings by studying information estimation in a semiparametric compromise between the two, known as the nonparanormal (a.k.a. Gaussian copula) model (see Definition 2). The nonparanormal model, analogous to the additive model popular in regression (Friedman & Stuetzle, 1981), limits the complexity of interactions among variables but makes minimal assumptions on the marginal distribution of each variable. The result scales better with dimension than nonparametric models, while being more robust than Gaussian models.

Proofs of our main results, as well as additional experiments and details, are given in the extended version of this paper, available at http://www.contrib.andrew.cmu.edu/~sss1/publications/papers/nonparanormal-information-estimation.pdf. All code can be found on GitHub at https://github.com/sss1/nonparanormal-information.

## 2. Problem statement and notation

There are a number of distinct generalizations of mutual information to more than two variables. The definition we consider is simply the difference between the sum of marginal entropies and the joint entropy:

Definition 1. (Multivariate mutual information) Let $X_1, \ldots, X_D$ be $\mathbb{R}$-valued random variables with a joint probability density $p : \mathbb{R}^D \to [0, \infty)$ and marginal densities $p_1, \ldots, p_D : \mathbb{R} \to [0, \infty)$. The multivariate mutual information $I(X)$ of $X = (X_1, \ldots, X_D)$ is defined by

$$I(X) := \mathbb{E}_{X \sim p}\left[\log \frac{p(X)}{\prod_{j=1}^D p_j(X_j)}\right] = \sum_{j=1}^D H(X_j) - H(X), \qquad (1)$$

where $H(X) = \mathbb{E}_{X \sim p}[-\log p(X)]$ denotes the entropy of X.
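As a concrete instance of Definition 1 (a standard worked example, not spelled out in the original text, but consistent with Lemma 3 and Proposition 10 below): for D = 2 and $(X_1, X_2)$ jointly Gaussian with unit variances and correlation σ,

$$H(X_1) = H(X_2) = \tfrac{1}{2}\log(2\pi e), \qquad H(X) = \tfrac{1}{2}\log\left((2\pi e)^2(1 - \sigma^2)\right),$$

so Eq. (1) gives $I(X) = -\tfrac{1}{2}\log(1 - \sigma^2)$, which vanishes when σ = 0 and grows without bound as $|\sigma| \to 1$.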
This notion of multivariate mutual information, originally due to Watanabe (1960) (who called it "total correlation"), measures the total dependency, or redundancy, within a set of D random variables. It has also been called the multivariate constraint (Garner, 1962) and multiinformation (Studený & Vejnarová, 1998). Many related information-theoretic quantities can be expressed in terms of I(X), and can thus be estimated using estimators of I(X). Examples include pairwise mutual information

$$I(X; Y) = I((X, Y)) - I(X) - I(Y),$$

which measures dependence between (potentially multivariate) random variables X and Y; conditional mutual information

$$I(X \mid Z) = I((X, Z)) - \sum_{j=1}^D I((X_j, Z)),$$

which is useful for characterizing how much of the dependence within X can be explained by a latent variable Z (Studený & Vejnarová, 1998); and transfer entropy (a.k.a. directed information) $T_{X \to Y}$, which measures the predictive power of one time series X on the future of another time series Y. I(X) is also related to entropy via Eq. (1), but, unlike the above quantities, this relationship depends on the marginal distributions of X, and hence involves some additional considerations, as discussed in Section 8.

We now define the class of nonparanormal distributions, from which we assume our data are drawn.

Definition 2. (Nonparanormal distribution, a.k.a. Gaussian copula model) A random vector $X = (X_1, \ldots, X_D)^T$ is said to have a nonparanormal distribution (denoted $X \sim NPN(\Sigma; f)$) if there exist functions $\{f_j\}_{j=1}^D$ such that each $f_j : \mathbb{R} \to \mathbb{R}$ is a diffeomorphism (a continuously differentiable bijection $g : \mathbb{R} \to \mathbb{R}$ such that $g^{-1}$ is also continuously differentiable) and $f(X) \sim \mathcal{N}(0, \Sigma)$, for some (strictly) positive definite $\Sigma \in \mathbb{R}^{D \times D}$ with 1s on the diagonal (i.e., each $\sigma_j = \Sigma_{j,j} = 1$). Σ is called the latent covariance of X and f is called the marginal transformation of X. (Setting $\mathbb{E}[f(X)] = 0$ and each $\sigma_j = 1$ ensures model identifiability, but does not reduce the model space, since these parameters can be absorbed into the marginal transformation f.)

The nonparanormal family relaxes many constraints of the Gaussian family. Nonparanormal distributions can be multi-modal, skewed, or heavy-tailed, can encode noisy nonlinear dependencies, and need not be supported on all of $\mathbb{R}^D$. Minimal assumptions are made on the marginal distributions; any desired continuously differentiable marginal cumulative distribution function (CDF) $F_i$ of variable $X_i$ corresponds to the marginal transformation $f_i(x) = \Phi^{-1}(F_i(x))$ (where Φ is the standard normal CDF). As an example, in the 2-dimensional case with a Gaussian variable Z and $X_1 \sim \mathcal{N}(0, 1)$, the pair $(X_1, X_2)$ with $X_2 = T(X_1 + Z)$ is nonparanormal when $T(x) = x^3$, $T = \tanh$, $T = \Phi$, or any other diffeomorphism T. On the other hand, the limits of the Gaussian copula appear, for example, when $T(x) = x^2$, which is not bijective; then, if $\mathbb{E}[Z] = 0$, the Gaussian copula approximation of $(X_1, X_2)$ models $X_1$ and $X_2$ as independent.

We are now ready to formally state our problem:

Formal Problem Statement: Given n i.i.d. samples $X_1, \ldots, X_n \sim NPN(\Sigma; f)$, where Σ and f are both unknown, we would like to estimate I(X).

Other notation: D denotes the dimension of the data (i.e., $\Sigma \in \mathbb{R}^{D \times D}$ and $f : \mathbb{R}^D \to \mathbb{R}^D$). For a positive integer k, $[k] = \{1, \ldots, k\}$ denotes the set of positive integers at most k. For consistency, where possible, we use $i \in [n]$ to index samples and $j \in [D]$ to index dimensions (so that, e.g., $X_{i,j}$ denotes the jth dimension of the ith sample). Given a data matrix $X \in \mathbb{R}^{n \times D}$, our estimators depend on the empirical rank matrix $R \in [n]^{n \times D}$ with

$$R_{i,j} := \sum_{k=1}^n \mathbf{1}\{X_{k,j} \le X_{i,j}\}. \qquad (2)$$

For a square matrix $A \in \mathbb{R}^{k \times k}$, $|A|$ denotes the determinant of A, $A^T$ denotes the transpose of A, and

$$\|A\|_2 := \max_{0 \ne x \in \mathbb{R}^k} \frac{\|Ax\|_2}{\|x\|_2} \quad \text{and} \quad \|A\|_F := \sqrt{\sum_{i,j \in [k]} A_{i,j}^2}$$

denote the spectral and Frobenius norms of A, respectively. When A is symmetric, $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_k(A)$ are its eigenvalues.
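To make Definition 2 and the rank matrix of Eq. (2) concrete, the following minimal Python sketch (our illustration, not the authors' released MATLAB code) draws nonparanormal samples by pushing a latent Gaussian through an inverse marginal transformation and then computes R. The choice $f_j^{-1} = \exp$ is an arbitrary example (the same transformation reappears in Experiment 2), and ties are ignored since the marginals are continuous.

```python
import numpy as np

def sample_npn(n, Sigma, marginal_inverse, rng):
    """Draw n samples X ~ NPN(Sigma; f) by sampling the latent Gaussian
    Z ~ N(0, Sigma) and applying f_j^{-1} coordinate-wise (Definition 2)."""
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    return marginal_inverse(Z)

def rank_matrix(X):
    """Empirical rank matrix of Eq. (2): R[i, j] = #{k : X[k, j] <= X[i, j]}."""
    # argsort of argsort gives 0-based ranks within each column; add 1 so ranks lie in [n]
    return np.argsort(np.argsort(X, axis=0), axis=0) + 1

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
X = sample_npn(1000, Sigma, np.exp, rng)  # here f_j = log, so f_j^{-1} = exp
R = rank_matrix(X)                        # ranks are invariant to each f_j
```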
## 3. Related Work and Our Contributions

### 3.1. The Nonparanormal

Nonparanormal models have been used for modeling dependencies among high-dimensional data in a number of fields, including graphical modeling of gene expression data (Liu et al., 2012) and of neural data (Berkes et al., 2009), modeling of financial time series (Malevergne & Sornette, 2003; Wilson & Ghahramani, 2010; Hernández-Lobato et al., 2013), extreme value analysis in hydrology (Renard & Lang, 2007; Aghakouchak, 2014), and informative data compression (Rey & Roth, 2012). With one recent exception (Ince et al., 2017), previous information estimators for the nonparanormal case (Calsaverini & Vicente, 2009; Ma & Sun, 2011; Elidan, 2013) rely on fully nonparametric information estimators as subroutines, and hence suffer strongly from the curse of dimensionality. Very recently, Ince et al. (2017) proposed what we believe is the first mutual information estimator tailored specifically to the nonparanormal case; their estimator is equivalent to one of the estimators ($\hat I_G$, described in Section 4.1) we study. However, they focused on its applications to neuroimaging data analysis, and did not study its performance theoretically or empirically.

### 3.2. Information Estimation

Our motivation for studying the nonparanormal family comes from trying to bridge two recent approaches to information estimation. The first has studied fully nonparametric entropy estimation, assuming only that data are drawn from a smooth probability density p, where smoothness is typically quantified by a Hölder or Sobolev exponent $s \in (0, \infty)$, roughly corresponding to the number of continuous derivatives of p. In this setting, the minimax optimal MSE rate has been shown by Birgé & Massart (1995) to be of order $\max\left\{n^{-1}, n^{-\frac{8s}{4s+D}}\right\}$. This rate slows exponentially with the dimension D, and, while many estimators have been proposed for this setting (Pál et al., 2010; Sricharan et al., 2011; 2013; Singh & Póczos, 2014a;b; Krishnamurthy et al., 2014; Moon & Hero, 2014b;a; Singh & Póczos, 2016a; Moon et al., 2017), their practical use is limited to a few dimensions. (How few depends on s and n, but Kandasamy et al. (2015) suggest nonparametric estimators should only be used with D at most 4-6. Rey & Roth (2012) tried using several nonparametric information estimators on the Communities and Crime UCI data set (n = 2195, D = 10), but found all too unstable to be useful.)

The second line of work considers the setting where data are assumed to be drawn from a truly Gaussian distribution. Here the high-dimensional case is far more optimistic. While this case had been studied previously (Ahmed & Gokhale, 1989; Misra et al., 2005; Srivastava & Gupta, 2008), Cai et al. (2015) recently provided a precise finite-sample analysis based on deriving the exact probability law of the log-determinant $\log|\hat\Sigma|$ of the scatter matrix $\hat\Sigma$. From this, they derived a deterministic bias correction, giving an estimator for which they prove an MSE upper bound of $-2\log(1 - D/n)$ and a high-dimensional central limit theorem for the case $D \to \infty$ as $n \to \infty$ (but D < n). Cai et al. (2015) also prove a minimax lower bound of $2D/n$ on the MSE, with several interesting consequences. First, consistent information estimation is possible only if $D/n \to 0$. Second, since, for small x, $-\log(1 - x) \approx x$, this lower bound essentially matches the above upper bound when D/n is small. Third, they show this lower bound holds even when restricted to diagonal covariance matrices. Since the upper bound for the general case and the lower bound for the diagonal case essentially match, it follows that Gaussian information estimation is not made easier by structural assumptions such as Σ being bandable, sparse, or Toeplitz, as is common in, for example, stationary Gaussian process models (Cai & Yuan, 2012). This 2D/n lower bound extends to our more general nonparanormal setting.
However, we provide a minimax lower bound suggesting that the nonparanormal setting is strictly harder, in that optimal rates depend on Σ. Our results imply that nonparanormal information estimation does become easier if Σ is assumed to be bandable or Toeplitz.

A closely related point is that known convergence rates for the fully nonparametric case require the density p to be bounded away from 0 or have particular tail behavior, due to the singularity of the logarithm near 0 and the resulting sensitivity of Shannon information-theoretic functionals to regions of low but non-zero probability. In contrast, Cai et al. (2015) need no lower-bound-type assumptions in the Gaussian case. In the nonparanormal case, we show some such condition is needed to prove a uniform rate, but a weaker condition, a positive lower bound on $\lambda_D(\Sigma)$, suffices.

The main contributions of this paper are the following:

1. We propose three estimators, $\hat I_G$, $\hat I_\rho$, and $\hat I_\tau$, for the mutual information of a nonparanormal distribution. (Ince et al. (2017) proposed $\hat I_G$ for use in neuroimaging data analysis; to the best of our knowledge, $\hat I_\rho$ and $\hat I_\tau$ are novel.)

2. We prove upper bounds, of order $O(D^2/(\lambda_D^2(\Sigma) n))$, on the mean squared error of $\hat I_\rho$, providing the first upper bounds for a nonparanormal information estimator. This bound suggests nonparanormal estimators scale far better with D than nonparametric estimators.

3. We prove a minimax lower bound suggesting that, unlike the Gaussian case, the difficulty of nonparanormal information estimation depends on the true Σ.

4. We give simulations comparing our proposed estimators to Gaussian and nonparametric estimators. Besides confirming and augmenting our theoretical predictions, these help characterize the settings in which each nonparanormal estimator works best.

5. We present entropy estimators based on $\hat I_G$, $\hat I_\rho$, and $\hat I_\tau$. Though nonparanormal entropy estimation requires somewhat different assumptions from mutual information estimation, we show that entropy can also be estimated at the rate $O(D^2/(\lambda_D^2(\Sigma) n))$.

## 4. Nonparanormal Information Estimators

In this section, we present three different estimators, $\hat I_G$, $\hat I_\rho$, and $\hat I_\tau$, for the mutual information of a nonparanormal distribution. We begin with a lemma providing common motivation for all three estimators. Since mutual information is invariant to diffeomorphisms of individual variables, it is easy to see that the mutual information of a nonparanormal random variable is the same as that of the latent Gaussian random variable. Specifically:

Lemma 3. (Nonparanormal mutual information): Suppose $X \sim NPN(\Sigma; f)$. Then,

$$I(X) = -\frac{1}{2}\log|\Sigma|. \qquad (3)$$

Lemma 3 shows that the mutual information of a nonparanormal random variable depends only on the latent covariance Σ; the marginal transformations are nuisance parameters, allowing us to avoid difficult nonparametric estimation. The estimators we propose all plug different estimates of Σ into Eq. (3), after a regularization step described in Section 4.3.

### 4.1. Estimating Σ by Gaussianization

The first estimator $\hat\Sigma_G$ of Σ proceeds in two steps.
First, the data are transformed to have approximately standard normal marginal distributions, a process Szabó et al. (2007b) referred to as Gaussianization. By the nonparanormal assumption, the Gaussianized data are approximately jointly Gaussian. Then, the latent covariance matrix is estimated by the empirical covariance of the Gaussianized data. More specifically, letting $\Phi^{-1}$ denote the quantile function of the standard normal distribution and recalling the rank matrix R defined in (2), the Gaussianized data

$$\tilde X_{i,j} := \Phi^{-1}\left(\frac{R_{i,j}}{n+1}\right) \qquad (\text{for } i \in [n],\ j \in [D])$$

are obtained by transforming the empirical CDF of each dimension to approximate Φ. Then, we estimate Σ by the empirical covariance $\hat\Sigma_G := \frac{1}{n}\sum_{i=1}^n \tilde X_i \tilde X_i^T$.

### 4.2. Estimating Σ by rank correlation

The second estimator actually has two variants, $\hat I_\rho$ and $\hat I_\tau$, respectively based on relating the latent covariance to two classic rank-based dependence measures, Spearman's ρ and Kendall's τ. For two random variables X and Y with CDFs $F_X, F_Y : \mathbb{R} \to [0, 1]$, ρ and τ are defined by

$$\rho(X, Y) := \operatorname{Corr}(F_X(X), F_Y(Y)) \quad \text{and} \quad \tau(X, Y) := \operatorname{Corr}(\operatorname{sign}(X - X'), \operatorname{sign}(Y - Y')),$$

respectively, where

$$\operatorname{Corr}(X, Y) = \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$$

denotes the standard Pearson correlation and $(X', Y')$ is an i.i.d. copy of (X, Y). ρ and τ generalize to the D-dimensional setting in the form of rank correlation matrices $\rho, \tau \in [-1, 1]^{D \times D}$ with $\rho_{j,k} = \rho(X_j, X_k)$ and $\tau_{j,k} = \tau(X_j, X_k)$ for each $j, k \in [D]$. $\hat I_\rho$ and $\hat I_\tau$ are based on a classical result relating the correlation and rank-correlation of a bivariate Gaussian:

Theorem 4. (Kruskal, 1958): Suppose (X, Y) has a Gaussian joint distribution with covariance Σ. Then,

$$\operatorname{Corr}(X, Y) = 2\sin\left(\frac{\pi}{6}\rho(X, Y)\right) = \sin\left(\frac{\pi}{2}\tau(X, Y)\right).$$

ρ and τ are often preferred over Pearson correlation for their relative robustness to outliers and their applicability to non-numerical ordinal data. While these are strengths here as well, the main reason for their relevance is that they are invariant to marginal transformations (i.e., for diffeomorphisms $f, g : \mathbb{R} \to \mathbb{R}$, $\rho(f(X), g(Y)) = \rho(X, Y)$ and $\tau(f(X), g(Y)) = \tau(X, Y)$). As a consequence, the identity provided in Theorem 4 extends unchanged to the case $(X, Y) \sim NPN(\Sigma; f)$. This suggests an estimate for Σ based on estimating ρ or τ and plugging the result element-wise into the transform $x \mapsto 2\sin\left(\frac{\pi}{6}x\right)$ or $x \mapsto \sin\left(\frac{\pi}{2}x\right)$, respectively. Specifically, $\hat\Sigma_\rho$ is defined by

$$\hat\Sigma_\rho := 2\sin\left(\frac{\pi}{6}\hat\rho\right),$$

where $\hat\rho = \widehat{\operatorname{Corr}}(R)$ is the empirical correlation matrix of the rank matrix R, and the sine is applied element-wise. Similarly,

$$\hat\Sigma_\tau := \sin\left(\frac{\pi}{2}\hat\tau\right), \quad \text{where} \quad \hat\tau_{j,k} := \binom{n}{2}^{-1}\sum_{1 \le i < \ell \le n} \operatorname{sign}(X_{i,j} - X_{\ell,j})\operatorname{sign}(X_{i,k} - X_{\ell,k}).$$

### 4.3. Regularization and estimating I

Unfortunately, unlike usual empirical correlation matrices, none of $\hat\Sigma_G$, $\hat\Sigma_\rho$, or $\hat\Sigma_\tau$ is almost surely strictly positive definite. As a result, directly plugging into the mutual information functional (3) may give ∞ or be undefined. To correct for this, we propose a regularization step, in which we project each estimated latent covariance matrix onto the (closed) cone S(z) of symmetric matrices with minimum eigenvalue at least z > 0. Specifically, for any z > 0, let

$$S(z) := \left\{A \in \mathbb{R}^{D \times D} : A = A^T,\ \lambda_D(A) \ge z\right\}.$$

For any symmetric matrix $A \in \mathbb{R}^{D \times D}$ with eigendecomposition $A = Q\Lambda Q^T$ (i.e., $QQ^T = Q^TQ = I_D$ and Λ is diagonal), the projection $A_z$ of A onto S(z) is defined as $A_z := Q\Lambda_z Q^T$, where $\Lambda_z$ is the diagonal matrix with jth diagonal entry $(\Lambda_z)_{j,j} = \max\{z, \Lambda_{j,j}\}$. We call this a projection because $A_z = \operatorname{argmin}_{B \in S(z)} \|A - B\|_F$ (see, e.g., Henrion & Malick (2012)). Applying this regularization to $\hat\Sigma_G$, $\hat\Sigma_\rho$, or $\hat\Sigma_\tau$ gives a strictly positive definite estimate $\hat\Sigma_{G,z}$, $\hat\Sigma_{\rho,z}$, or $\hat\Sigma_{\tau,z}$, respectively, of Σ. A compact sketch of these covariance estimates and the projection step is given below.
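The following Python sketch (our illustration rather than the authors' MATLAB implementation) assembles the pieces of Section 4: the three latent-covariance estimates, the eigenvalue projection onto S(z), and the plug-in of Eq. (3). SciPy's `norm.ppf` and `kendalltau` stand in for $\Phi^{-1}$ and $\hat\tau$, and the default z = 10⁻³ copies the choice used in Section 7; these are assumptions of this sketch, not requirements of the method.

```python
import numpy as np
from scipy.stats import norm, kendalltau

def project_to_cone(A, z):
    """Project a symmetric matrix onto S(z) = {A = A^T : lambda_min(A) >= z}
    by flooring its eigenvalues at z (Section 4.3)."""
    eigvals, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.maximum(eigvals, z)) @ Q.T

def latent_covariances(X):
    """Three estimates of the latent covariance Sigma (Sections 4.1-4.2)."""
    n, D = X.shape
    R = np.argsort(np.argsort(X, axis=0), axis=0) + 1   # rank matrix, Eq. (2)

    X_tilde = norm.ppf(R / (n + 1))                     # Gaussianization
    Sigma_G = X_tilde.T @ X_tilde / n

    Sigma_rho = 2 * np.sin(np.pi / 6 * np.corrcoef(R, rowvar=False))

    tau = np.eye(D)
    for j in range(D):
        for k in range(j + 1, D):
            t, _ = kendalltau(X[:, j], X[:, k])         # pairwise Kendall's tau
            tau[j, k] = tau[k, j] = t
    Sigma_tau = np.sin(np.pi / 2 * tau)

    return Sigma_G, Sigma_rho, Sigma_tau

def npn_mutual_information(X, z=1e-3):
    """Plug regularized covariance estimates into Eq. (3): I = -0.5 log|Sigma|."""
    return [-0.5 * np.linalg.slogdet(project_to_cone(S, z))[1]
            for S in latent_covariances(X)]
```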
We can then estimate I by plugging these into Equation (3), giving our three estimators:

$$\hat I_{G,z} := -\frac{1}{2}\log\left|\hat\Sigma_{G,z}\right|, \qquad \hat I_{\rho,z} := -\frac{1}{2}\log\left|\hat\Sigma_{\rho,z}\right|, \qquad \text{and} \qquad \hat I_{\tau,z} := -\frac{1}{2}\log\left|\hat\Sigma_{\tau,z}\right|.$$

## 5. Upper Bounds on the Error of $\hat I_{\rho,z}$

Here, we provide finite-sample upper bounds on the error of the estimator $\hat I_{\rho,z}$ based on Spearman's ρ. Proofs are given in the Appendix. We first bound the bias of $\hat I_{\rho,z}$:

Proposition 5. Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} NPN(\Sigma; f)$. Then, there exists a constant C > 0 such that, for any z > 0, the bias of $\hat I_{\rho,z}$ is at most

$$\left|\mathbb{E}\left[\hat I_{\rho,z}\right] - I\right| \le C\,\frac{D}{z\sqrt{n}} + \frac{1}{2}\log\frac{|\Sigma_z|}{|\Sigma|},$$

where $\Sigma_z$ is the projection of Σ onto S(z).

The first term of the bias stems from the nonlinearity of the log-determinant function in Equation (3), which we analyze via Taylor expansion. The second term is the bias introduced by regularization; it vanishes whenever $z \le \lambda_D(\Sigma)$, since then $\Sigma_z = \Sigma$. We next give an exponential concentration inequality for $\hat I_{\rho,z}$ about its mean:

Proposition 6. Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} NPN(\Sigma; f)$. Then, there exists a universal constant c > 0 such that, for any z, ε > 0,

$$\mathbb{P}\left[\left|\hat I_{\rho,z} - \mathbb{E}\left[\hat I_{\rho,z}\right]\right| > \varepsilon\right] \le 2\exp\left(-\frac{c\, n z^2 \varepsilon^2}{D^2}\right).$$

Such exponential concentration bounds are useful when one wants to simultaneously bound the error of multiple uses of an estimator, and hence we present this result separately, as it may be independently useful. However, for the purpose of understanding convergence rates, we are more interested in the variance bound that follows as an easy corollary:

Corollary 7. Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} NPN(\Sigma; f)$. Then, for any z > 0, the variance of $\hat I_{\rho,z}$ is at most

$$\mathbb{V}\left[\hat I_{\rho,z}\right] \le \frac{36\pi^2 D^2}{z^2 n}.$$

Given these bias and variance bounds, a bound on the MSE of $\hat I_{\rho,z}$ follows via the usual bias-variance decomposition:

Theorem 8. Suppose $X \sim NPN(\Sigma; f)$. Then, there exists a constant C such that

$$\mathbb{E}\left[\left(\hat I_{\rho,z} - I\right)^2\right] \le C\left(\frac{D^2}{z^2 n} + \log^2\frac{|\Sigma_z|}{|\Sigma|}\right). \qquad (4)$$

A natural question is now how to optimally select the regularization parameter z. While the bound (4) is clearly convex in z, it depends crucially on the unknown spectrum of Σ, and, in particular, on the smallest eigenvalues of Σ. As a result, it is difficult to choose z optimally in general, but we can do so for certain common subclasses of covariance matrices. For example, if Σ is Toeplitz or bandable (i.e., for some $c \in (0, 1)$, all $|\Sigma_{i,j}| \le c^{|i-j|}$), then the smallest eigenvalue of Σ can be bounded below (Cai & Yuan, 2012). When Σ is bandable, as we show in the Appendix, this bound can be independent of D. In these cases, the following somewhat simpler MSE bound can be used:

Corollary 9. Suppose $X \sim NPN(\Sigma; f)$, and suppose $z \le \lambda_D(\Sigma)$. Then, there exists a constant C > 0 such that

$$\mathbb{E}\left[\left(\hat I_{\rho,z} - I\right)^2\right] \le \frac{CD^2}{z^2 n}.$$

## 6. Lower Bounds in terms of Σ

If $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$ are Gaussian, then for the plug-in estimator $\hat I = -\frac{1}{2}\log|\hat\Sigma|$ (where $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$ is the empirical covariance matrix), Cai et al. (2015) showed that the distribution of $\hat I - I$ is independent of the true correlation matrix Σ. This follows from the stability of Gaussians (i.e., the fact that nonsingular linear transformations of Gaussian random variables are Gaussian). In particular,

$$\hat I - I = -\frac{1}{2}\left(\log|\hat\Sigma| - \log|\Sigma|\right) = -\frac{1}{2}\log\left|\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}\right|,$$

and $\Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}$ has the same distribution as $\hat\Sigma$ does in the special case that $\Sigma = I_D$ is the identity. This property is both somewhat surprising, given that $I \to \infty$ as $|\Sigma| \to 0$, and useful, leading to a tight analysis of the error of $\hat I$ and confidence intervals that do not depend on Σ. It would be convenient if nonparanormal information estimators satisfied this property as well. Unfortunately, the main result of this section is a negative one, showing that this property is unlikely to hold without additional assumptions:

Proposition 10. Consider the 2-dimensional case $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$, with

$$\Sigma = \begin{pmatrix} 1 & \sigma \\ \sigma & 1 \end{pmatrix}, \qquad (5)$$

and let $\sigma_* \in (0, 1)$. Suppose an estimator $\hat I = \hat I(R)$ of $I_\sigma = -\frac{1}{2}\log(1 - \sigma^2)$ is a function of the empirical rank matrix $R \in [n]^{n \times 2}$ of X.
Then, there exists a constant C > 0, depending only on n, such that the worst-case MSE of $\hat I$ over $\sigma \in (0, \sigma_*)$ satisfies

$$\sup_{\sigma \in (0, \sigma_*)} \mathbb{E}\left[\left(\hat I(R) - I_\sigma\right)^2\right] \ge \frac{1}{64}\left(C + \log\left(1 - \sigma_*^2\right)\right)^2.$$

Clearly, this lower bound tends to ∞ as $\sigma_* \to 1$. As written, this result lower bounds the error of rank-based estimators in the Gaussian case when $\sigma_* \approx 1$. However, to the best of our knowledge, all methods for estimating Σ in the nonparanormal case are functions of R, and prior work (Hoff, 2007) has shown that the rank matrix R is a generalized sufficient statistic for Σ (and hence for I) in the nonparanormal model. Thus, it is reasonable to think of lower bounds for rank-based estimators in the Gaussian case as lower bounds for any estimator in the nonparanormal case.

The proof of this result is based on the simple observation that the rank matrix can take only finitely many values. Hence, as σ → 1, R tends to be perfectly correlated, providing little information about σ, whereas the dependence of the estimand $I_\sigma$ on σ increases sharply. This intuition is formalized in the Appendix using Le Cam's lemma for lower bounds in two-point parameter estimation problems.

## 7. Empirical Results

We compare 5 mutual information estimators:

- $\hat I$: Gaussian plug-in estimator with bias correction (see Cai et al. (2015)).
- $\hat I_G$: Nonparanormal estimator using Gaussianization.
- $\hat I_\rho$: Nonparanormal estimator using Spearman's ρ.
- $\hat I_\tau$: Nonparanormal estimator using Kendall's τ.
- $\hat I_{kNN}$: Nonparametric estimator using k-nearest neighbor (kNN) statistics.

For $\hat I_\rho$ and $\hat I_\tau$, we used a regularization constant $z = 10^{-3}$. We did not regularize for $\hat I_G$. Although this implies $\mathbb{P}[\hat I_G = \infty] > 0$, this event is extremely unlikely for even moderate values of n and never occurred during our experiments, which all use n ≥ 32. We thus omit denoting dependence on z. For $\hat I_{kNN}$, except as noted in Experiment 3, k = 2, based on recent analysis (Singh & Póczos, 2016b) suggesting that small values of k are best for estimation. Sufficient details to reproduce the experiments are given in the Appendix, and MATLAB source code is available at [Omitted for anonymity]. We report MSE based on 1000 i.i.d. trials of each condition. 95% confidence intervals were consistently smaller than the plot markers and hence are omitted to avoid cluttering the plots.

Except as specified otherwise, each experiment had the following basic structure: In each trial, a correlation matrix Σ was drawn by normalizing a random covariance matrix from a Wishart distribution, and data $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$ were drawn. All 5 estimators were computed from $X_1, \ldots, X_n$, and the squared error from the true mutual information (computed from Σ) was recorded. Unless specified otherwise, n = 100 and D = 25 (a sketch of this protocol is given after Experiment 1). Since our nonparanormal information estimators are functions of the ranks of the data, neither the true mutual information nor our nonparanormal estimators depend on the marginal transformations. Thus, except in Experiment 2, where we show the effects of transforming marginals, and Experiment 3, where we add outliers to the data, we perform all experiments on truly Gaussian data, with the understanding that this setting favors the Gaussian estimator. All experimental results are displayed in Figure 1.

Figure 1. Plots of $\log_{10}$(MSE) over (a) log sample size $\log_{10}(n)$ (Experiment 1); (b) the fraction α of dimensions with non-Gaussian marginals (Experiment 2); (c) the fraction β of outlier samples in each dimension (Experiment 3); and (d) the covariance $\Sigma_{1,2} = \operatorname{Cov}(X_1, X_2)$ (Experiment 4). Note that the x-axis in (d) is decreasing.

Experiment 1 (Dependence on n): We first show that nonparanormal estimators have parametric $O(n^{-1})$ dependence on n, unlike $\hat I_{kNN}$, which converges far more slowly. For large n, the MSEs of $\hat I_G$, $\hat I_\rho$, and $\hat I_\tau$ are close to that of $\hat I$.
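The following is a sketch of the basic experimental protocol described above (our reconstruction, not the authors' MATLAB code; the Wishart degrees of freedom and the random seed are arbitrary illustrative choices, and the per-trial estimator calls are left as a comment):

```python
import numpy as np

def random_correlation(D, rng, df=None):
    """Draw a Wishart(df, I_D) covariance matrix and normalize it to a
    correlation matrix. The default df = D + 2 is an illustrative choice."""
    df = df if df is not None else D + 2
    G = rng.standard_normal((D, df))
    W = G @ G.T
    d = np.sqrt(np.diag(W))
    return W / np.outer(d, d)

def one_trial(n=100, D=25, seed=0):
    """One trial of the basic protocol: draw Sigma, draw Gaussian data, and
    compare estimates against the true I(X) = -0.5 log|Sigma| (Lemma 3)."""
    rng = np.random.default_rng(seed)
    Sigma = random_correlation(D, rng)
    X = rng.multivariate_normal(np.zeros(D), Sigma, size=n)
    I_true = -0.5 * np.linalg.slogdet(Sigma)[1]
    # e.g., errors = [(I_hat - I_true) ** 2 for I_hat in npn_mutual_information(X)]
    return X, I_true
```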
Experiment 2 (Non-Gaussian Marginals): Next, we show that nonparanormal estimators are robust to non-Gaussianity of the marginals, unlike $\hat I$. We applied a nonlinear transformation T to a fraction $\alpha \in [0, 1]$ of the dimensions of Gaussian data. That is, we drew $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$ and then used data $X_1, \ldots, X_n$, where

$$X_{i,j} = \begin{cases} T(Z_{i,j}) & \text{if } j < \alpha D \\ Z_{i,j} & \text{if } j \ge \alpha D \end{cases}, \qquad i \in [n],\ j \in [D],$$

for a diffeomorphism T. Here, we use $T(z) = e^z$; the Appendix shows similar results for several other T. (A sketch of this corruption mechanism, together with that of Experiment 3, is given at the end of this section.) $\hat I$ performs poorly even when α is quite small. The poor performance of $\hat I_{kNN}$ may be due to discontinuity of the density at x = 0.

Experiment 3 (Outliers): We now show that nonparanormal estimators are far more robust to the presence of outliers than $\hat I$ or $\hat I_{kNN}$. To do this, we added outliers to the data according to the method of Liu et al. (2012): after drawing Gaussian data, we independently select βn samples in each dimension and replace each selected value with a draw made i.i.d. uniformly at random from {−5, +5}. The performance of $\hat I$ degrades rapidly even for small β. $\hat I_{kNN}$ can fail for atomic distributions, since $\hat I_{kNN}$ diverges whenever at least k samples are identical. To mitigate this, we increased k to 20 and ignored trials where $\hat I_{kNN}$ diverged, but $\hat I_{kNN}$ ceased to give any finite estimates when β was sufficiently large. For small values of β, the nonparanormal estimators surprisingly improve. We hypothesize this is due to convexity of the mutual information functional Eq. (3) in Σ: by Jensen's inequality, estimators which plug in an approximately unbiased estimate $\hat\Sigma$ of Σ are biased towards overestimating I, and adding random (uncorrelated) noise reduces the estimated dependence, moving the estimate closer to the true value. If this nonlinearity is indeed a major source of bias, it may be possible to derive a von Mises-type bias correction (see Kandasamy et al. (2015)) accounting for higher-order terms in the Taylor expansion of the log-determinant.

Experiment 4 (Dependence on Σ): Here, we verify our results in Section 6 showing that the MSE of rank-based estimators approaches ∞ as |Σ| → 0, while the MSE of $\hat I$ is independent of Σ. We set D = 2 and Σ as in Eq. (5), varying σ ∈ [0, 1). Indeed, the MSE of $\hat I$ does not change, while the MSEs of $\hat I_G$, $\hat I_\rho$, and $\hat I_\tau$ all increase as σ → 1. This increase seems mild in practice, with performance worse than that of $\hat I$ only when σ > 0.99. $\hat I_\tau$ appears to perform far better than $\hat I_G$ and $\hat I_\rho$ in this regime. The performance of $\hat I_{kNN}$ degrades far more quickly as σ → 1. This phenomenon is explored by Gao et al. (2015), who lower bound the error of $\hat I_{kNN}$ in the presence of strong dependencies and propose a correction to improve performance in this case. It is also interesting that the errors of $\hat I_\rho$ and $\hat I_\tau$ drop as σ → 0. This is likely because, for small σ, the main source of error is the variance of $\hat\rho$ and $\hat\tau$ (since $-\log(1 - \sigma^2) \approx \sigma^2$ when σ ≈ 0). When $n \to \infty$ and D is fixed, both $2\sin(\pi\hat\rho/6)$ and $\sin(\pi\hat\tau/2)$ are asymptotically normal estimates of σ, with asymptotic variances proportional to $(1 - \sigma^2)^2$ (Klaassen & Wellner, 1997). By the delta method, since $\frac{dI}{d\sigma} = \frac{\sigma}{1 - \sigma^2}$, $\hat I_\rho$ and $\hat I_\tau$ are asymptotically normal estimates of I, with asymptotic variances proportional to $\sigma^2$ and hence vanishing as σ → 0.
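For completeness, here is a sketch (our own, with rounding conventions chosen for illustration) of the two corruption mechanisms used in Experiments 2 and 3: applying a diffeomorphism T to a fraction α of the dimensions, and replacing a fraction β of the samples in each dimension with ±5 outliers.

```python
import numpy as np

def transform_marginals(Z, alpha, T=np.exp):
    """Experiment 2: apply the diffeomorphism T to the first floor(alpha * D)
    dimensions of the Gaussian data Z; the remaining marginals stay Gaussian."""
    X = Z.copy()
    n, D = Z.shape
    m = int(np.floor(alpha * D))
    X[:, :m] = T(Z[:, :m])
    return X

def add_outliers(X, beta, rng, magnitude=5.0):
    """Experiment 3 (after Liu et al., 2012): in each dimension, independently
    select floor(beta * n) samples and replace each with +/- magnitude."""
    X = X.copy()
    n, D = X.shape
    m = int(np.floor(beta * n))
    for j in range(D):
        idx = rng.choice(n, size=m, replace=False)
        X[idx, j] = rng.choice([-magnitude, magnitude], size=m)
    return X
```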
## 8. Estimating Entropy

Thus far, we have discussed estimation of the mutual information I(X). Mutual information is convenient because it is invariant under marginal transformations, and hence I(X) = I(f(X)) depends only on Σ. While the entropy H(X) does depend on the marginal transform f, fortunately, by Eq. (1), H(X) differs from I(X) only by a sum of univariate entropies.

Univariate nonparametric estimation of entropy has been studied extensively, and there exist several estimators (e.g., based on sample spacings (Beirlant et al., 1997), kernel density estimates (Moon et al., 2016), or k-nearest neighbor methods (Singh & Póczos, 2016b)) that can estimate $H(X_j)$ at the rate $n^{-1}$ in MSE under relatively mild conditions on the marginal density $p_j$. While the precise assumptions vary with the choice of estimator, they are mainly (a) that $p_j$ be lower bounded on its support or have particular (e.g., exponential) tail behavior, and (b) that $p_j$ be smooth, typically quantified by a Hölder or Sobolev condition. Details of these assumptions are in the Appendix. Under these conditions, there exist estimators $\hat H_1, \ldots, \hat H_D$ and a constant C > 0 such that

$$\mathbb{E}\left[\left(\hat H_j - H(X_j)\right)^2\right] \le \frac{C}{n} \quad \text{for all } j \in [D]. \qquad (6)$$

Combining these estimators with an estimator, say $\hat I_{\rho,z}$, of mutual information gives an estimator of entropy:

$$\hat H_{\rho,z} := \sum_{j=1}^D \hat H_j - \hat I_{\rho,z}.$$

If we assume $z = \lambda_D(\Sigma)$ is bounded below by a positive constant, combining inequality (6) with Corollary 9 gives

$$\mathbb{E}\left[\left(\hat H_{\rho,z} - H(X)\right)^2\right] \le \frac{CD^2}{z^2 n},$$

where C differs from the constant in (6) but is independent of n and D.
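A sketch of the resulting entropy estimator follows. The univariate estimator below is a simple m-spacing (Vasicek-type) estimate, one member of the spacings-based family surveyed by Beirlant et al. (1997); the paper does not prescribe a particular choice, and the $m \approx \sqrt{n}$ window and the omission of finite-sample bias corrections are our simplifications.

```python
import numpy as np

def vasicek_entropy(x, m=None):
    """Univariate m-spacing entropy estimate:
    H_hat = mean_i log( n/(2m) * (x_(i+m) - x_(i-m)) ),
    with order statistics clamped at the boundaries."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = m or max(1, int(round(np.sqrt(n))))
    upper = x[np.minimum(np.arange(n) + m, n - 1)]
    lower = x[np.maximum(np.arange(n) - m, 0)]
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

def npn_entropy(X, I_hat):
    """H(X) = sum_j H(X_j) - I(X), per Eq. (1): combine univariate entropy
    estimates with a nonparanormal mutual information estimate I_hat
    (e.g., the Spearman-based estimator of Section 4)."""
    marginal_sum = sum(vasicek_entropy(X[:, j]) for j in range(X.shape[1]))
    return marginal_sum - I_hat
```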
## 9. Conclusions and Future Work

This paper suggests nonparanormal information estimation as a practical compromise between the intractable nonparametric case and the limited Gaussian case. We proposed three estimators for this problem and provided the first upper bounds for nonparanormal information estimation. We also gave lower bounds showing how dependence on Σ differs from the Gaussian case, and demonstrated empirically that nonparanormal estimators are more robust than Gaussian estimators, even in dimensions too high for nonparametric estimators. Collectively, these results suggest that, by scaling to moderate or high dimensionality without relying on Gaussianity, nonparanormal information estimators may be effective tools with a number of machine learning applications. While the best choice of information estimator inevitably depends on context, as an off-the-shelf guide for practitioners, the estimators we suggest, in order of preference, are:

- a fully nonparametric estimator if D < 6 and n > max{100, 10D};
- $\hat I_\rho$ if D²/n is small and data may have outliers;
- $\hat I_\tau$ if D²/n is small and dependencies may be strong;
- $\hat I_G$ otherwise;
- $\hat I$ only given a strong belief that the data are nearly Gaussian.

There are many natural open questions in this line of work. First, in the nonparanormal model, we focused on estimating the mutual information I(X), which does not depend on the marginal transforms f, and the entropy, which decomposes into I(X) and 1-dimensional entropies. In both cases, the additional structure imposed by the nonparanormal model allows estimation in higher dimensions than fully nonparametric models. Can nonparanormal assumptions lead to higher-dimensional estimators for the many other useful nonlinear functionals of densities (e.g., Lp norms/distances and more general (e.g., Rényi or Tsallis) entropies, mutual informations, and divergences) that do not decompose? Second, there is a gap between our upper bound rate of $\|\Sigma^{-1}\|_2^2 D^2/n$ and the only known lower bound of 2D/n (from the Gaussian case), though we also showed that bounds for rank-based estimators depend on Σ. Is quadratic dependence on D optimal? How much do rates improve under structural assumptions on Σ? Upper bounds should also be derived for other estimators, such as $\hat I_G$ and $\hat I_\tau$. The 2D/n lower bound proof of Cai et al. (2015) for the Gaussian case, based on the Cramér-Rao inequality (Van den Bos, 2007), is unlikely to tighten in the nonparanormal case, since Fisher information is invariant to diffeomorphisms of the data. Hence, a new approach is needed if the lower bound in the nonparanormal case is to be raised. Finally, our work applies to estimating the log-determinant log|Σ| of the latent correlation in a nonparanormal model. Besides information estimation, the work of Cai et al. (2015) on estimating log|Σ| in the Gaussian model was motivated by the role of log|Σ| in other multivariate statistical tools, such as quadratic discriminant analysis (QDA) and MANOVA (Anderson, 1984). Can our estimators lead to more robust nonparanormal versions of these tools?

## Acknowledgements

This research is supported in part by DOE grant DESC0011114 and NSF grant IIS1563887 to B.P., and by an NSF Graduate Research Fellowship to S.S. under Grant No. DGE-1252522. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

## References

Adami, Christoph. Information theory in molecular biology. Physics of Life Reviews, 1(1):3-22, 2004.
Aghagolzadeh, Mehdi, Soltanian-Zadeh, Hamid, Araabi, B, and Aghagolzadeh, Ali. A hierarchical clustering based on mutual information maximization. In Image Processing, 2007. ICIP 2007. IEEE International Conference on, volume 1, pp. I-277. IEEE, 2007.
Aghakouchak, Amir. Entropy-copula in hydrology and climatology. J. Hydrometeorology, 15(6):2176-2189, 2014.
Ahmed, Nabil Ali and Gokhale, DV. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. on Information Theory, 35(3):688-692, 1989.
Anderson, TW. Multivariate Statistical Analysis. Wiley and Sons, New York, NY, 1984.
Beirlant, Jan, Dudewicz, Edward J, Győrfi, László, and Van der Meulen, Edward C. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17-39, 1997.
Bercher, J-F and Vignat, Christophe. Estimating the entropy of a signal with applications. IEEE Trans. on Signal Processing, 48(6):1687-1694, 2000.
Bereziński, Przemysław, Jasiul, Bartosz, and Szpyrka, Marcin. An entropy-based network anomaly detection method. Entropy, 17(4):2367-2408, 2015.
Berkes, Pietro, Wood, Frank, and Pillow, Jonathan W. Characterizing neural dependencies with copula models. In Advances in Neural Information Processing Systems, pp. 129-136, 2009.
Birgé, Lucien and Massart, Pascal. Estimation of integral functionals of a density. Annals of Statistics, pp. 11-29, 1995.
Cai, T Tony and Yuan, Ming. Adaptive covariance matrix estimation through block thresholding. Annals of Statistics, 40(4):2014-2042, 2012.
Cai, T Tony, Liang, Tengyuan, and Zhou, Harrison H. Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. J. of Multivariate Analysis, 137:161-172, 2015.
Calsaverini, Rafael S and Vicente, Renato. An information-theoretic approach to statistical dependence: Copula information. EPL (Europhysics Letters), 88(6):68003, 2009.
Chai, Barry, Walther, Dirk, Beck, Diane, and Fei-Fei, Li. Exploring functional connectivities of the human brain using multivariate information analysis. In Advances in Neural Information Processing Systems, pp. 270-278, 2009.
Chow, C and Liu, Cong. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462-467, 1968.
Elidan, Gal. Copulas in machine learning. In Copulae in Mathematical and Quantitative Finance, pp. 39-60. Springer, 2013.
Friedman, Jerome H and Stuetzle, Werner. Projection pursuit regression. JASA, 76(376):817-823, 1981.
Gao, Shuyang, Ver Steeg, Greg, and Galstyan, Aram. Efficient estimation of mutual information for strongly dependent variables. In AISTATS, 2015.
Garner, Wendell R. Uncertainty and Structure as Psychological Concepts. 1962.
Goria, Mohammed Nawaz, Leonenko, Nikolai N, Mergel, Victor V, and Novi Inverardi, Pier Luigi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Nonparametric Statistics, 17(3):277-297, 2005.
Henrion, Didier and Malick, Jérôme. Projection methods in conic optimization. In Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 565-600. Springer, 2012.
Hernández-Lobato, José Miguel, Lloyd, James R, and Hernández-Lobato, Daniel. Gaussian process conditional copulas with applications to financial time series. In Advances in Neural Information Processing Systems, pp. 1736-1744, 2013.
Hero, Alfred O, Ma, Bing, Michel, Olivier, and Gorman, John. Alpha-divergence for classification, indexing and retrieval (revised). 2001.
Hero, Alfred O, Ma, Bing, Michel, Olivier JJ, and Gorman, John. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85-95, 2002.
Hoff, Peter D. Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, pp. 265-283, 2007.
Hunter, Jacob S and Hodas, Nathan O. Mutual information for fitting deep nonlinear models. arXiv preprint arXiv:1612.05708, 2016.
Ince, Robin AA, Giordano, Bruno L, Kayser, Christoph, Rousselet, Guillaume A, Gross, Joachim, and Schyns, Philippe G. A statistical framework for neuroimaging data analysis based on mutual information estimated via a Gaussian copula. Human Brain Mapping, 38(3):1541-1573, 2017.
Kandasamy, Kirthevasan, Krishnamurthy, Akshay, Poczos, Barnabas, Wasserman, Larry, and Robins, James M. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pp. 397-405, 2015.
Klaassen, Chris AJ and Wellner, Jon A. Efficient estimation in the bivariate normal copula model: normal margins are least favourable. Bernoulli, 3(1):55-77, 1997.
Krishnamurthy, Akshay, Kandasamy, Kirthevasan, Poczos, Barnabas, and Wasserman, Larry. Nonparametric estimation of Rényi divergence and friends. In International Conference on Machine Learning, pp. 919-927, 2014.
Kruskal, William H. Ordinal measures of association. JASA, 53(284):814-861, 1958.
Learned-Miller, E. G. and Fisher, J. W. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271-1295, 2003.
Liu, Han, Han, Fang, Yuan, Ming, Lafferty, John, and Wasserman, Larry. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293-2326, 2012.
Ma, Jian and Sun, Zengqi. Mutual information is copula entropy. Tsinghua Science & Technology, 16(1):51-54, 2011.
Malevergne, Yannick and Sornette, Didier. Testing the Gaussian copula hypothesis for financial assets dependences. Quantitative Finance, 3(4):231-250, 2003.
Misra, Neeraj, Singh, Harshinder, and Demchuk, Eugene. Estimation of the entropy of a multivariate normal distribution. J. Multivariate Analysis, 92(2):324-342, 2005.
Moon, Kevin and Hero, Alfred. Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems, pp. 2420-2428, 2014a.
Moon, Kevin R and Hero, Alfred O. Ensemble estimation of multivariate f-divergence. In IEEE International Symposium on Information Theory (ISIT), pp. 356-360. IEEE, 2014b.
Moon, Kevin R, Sricharan, Kumar, Greenewald, Kristjan, and Hero, Alfred O. Improving convergence of divergence functional ensemble estimators. In IEEE International Symposium on Information Theory (ISIT), pp. 1133-1137. IEEE, 2016.
Moon, Kevin R, Sricharan, Kumar, and Hero III, Alfred O. Ensemble estimation of mutual information. arXiv preprint arXiv:1701.08083, 2017.
Noble, Caleb C and Cook, Diane J. Graph-based anomaly detection. In KDD, pp. 631-636. ACM, 2003.
Nychis, George, Sekar, Vyas, Andersen, David G, Kim, Hyong, and Zhang, Hui. An empirical evaluation of entropy-based traffic anomaly detection. In Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, pp. 151-156. ACM, 2008.
Pál, Dávid, Póczos, Barnabás, and Szepesvári, Csaba. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pp. 1849-1857, 2010.
Peng, Hanchuan, Long, Fuhui, and Ding, Chris. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
Renard, Benjamin and Lang, Michel. Use of a Gaussian copula for multivariate extreme value analysis: some case studies in hydrology. Advances in Water Resources, 30(4):897-912, 2007.
Rey, Mélanie and Roth, Volker. Meta-Gaussian information bottleneck. In Advances in Neural Information Processing Systems, pp. 1916-1924, 2012.
Shan, Caifeng, Gong, Shaogang, and McOwan, Peter W. Conditional mutual information based boosting for facial expression recognition. In BMVC, 2005.
Shishkin, Alexander, Bezzubtseva, Anastasia, Drutsa, Alexey, Shishkov, Ilia, Gladkikh, Ekaterina, Gusev, Gleb, and Serdyukov, Pavel. Efficient high-order interaction-aware feature selection based on conditional mutual information. In Advances in Neural Information Processing Systems, pp. 4637-4645, 2016.
Singh, Shashank and Póczos, Barnabás. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pp. 3032-3040, 2014a.
Singh, Shashank and Póczos, Barnabás. Generalized exponential concentration inequality for Rényi divergence estimation. In ICML, pp. 333-341, 2014b.
Singh, Shashank and Póczos, Barnabás. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pp. 1217-1225, 2016a.
Singh, Shashank and Póczos, Barnabás. Analysis of k-nearest neighbor distances with application to entropy estimation. arXiv preprint arXiv:1603.08578, 2016b.
Sricharan, Kumar, Raich, Raviv, and Hero, Alfred O. k-nearest neighbor estimation of entropies with confidence. In IEEE International Symposium on Information Theory (ISIT), pp. 1205-1209. IEEE, 2011.
Sricharan, Kumar, Wei, Dennis, and Hero, Alfred O. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374-4388, 2013.
Srivastava, Santosh and Gupta, Maya R. Bayesian estimation of the entropy of the multivariate Gaussian. In IEEE International Symposium on Information Theory (ISIT), pp. 1103-1107. IEEE, 2008.
Studený, Milan and Vejnarová, Jirina. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models, pp. 261-297. Springer, 1998.
Szabó, Zoltán, Póczos, Barnabás, and Lőrincz, András. Undercomplete blind subspace deconvolution. Journal of Machine Learning Research, 8(May):1063-1095, 2007a.
Szabó, Zoltán, Póczos, Barnabás, Szirtes, Gábor, and Lőrincz, András. Post nonlinear independent subspace analysis. In International Conference on Artificial Neural Networks, pp. 677-686. Springer, 2007b.
Van den Bos, Adriaan. Parameter Estimation for Scientists and Engineers. John Wiley & Sons, 2007.
Watanabe, Satosi. Information theoretical analysis of multivariate correlation. IBM J. of Research and Development, 4(1):66-82, 1960.
Wilson, Andrew and Ghahramani, Zoubin. Copula processes. In Advances in Neural Information Processing Systems, pp. 2460-2468, 2010.
Wolsztynski, E., Thierry, E., and Pronzato, L. Minimum-entropy estimation in semi-parametric models. Signal Processing, 85(5):937-949, 2005. ISSN 0165-1684.