# Measuring Dependence Powerfully and Equitably

*Journal of Machine Learning Research 17 (2016) 1-63. Submitted 6/15; Revised 11/16; Published 11/16. ©2016 Yakir A. Reshef, David N. Reshef, Hilary K. Finucane, Pardis C. Sabeti, and Michael M. Mitzenmacher.*

**Yakir A. Reshef** (yakir@seas.harvard.edu), School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA

**David N. Reshef** (dnreshef@mit.edu), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

**Hilary K. Finucane** (hilaryf@mit.edu), Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

**Pardis C. Sabeti** (psabeti@oeb.harvard.edu), Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA

**Michael Mitzenmacher** (michaelm@eecs.harvard.edu), School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA

*Yakir A. Reshef and David N. Reshef are co-first authors; correspondence should be addressed to Yakir A. Reshef. Pardis C. Sabeti and Michael Mitzenmacher are co-last authors.*

**Editor:** Edo Airoldi

## Abstract

Given a high-dimensional data set, we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used (a) has good power to detect non-trivial relationships, and (b) is equitable, meaning that for some measure of noise it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic). In this paper, we define and theoretically characterize two new statistics that together yield an efficient approach for obtaining both power and equitability. To do this, we first introduce a new population measure of dependence and show three equivalent ways that it can be viewed, including as a canonical smoothing of mutual information. We then introduce an efficiently computable consistent estimator of our population measure of dependence, and we empirically establish its equitability on a large class of noisy functional relationships. This new statistic has better bias/variance properties and better runtime complexity than a previous heuristic approach. Next, we derive a second, related statistic whose computation is a trivial side-product of our algorithm and whose goal is powerful independence testing rather than equitability. We prove that this statistic yields a consistent independence test and show in simulations that the test has good power against independence. Taken together, our results suggest that these two statistics are a valuable pair of tools for exploratory data analysis.

**Keywords:** maximal information coefficient, total information coefficient, equitability, statistical power, mutual information

## 1. Introduction

The growing dimensionality of today's data sets has popularized the idea of hypothesis-generating science, whereby a data set is used not to test existing hypotheses but rather to help a researcher formulate new ones. A common approach among practitioners is to evaluate some statistic on many candidate variable pairs in a data set, sort the variable pairs from highest-scoring to lowest, and manually examine all the pairs above a threshold score (Storey and Tibshirani, 2003; Emilsson et al., 2008).
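As a minimal illustration of this screening workflow (a sketch only, not the paper's code; `dependence_score` stands in for whichever statistic is used):

```python
import itertools
import numpy as np

def top_pairs(data, dependence_score, threshold):
    """Score every column pair of an (n, p) data matrix and keep the
    highest-scoring pairs for manual follow-up."""
    scores = []
    for i, j in itertools.combinations(range(data.shape[1]), 2):
        scores.append((dependence_score(data[:, i], data[:, j]), i, j))
    scores.sort(reverse=True)                   # highest-scoring pairs first
    return [(i, j, s) for s, i, j in scores if s >= threshold]

# Example with a placeholder statistic (absolute Pearson correlation):
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
pairs = top_pairs(X, lambda u, v: abs(np.corrcoef(u, v)[0, 1]), 0.2)
```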
A popular class of statistics used for such analyses is measures of dependence, i.e., statistics whose population value is zero in cases of statistical independence and non-zero otherwise. Measures of dependence are attractive because they guarantee that asymptotically no non-trivial relationship will erroneously be declared trivial. In the setting of continuous-valued data, which is our focus, there is a long line of fruitful research on such statistics, including, e.g., Hoeffding (1948); Rényi (1959); Breiman and Friedman (1985); Paninski (2003); Székely et al. (2007); Gretton et al. (2005); Reshef et al. (2011); Gretton et al. (2012); Lopez-Paz et al. (2013); Heller et al. (2013); Jiang et al. (2015); Heller et al. (2016).

One way to measure the utility of a measure of dependence $\hat\varphi$ is power against independence, i.e., the power of independence testing based on $\hat\varphi$ to detect various types of non-trivial relationships. This is an important goal for data sets that have very few non-trivial relationships, or only very weak relationships that are difficult to detect. Often, however, the number of relationships declared statistically significant by a measure of dependence greatly exceeds the number of relationships that can then be explored further. For example, biological data sets often contain many non-trivial relationships, but further corroborating any one of them may take extensive manual lab work or a study on human or animal subjects. In this case, it is tempting to restrict follow-up to a few relationships with the highest values of $\hat\varphi$, but this can skew the direction of follow-up work: if $\hat\varphi$ systematically assigns higher scores to, say, linear relationships than to non-linear ones, relatively noisy linear relationships might crowd out strong non-linear relationships from the top-scoring set.

Motivated by this problem, we previously introduced a second way of assessing a measure of dependence, called equitability (Reshef et al., 2011). Informally, an equitable statistic is one that, for some measure of relationship strength, assigns similar scores to equally strong relationships regardless of relationship type. For instance, we may want our measure of dependence to also have the property that on noisy functional relationships it assigns similar scores to relationships with the same $R^2$, i.e., the squared Pearson correlation between the observed y-values and the x-values passed through the underlying function in question (Reshef et al., 2011). Or, alternatively, we may want the value of our statistic to tell us about the proportion of points coming from the deterministic component of a mixture containing part signal and part uniform noise (Ding and Li, 2013). Defining measures of dependence that achieve good equitability with respect to interesting measures of relationship strength is a new and challenging problem, with a number of different formalizations. (See, e.g., Reshef et al., 2015b, and Ding and Li, 2013, cited above, as well as Kinney and Atwal, 2014, along with the associated technical comments Reshef et al., 2014, and Murrell et al., 2014.) A companion paper to this work (Reshef et al., 2015b) presents a general formalization that unifies these.

In this paper, we introduce and theoretically characterize two new measures of dependence that we empirically show to have good equitability with respect to $R^2$ and power against independence, respectively. We begin by introducing a new population measure of dependence called MIC∗.
Given a pair of jointly distributed random variables $(X, Y)$, $\mathrm{MIC}^*(X,Y)$ is the supremum, over all finite grids $G$ imposed on the support of $(X, Y)$, of the mutual information of the discrete distribution induced by $(X, Y)$ on the cells of $G$, subject to a regularization based on the resolution of $G$. We prove three results, each of which gives a different way that this population quantity can be viewed.

1. MIC∗ is the population value of the maximal information coefficient (MIC), a statistic introduced in Reshef et al. (2011) that is empirically highly equitable with respect to $R^2$ on a large class of noisy functional relationships. Simple corollaries of this result simplify and strengthen many of the theoretical results proven in Reshef et al. (2011) about MIC.

2. MIC∗ is a minimal smoothing of mutual information, in the sense that the regularization in the definition of MIC∗ renders it uniformly continuous as a function of random variables with respect to statistical distance, and no smaller regularization achieves continuity. This result yields as a corollary that mutual information by itself is not continuous with respect to statistical distance.

3. MIC∗ is the supremum of an infinite sequence defined in terms of optimal (one-dimensional) partitions of the marginal distributions of $(X, Y)$ rather than optimal (two-dimensional) grids imposed on the joint distribution. This characterization greatly simplifies computation.

After proving these three results, we leverage them to introduce efficient algorithms both for approximating MIC∗ in practice and for estimating it consistently from a finite sample. We first provide an efficient algorithm that in many cases allows for computation to arbitrary precision of the MIC∗ of a pair of random variables whose joint density is known. We then introduce a statistic, called MICe, that we prove is a consistent estimator of MIC∗. In contrast to the MIC statistic from Reshef et al. (2011), for which no efficient algorithm is known and a heuristic algorithm is used in practice, MICe is efficiently computable. It has a better runtime complexity than the heuristic algorithm currently in use for computing the original MIC statistic, and is orders of magnitude faster in practice.

With a consistent and fast estimator for MIC∗ in hand, we turn to an empirical analysis of its performance. Specifically, we show through simulation that MICe has better bias/variance properties than the heuristic algorithm used in Reshef et al. (2011) for computing MIC, which has no theoretical convergence guarantees. Our analysis also reveals that the main parameter of MICe can be used to tune statistical performance toward either stronger or weaker relationships in general. After studying the bias/variance properties of MICe, we then demonstrate via simulation that it outperforms currently available methods in terms of equitability with respect to $R^2$ on a broad set of noisy functional relationships. We show this performance advantage both on the set of functional relationships analyzed in Reshef et al. (2011) and on a large set of randomly chosen noisy functional relationships. We choose in this paper to analyze equitability specifically with respect to $R^2$, rather than some other notion of relationship strength, because $R^2$ on noisy functional relationships is a simple measure with broad familiarity and intuitive interpretation among practitioners.
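For concreteness, this is the notion of $R^2$ used throughout this paper, rendered as a minimal sketch (the function and noise level below are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 0.3
x = rng.uniform(0, 1, n)
f_x = np.sin(4 * np.pi * x)          # underlying (noiseless) function values
y = f_x + rng.normal(0, sigma, n)    # observed noisy y-values

# R^2 with respect to the generating function: squared Pearson correlation
# between the observed y-values and f(x).
r_squared = np.corrcoef(f_x, y)[0, 1] ** 2
print(r_squared)
```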
Of course, it is also important to develop measures of dependence that are equitable with respect to notions of relationship strength besides $R^2$, or on families of relationships besides noisy functional relationships; however, our focus here remains on the simple case of $R^2$ on noisy functional relationships. Importantly, we note that although there are methods for directly estimating the $R^2$ of a noisy functional relationship via non-parametric regression (see, e.g., Cleveland and Devlin, 1988; Stone, 1977), those methods are not applicable in the context of equitability because they are not measures of dependence. That is, because non-parametric regression methods assume a functional form for the relationship in question, they can give trivial scores to non-functional relationships, even in the large-sample limit. (A simple example of this is a uniform distribution over a circle, whose regression function is constant.) In contrast, a measure of dependence is guaranteed never to make this mistake. A measure of dependence that is equitable with respect to $R^2$ can therefore be viewed either as an upgraded measure of dependence that also comes with some of the interpretability properties of non-parametric regression, or as an upgraded approximate non-parametric regression method that also has the robustness properties of a measure of dependence.

The main strength of MICe is equitability rather than power to reject a null hypothesis of independence. In some settings, though, it may be more important to focus on good power against independence. We therefore introduce here a statistic closely related to MICe called the total information coefficient and denoted TICe. We prove the consistency of testing for independence using TICe, and show via simulations that it achieves excellent power in practice, performing comparably to or better than current methods on a suite of benchmark relationships from Simon and Tibshirani (2012). Because TICe arises naturally as a side-product of the computation of MICe, it is available for free once MICe has been computed. This leads us to propose a data analysis strategy consisting of first using TICe to filter out non-significant relationships, and then ranking the remaining ones using the simultaneously computed values of MICe.

In addition to the companion paper Reshef et al. (2015b), which focuses on the theory behind equitability, this paper is accompanied by a second companion work (Reshef et al., 2015a) that explores in detail the empirical performance of the methods introduced here. That paper compares MICe and TICe to several leading measures of dependence (Kraskov et al., 2004; Székely and Rizzo, 2009; Heller et al., 2013, 2016; Gretton et al., 2005; Breiman and Friedman, 1985; Lopez-Paz et al., 2013) on a broad range of relationship types under many different sampling and noise models, finding that the equitability with respect to $R^2$ of MICe and the power of independence testing using TICe are both state-of-the-art on the relationships examined. It also shows that these methods can be computed very fast in practice.

Taken together, our results shed significant light on the theory behind the maximal information coefficient, and suggest that TICe and MICe are a useful pair of methods for data exploration. Specifically, they point to joint use of these two statistics to filter and then rank relationships as a fast, practical way to explore large data sets by measuring dependence both powerfully and equitably.
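A minimal sketch of the proposed filter-then-rank strategy (illustrative only; `mic_e` and `tic_e` are hypothetical stand-ins for the statistics introduced below, and `tic_null_quantile` for a permutation-based critical value; none of these names come from the paper):

```python
import itertools

def explore(data, mic_e, tic_e, tic_null_quantile):
    """Filter variable pairs with TICe, then rank survivors by MICe.
    Both statistics come from the same computation, so this costs
    little more than computing MICe alone."""
    results = []
    for i, j in itertools.combinations(range(data.shape[1]), 2):
        x, y = data[:, i], data[:, j]
        if tic_e(x, y) >= tic_null_quantile:     # step 1: filter out non-significant pairs
            results.append((mic_e(x, y), i, j))  # step 2: record equitable score
    results.sort(reverse=True)                   # rank remaining pairs by MICe
    return results
```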
## 2. Preliminaries

We work extensively in this paper with grids and discrete distributions over their cells. Given a grid $G$ and a point $(x, y)$, we define $\mathrm{row}_G(y)$ to be the row of $G$ containing $y$, and we define $\mathrm{col}_G(x)$ analogously. For a pair $(X, Y)$ of jointly distributed random variables, we write $(X,Y)|_G$ to denote $(\mathrm{col}_G(X), \mathrm{row}_G(Y))$, and we use $I((X,Y)|_G)$ to denote the discrete mutual information (Cover and Thomas, 2006; Csiszár and Shields, 2004; Csiszár, 2008) between $\mathrm{col}_G(X)$ and $\mathrm{row}_G(Y)$. Given a finite sample $D$ from the distribution of $(X, Y)$, we sometimes use $D$ to refer both to the set of points in the sample and to a point chosen uniformly at random from $D$. In the latter case, it then makes sense to talk about, e.g., $D|_G$ and $I(D|_G)$.

For natural numbers $k$ and $\ell$, we use $\mathbf{G}(k,\ell)$ to denote the set of all $k$-by-$\ell$ grids (possibly with empty rows/columns). A grid $G$ is an equipartition of $(X, Y)$ if all the rows of $(X,Y)|_G$ have the same probability mass, and all the columns do as well. We also use the term equipartition in the analogous way for one-dimensional partitions into just rows or columns. For a one-dimensional partition $P$ into rows and a one-dimensional partition $Q$ into columns, we write $(P, Q)$ to refer to the grid constructed from these two partitions. When a partition $P$ can be obtained from a partition $P'$ by the addition of separators alone, we write $P' \subseteq P$.

Finally, let us establish some notation for infinite matrices. We use $m^\infty$ to denote the space of infinite matrices equipped with the supremum norm. Given a matrix $A \in m^\infty$, we often examine only the $(k,\ell)$-th entries of $A$ for which $k\ell \le i$ for some $i$. Thus, for $i \in \mathbb{Z}^+$, we define the projection $r_i : m^\infty \to m^\infty$ via
$$r_i(A)_{k,\ell} = \begin{cases} A_{k,\ell} & k\ell \le i \\ 0 & k\ell > i. \end{cases}$$

Unless noted otherwise, all logarithms are to base 2.

## 3. The Population Maximal Information Coefficient MIC∗

In this section, we define and characterize the population maximal information coefficient MIC∗. We begin by defining the population quantity $\mathrm{MIC}^*(X,Y)$ for a pair of jointly distributed random variables $(X, Y)$. We then show three different ways to characterize this population quantity: first, as the large-sample limit of the statistic MIC from Reshef et al. (2011); second, as a minimally smoothed version of mutual information; and third, as the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions of the marginal distributions of $(X, Y)$. We conclude the section by showing how the third characterization leads to an efficient approach for approximating MIC∗ in practice from the density of $(X, Y)$.

### 3.1 Defining MIC∗

The population maximal information coefficient can be defined in several equivalent ways, as we will see later. For now, we begin with the simplest definition.

**Definition 1** Let $(X, Y)$ be jointly distributed random variables. The population maximal information coefficient (MIC∗) of $(X, Y)$ is defined by
$$\mathrm{MIC}^*(X,Y) = \sup_G \frac{I((X,Y)|_G)}{\log \|G\|}$$
where $\|G\|$ denotes the minimum of the number of rows of $G$ and the number of columns of $G$.

Given that $I(X,Y) = \sup_G I((X,Y)|_G)$ (see, e.g., Chapter 8 of Cover and Thomas 2006), this can be viewed as a regularized version of mutual information that penalizes complicated grids and ensures that the result falls between zero and one.
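To make the quantity $I(D|_G)$ concrete for samples, here is a minimal sketch (an illustration only, not the paper's code) of computing the discretized mutual information of a sample over a fixed grid specified by its column and row separators:

```python
import numpy as np

def grid_mutual_information(x, y, col_edges, row_edges):
    """Discrete mutual information (in bits) of the distribution induced
    by assigning each sample point to a cell of the grid G."""
    cols = np.digitize(x, col_edges)   # col_G(x) for each point
    rows = np.digitize(y, row_edges)   # row_G(y) for each point
    joint = np.zeros((len(col_edges) + 1, len(row_edges) + 1))
    np.add.at(joint, (cols, rows), 1)  # contingency table over grid cells
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)  # column marginal
    py = p.sum(axis=0, keepdims=True)  # row marginal
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

# Example: a 2-by-3 grid on 1,000 sample points
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
y = x + rng.normal(0, 0.1, 1000)
print(grid_mutual_information(x, y, col_edges=[1/3, 2/3], row_edges=[0.55]))
```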
Before we continue, we state one simple equivalent definition of MIC∗ that is useful for the results in this section. This definition views MIC∗ as the supremum of a matrix called the population characteristic matrix, defined below.

**Definition 2** Let $(X, Y)$ be jointly distributed random variables. Let
$$I^*((X,Y),k,\ell) = \max_{G \in \mathbf{G}(k,\ell)} I((X,Y)|_G).$$
The population characteristic matrix of $(X, Y)$, denoted by $M(X,Y)$, is defined by
$$M(X,Y)_{k,\ell} = \frac{I^*((X,Y),k,\ell)}{\log\min\{k,\ell\}}$$
for $k, \ell > 1$.

It is easy to see the following:

**Proposition 3** Let $(X, Y)$ be jointly distributed random variables. We have $\mathrm{MIC}^*(X,Y) = \sup M(X,Y)$, where $M(X,Y)$ is the population characteristic matrix of $(X, Y)$.

The population characteristic matrix is so named because, just as MIC∗, the supremum of this matrix, captures a sense of relationship strength, other properties of this matrix correspond to different properties of relationships. For instance, later in this paper we introduce an additional property of the characteristic matrix, the total information coefficient, that is useful for testing for the presence or absence of a relationship rather than for quantifying relationship strength.

### 3.2 First Alternate Characterization: MIC∗ Is the Population Value of MIC

With MIC∗ defined, we now state our first alternate characterization of it, as the large-sample limit of the statistic MIC introduced in Reshef et al. (2011). We begin by reproducing a description of MIC from Reshef et al. (2011), via the two definitions below.

**Definition 4 (Reshef et al., 2011)** Let $D \subseteq \mathbb{R}^2$ be a set of ordered pairs. The sample characteristic matrix $\hat M(D)$ of $D$ is defined by
$$\hat M(D)_{k,\ell} = \frac{I^*(D,k,\ell)}{\log\min\{k,\ell\}}.$$

**Definition 5 (Reshef et al., 2011)** Let $D \subseteq \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
$$\mathrm{MIC}_B(D) = \max_{k\ell \le B(n)} \hat M(D)_{k,\ell}$$
where the function $B(n)$ is specified by the user.

In Reshef et al. (2011), it was suggested that $B(n)$ be chosen to be $n^\alpha$ for some constant $\alpha$ in the range of 0.5 to 0.8. (The statistics we introduce later will have an analogous parameter; see Section 4.4.1.)

We show the following result about convergence of functions of the sample characteristic matrix to their population counterparts, a consequence of which is the convergence of MIC to MIC∗. (In the theorem statement below, recall that $m^\infty$ is the space of infinite matrices equipped with the supremum norm, and that, given a matrix $A$, the projection $r_i$ zeros out all the entries $A_{k,\ell}$ for which $k\ell > i$.)

**Theorem 6** Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every pair of jointly distributed random variables $(X, Y)$, we have
$$f\left(r_{B(n)}\left(\hat M(D_n)\right)\right) \to f(M(X,Y)) \text{ in probability}$$
where $D_n$ is a sample of size $n$ from the distribution of $(X, Y)$, provided $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

**Proof** See Appendix A.

Since the supremum of a matrix is uniformly continuous as a function on $m^\infty$ and can be realized as the limit of maxima of larger and larger segments of the matrix, this theorem yields our claim about MIC as a corollary.

**Corollary 7** $\mathrm{MIC}_B$ is a consistent estimator of MIC∗ provided $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.
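Schematically, Definition 5 amounts to the following (a sketch; `max_grid_info`, standing in for the inner maximization $I^*(D,k,\ell)$, is an assumed black box and is the computationally hard part):

```python
import numpy as np

def mic_b(D, max_grid_info, alpha=0.6):
    """MIC_B(D): maximum of the sample characteristic matrix over all
    (k, l) with k*l <= B(n) = n**alpha."""
    n = len(D)
    B = int(n ** alpha)
    best = 0.0
    for k in range(2, B // 2 + 1):
        for l in range(2, B // k + 1):
            # Entry of the sample characteristic matrix: normalized I*
            entry = max_grid_info(D, k, l) / np.log2(min(k, l))
            best = max(best, entry)
    return best
```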
Though Theorem 6 is proven in Appendix A, we provide here some intuition for why it should hold, as well as a description of the obstacles that must be overcome in the proof. For concreteness, suppose $f$ is the supremum function. To see why the theorem should hold, fix a pair of jointly distributed random variables $(X, Y)$ and let $D$ be a sample of size $n$ from its distribution. It is known that for a fixed grid $G$, $I(D|_G)$ is a consistent estimator of $I((X,Y)|_G)$ (Roulston, 1999; Paninski, 2003). We might therefore expect $I^*(D,k,\ell)$ to be a consistent estimator of $I^*((X,Y),k,\ell)$ as well. And if $I^*(D,k,\ell)$ is a consistent estimator of $I^*((X,Y),k,\ell)$, then we might expect the maximum of the sample characteristic matrix (which just consists of normalized $I^*$ terms) to be a consistent estimator of the supremum of the true characteristic matrix.

These intuitions turn out to be true, but there are two reasons they are non-trivial to prove. First, consistency of $I^*$ does not follow from abstract considerations, since the supremum of an infinite set of estimators is not necessarily a consistent estimator of the supremum of the estimands.¹ Second, consistency of $I^*$ alone does not suffice to show that the maximum of the sample characteristic matrix converges to MIC∗. In particular, if $B(n)$ grows too quickly, and the convergence of $I^*(D,k,\ell)$ to $I^*((X,Y),k,\ell)$ is slow, inflated values of MIC can result. To see this, notice that if $B(n) = \infty$ then $\mathrm{MIC} = 1$ for uniformly generated noise at any finite sample size, even though each individual entry of the sample characteristic matrix converges to its true value eventually. The technical heart of the proof is overcoming these obstacles by using the dependencies between the quantities $I(D|_G)$ for different grids $G$ to not only show the consistency of $I^*(D,k,\ell)$ but also to quantify how quickly $I^*(D,k,\ell)$ converges to $I^*((X,Y),k,\ell)$.

¹ If $\hat\theta_1, \ldots, \hat\theta_k$ is a finite set of estimators, then a union bound shows that the random vector $(\hat\theta_1(D), \ldots, \hat\theta_k(D))$ converges in probability to $(\theta_1, \ldots, \theta_k)$ with respect to the supremum metric. The continuous mapping theorem then gives the desired result. However, if the set of estimators is infinite, the union bound cannot be employed. And indeed, if we let $\theta_1 = \theta_2 = \cdots = 0$, and let $\hat\theta_i(D_n) = i/n$ deterministically, then each $\hat\theta_i$ is a consistent estimator of $\theta_i$, but since the set $\{\hat\theta_1(D_n), \hat\theta_2(D_n), \ldots\} = \{1/n, 2/n, \ldots\}$ is unbounded, $\sup_i \hat\theta_i(D_n) = \infty$ for every $n$.

### 3.3 Second Alternate Characterization: MIC∗ Is a Minimally Smoothed Mutual Information

We now describe a second equivalent view of MIC∗. Recall that for a pair of jointly distributed random variables $(X, Y)$, we defined $\mathrm{MIC}^*(X,Y)$ as
$$\mathrm{MIC}^*(X,Y) = \sup_G \frac{I((X,Y)|_G)}{\log \|G\|}$$
where $\|G\|$ denotes the minimum of the number of rows of $G$ and the number of columns of $G$. As we discussed in Section 3.1, the mutual information $I(X,Y)$ is also a supremum, namely
$$I(X,Y) = \sup_G I((X,Y)|_G),$$
and so MIC∗ can be viewed as a regularized version of $I$. It is natural to ask whether the regularization in the definition of MIC∗ has any smoothing effect on $I$. In this sub-section we show first that it does, in the sense that MIC∗ is uniformly continuous as a function of random variables with respect to the metric of statistical distance,² and second that the regularization by $\log\|G\|$ is in some sense the minimal one necessary for achieving any sort of continuity. As a corollary, we obtain that $I$ by itself is not continuous as a function of random variables with respect to the metric of statistical distance. This provides a view of MIC∗ as a canonical smoothing of $I$ that yields continuity.

² Recall that the statistical distance between random variables $A$ and $B$ is defined as $\sup_T |P(A \in T) - P(B \in T)|$, where the supremum is over measurable sets $T$. When $A$ and $B$ have probability density functions or probability mass functions, this equals one-half of the $L^1$ distance between those functions.
Formally, let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$, equipped with the metric of statistical distance. Our first claim is that, as a function defined on $\mathcal{P}(\mathbb{R}^2)$, MIC∗ is uniformly continuous. We prove this claim by establishing a stronger result: the uniform continuity of the characteristic matrix $M(X,Y)$. Specifically, by showing that the family of maps corresponding to the individual entries of the characteristic matrix is uniformly equicontinuous, we obtain the following result.

**Theorem 8** The map from $\mathcal{P}(\mathbb{R}^2)$ to $m^\infty$ defined by $(X,Y) \mapsto M(X,Y)$ is uniformly continuous.

**Proof** See Appendix B.

Since the supremum is a uniformly continuous function on $m^\infty$, Theorem 8 yields the following corollary.

**Corollary 9** The map $(X,Y) \mapsto \mathrm{MIC}^*(X,Y)$ is uniformly continuous.

Similar corollaries exist for any uniformly continuous function of the characteristic matrix.

Interestingly, Theorem 8 relies crucially on the normalization in the definition of the characteristic matrix. This is not a coincidence: as the following proposition shows, any normalization that is meaningfully smaller than the one in the definition of the characteristic matrix will cause the matrix to contain a discontinuity as a function on $\mathcal{P}(\mathbb{R}^2)$.

**Proposition 10** For some function $N(k,\ell)$, let $M_N$ be the characteristic matrix with normalization $N$, i.e.,
$$M_N(X,Y)_{k,\ell} = \frac{I^*((X,Y),k,\ell)}{N(k,\ell)}.$$
If $N(k,\ell) = o(\log\min\{k,\ell\})$ along some infinite path in $\mathbb{N}\times\mathbb{N}$, then $M_N$ and $\sup M_N$ are not continuous as functions on $\mathcal{P}([0,1]\times[0,1]) \subseteq \mathcal{P}(\mathbb{R}^2)$.

**Proof** See Appendix C.

The above proposition implies that the smoothing that MIC∗ applies to mutual information is necessary in some sense. In particular, one corollary of the proposition is that mutual information with no smoothing contains a discontinuity.

**Corollary 11** Mutual information is not continuous on $\mathcal{P}([0,1]\times[0,1]) \subseteq \mathcal{P}(\mathbb{R}^2)$.

**Proof** Mutual information is the supremum of $M_N$ with $N \equiv 1$.

The same result can also be shown for the squared Linfoot correlation (Speed, 2011; Linfoot, 1957), which equals $1 - 2^{-2I}$, where $I$ denotes mutual information. Thus, though the Linfoot correlation smoothes the mutual information enough to cause it to lie in the unit interval, it does not smooth the mutual information sufficiently to cause it to be continuous.
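To give a concrete feel for Corollary 11, here is one standard construction of the discontinuity (our illustration, not the construction from the paper's Appendix C). For $\delta \in (0,1)$, let $m = \lceil 2^{1/\delta} \rceil$, and let $(X_\delta, Y_\delta)$ be drawn, with probability $1-\delta$, uniformly from the unit square, and, with probability $\delta$, uniformly from the union of the $m$ axis-aligned squares of side $1/m$ along the diagonal. The statistical distance between $(X_\delta, Y_\delta)$ and the independent uniform pair is at most $\delta$, while, letting $T$ indicate which mixture component was used and using the identity-based bound $I(X;Y) \ge I(X;Y \mid T) - H(T)$,

$$
I(X_\delta; Y_\delta) \;\ge\; \delta \log m - h(\delta) \;\ge\; 1 - h(\delta),
$$

where $h$ is the binary entropy function (the diagonal component has mutual information $\log m$, and the uniform component has mutual information $0$). As $\delta \to 0$, the pair $(X_\delta, Y_\delta)$ converges in statistical distance to an independent pair, whose mutual information is $0$, yet $I(X_\delta; Y_\delta)$ remains bounded away from $0$.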
As we remarked previously, these results, when contrasted with the uniform continuity of MIC∗, allow us to view the latter as a canonical minimally smoothed version of mutual information that is uniformly continuous. This view gives a meaningful interpretation to the normalization used in MIC∗. Understanding MIC∗ as having smoothness properties not shared by mutual information also suggests that estimators of MIC∗ may have better statistical properties than estimators of ordinary mutual information. This is consistent with a recent hardness-of-estimation result for mutual information in Ding and Li (2013) and is also borne out empirically in Reshef et al. (2015a).

### 3.4 Third Alternate Characterization: MIC∗ Is the Supremum of the Boundary of the Characteristic Matrix

We now show the third alternate view of MIC∗: that it can be equivalently defined as the supremum over a boundary of the characteristic matrix rather than as a supremum over all of the entries of the matrix. This characterization of MIC∗ will serve as the foundation both for our approach to approximating $\mathrm{MIC}^*(X,Y)$ and for the new estimator of MIC∗ that we introduce later in this paper.

We begin by defining what we mean by the boundary of the characteristic matrix. Our definition rests on the following observation.

**Proposition 12** Let $M$ be a population characteristic matrix. Then for $\ell \ge k$, $M_{k,\ell} \le M_{k,\ell+1}$.

**Proof** Let $(X,Y)$ be the random variable in question. Since we can always let a row/column be empty, we know that $I^*((X,Y),k,\ell) \le I^*((X,Y),k,\ell+1)$. And since $\ell, \ell+1 \ge k$, we know that $M_{k,\ell} = I^*((X,Y),k,\ell)/\log k \le I^*((X,Y),k,\ell+1)/\log k = M_{k,\ell+1}$.

Since the entries of the characteristic matrix are bounded, the monotone convergence theorem then gives the following corollary. In the corollary and henceforth, we let $M_{k,\infty} = \lim_{\ell\to\infty} M_{k,\ell}$ and define $M_{\infty,\ell}$ similarly.

**Corollary 13** Let $M$ be a population characteristic matrix. Then $M_{k,\infty}$ exists, is finite, and equals $\sup_{\ell \ge k} M_{k,\ell}$. The same is true for $M_{\infty,\ell}$.

The above corollary allows us to define the boundary of the characteristic matrix.

**Definition 14** Let $M$ be a population characteristic matrix. The boundary of $M$ is the set
$$\partial M = \{M_{k,\infty} : 1 < k < \infty\} \cup \{M_{\infty,\ell} : 1 < \ell < \infty\}.$$

The theorem below then gives a relationship between the boundary of the characteristic matrix and MIC∗.

**Theorem 15** Let $(X,Y)$ be a pair of jointly distributed random variables. We have
$$\mathrm{MIC}^*(X,Y) = \sup \partial M(X,Y)$$
where $M(X,Y)$ is the population characteristic matrix of $(X,Y)$.

**Proof** The following argument shows that every entry of $M$ is at most $\sup \partial M$: fix a pair $(k,\ell)$ and notice that either $k \le \ell$, in which case $M_{k,\ell} \le M_{k,\infty}$, or $\ell \le k$, in which case $M_{k,\ell} \le M_{\infty,\ell}$. Thus, $\mathrm{MIC}^* \le \sup(\{M_{\infty,\ell}\} \cup \{M_{k,\infty}\}) = \sup \partial M$. On the other hand, Corollary 13 shows that each element of $\partial M$ is a supremum over some elements of $M$. Therefore $\sup \partial M$, being a supremum over suprema of elements of $M$, cannot exceed $\sup M = \mathrm{MIC}^*$.

### 3.5 Approximating MIC∗ in Practice

The importance of the characterization in Theorem 15 from the previous sub-section is computational. Specifically, elements of the boundary of the characteristic matrix can be expressed in terms of a maximization over (one-dimensional) partitions rather than (two-dimensional) grids, and the former is much quicker to compute exactly. This is stated in the theorem below.

**Theorem 16** Let $M$ be the population characteristic matrix of $(X,Y)$. Then
$$M_{k,\infty} = \frac{\max_{P \in \mathbf{P}(k)} I(X, Y|_P)}{\log k}$$
where $\mathbf{P}(k)$ denotes the set of all partitions of size at most $k$.

**Proof** See Appendix D.

To formally state how this will help us from an algorithmic standpoint, we note that Theorems 15 and 16 above together give the following corollary.

**Corollary 17** Let $(X,Y)$ be a pair of jointly distributed random variables, and let $\mathbf{P}$ be the set of finite-size partitions. Then
$$\mathrm{MIC}^*(X,Y) = \sup\left(\left\{\frac{I(X, Y|_P)}{\log |P|} : P \in \mathbf{P}\right\} \cup \left\{\frac{I(X|_P, Y)}{\log |P|} : P \in \mathbf{P}\right\}\right)$$
where $|P|$ is the number of bins in the partition $P$.

We can exploit the fact that the expressions in the above corollary involve maximization only over one-dimensional partitions rather than two-dimensional grids to give an algorithm for computing elements of the boundary of the characteristic matrix to arbitrary precision, and by extension an approach to approximating MIC∗ in practice. To do so, we utilize as a subroutine a dynamic programming algorithm from Reshef et al. (2011) called Optimize_XAxis. Before continuing, we therefore give a brief overview of that algorithm.
**Overview of the Optimize_XAxis algorithm from Reshef et al. (2011).** The Optimize_XAxis algorithm takes as input a set $D$ of $n$ data points, a fixed partition into columns $Q$ of size $\ell$, a master partition into rows $\Pi$, and a number $k$.³ The algorithm returns, for each $2 \le i \le k$, the partition into rows $P_i \subseteq \Pi$ that maximizes the mutual information of $D|_{(P_i,Q)}$ among all sub-partitions of $\Pi$ of size at most $i$. The algorithm works by exploiting the fact that, conditioned on the location $y$ of the top-most line of $P_i$, the optimization of the rest of $P_i$ can be formulated as a sub-problem that depends only on the data points below $y$. The algorithm uses dynamic programming to store and reuse the solutions to these sub-problems, resulting in a runtime of $O(|\Pi|^2 k\ell)$. If a black-box algorithm is used to compute each required mutual information in time at most $T$, then the runtime of the algorithm can be shown to be $O(Tk|\Pi|)$.

³ Despite its name, the Optimize_XAxis algorithm can be used to optimize a partition of either axis. In our description of the algorithm here, we choose to describe the algorithm as it would work for optimizing a partition of the y-axis rather than the x-axis. This is for notational coherence of this paper only.

The following theorem shows that the theory developed about the boundary of the characteristic matrix, together with Optimize_XAxis, yields an efficient algorithm for computing entries of the boundary to arbitrary precision.

**Theorem 18** Given a pair of jointly distributed random variables $(X,Y)$, $M_{k,\infty}$ (resp. $M_{\infty,\ell}$) is computable to within an additive error of $O(k\varepsilon\log(1/(k\varepsilon))) + E$ (resp. $O(\ell\varepsilon\log(1/(\ell\varepsilon))) + E$) in time $O(kT(E)/\varepsilon)$ (resp. $O(\ell T(E)/\varepsilon)$), where $T(E)$ is the time required to numerically compute the mutual information of a continuous distribution to within an additive error of $E$.

**Proof** See Appendix E.

The algorithm proposed in Theorem 18 gives us a polynomial-time method for computing any finite subset of the boundary $\partial M$ of the population characteristic matrix $M(X,Y)$ of a random variable $(X,Y)$. Thus, if we have some $k_0, \ell_0$ such that the maximum of the finite subset $\{M_{k,\infty}, M_{\infty,\ell} : k \le k_0, \ell \le \ell_0\}$ of $\partial M$ is $\varepsilon$-close to the supremum of the entire set $\partial M$, we can compute $\mathrm{MIC}^*(X,Y)$ to within an error of $\varepsilon$. Though we usually do not have precise knowledge of $k_0$ and $\ell_0$, for many distributions it is easy to make very conservative educated guesses for them, in which case this algorithm allows us to approximate $\mathrm{MIC}^*(X,Y)$ very well in practice.

Being able to compute $\mathrm{MIC}^*(X,Y)$ to arbitrary precision in some cases has two main advantages. The first advantage is that it allows us to assess in simulations the large-sample properties of MIC∗ independent of any estimator. This is done in the companion paper (Reshef et al., 2015a), which shows that MIC∗ achieves high equitability with respect to $R^2$ on a set of noisy functional relationships, thereby confirming that statistically efficient estimation of MIC∗ is a worthwhile goal. The second advantage is that we can empirically assess the bias, variance, and expected squared error of estimators of MIC∗ by taking a distribution, computing MIC∗, and then comparing the result to estimates of it based on finite samples. In the next section, we introduce a new estimator MICe of MIC∗ and carry out such an analysis to compare its statistical properties to those of the statistic MIC from Reshef et al. (2011).
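Before moving on, here is a minimal sketch of the one-dimensional optimization at the heart of Optimize_XAxis and of the algorithms in the next section (our simplified rendering, not the paper's implementation). With the column partition $Q$ fixed, maximizing $I(D|_{(P,Q)}) = H(Q) - H(Q \mid P)$ over row partitions $P$ of size at most $k$ is equivalent to minimizing $H(Q \mid P)$, which is additive over rows and hence amenable to dynamic programming over prefixes of the points sorted by $y$. For simplicity, the sketch allows a separator at every inter-point gap rather than restricting to a master partition $\Pi$, so it does not achieve the $O(|\Pi|^2 k\ell)$ bound above:

```python
import numpy as np

def optimal_row_partition(col_labels_sorted_by_y, k):
    """Given points sorted by y and labeled by their (fixed) column as
    0-based integers, return the maximal I(D|_(P,Q)) in bits over row
    partitions P of size at most k."""
    labels = np.asarray(col_labels_sorted_by_y)
    n = len(labels)
    ncols = int(labels.max()) + 1

    def weighted_entropy(i, j):
        # (mass of row) * (entropy of column labels of points i..j-1)
        counts = np.bincount(labels[i:j], minlength=ncols)
        p = counts[counts > 0] / (j - i)
        return ((j - i) / n) * float(-(p * np.log2(p)).sum())

    INF = float("inf")
    # best[m][j]: minimal H(Q|P) for the first j points using at most m rows
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        best[m][0] = 0.0
        for j in range(1, n + 1):
            best[m][j] = min(best[m - 1][i] + weighted_entropy(i, j)
                             for i in range(j))
    marg = np.bincount(labels, minlength=ncols) / n
    h_q = float(-(marg[marg > 0] * np.log2(marg[marg > 0])).sum())
    return h_q - best[k][n]   # I = H(Q) - min H(Q|P)
```

For the sample characteristic matrix one would still have to search over column partitions $Q$ as well; for the equicharacteristic matrix introduced in the next section, $Q$ is simply an equipartition, which is what makes MICe efficiently computable.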
## 4. Estimating MIC∗ with MICe

As we have shown, MIC∗ is the population value of the statistic MIC introduced in Reshef et al. (2011). However, though consistent, the statistic MIC is not known to be efficiently computable, and in Reshef et al. (2011) a heuristic approximation algorithm, Approx-MIC, was used in its place. In this section, we leverage the theory we have developed here to introduce a new estimator of MIC∗ that is both consistent and efficiently computable. The new estimator, called MICe, has better runtime complexity even than the heuristic Approx-MIC algorithm, and runs orders of magnitude faster in practice.

The estimator MICe is based on one of the alternate characterizations of MIC∗ proven in the previous section. Namely, if MIC∗ can be viewed as the supremum of the boundary of the characteristic matrix rather than of the entire matrix, then only the boundary of the matrix must be accurately estimated in order to estimate MIC∗. This has the advantage that, whereas computing individual entries of the sample characteristic matrix involves finding optimal (two-dimensional) grids, estimating entries of the boundary requires us only to find optimal (one-dimensional) partitions. While the former problem is computationally difficult, the latter can be solved using the dynamic programming algorithm from Reshef et al. (2011) that we also employed in Section 3.5 to compute MIC∗ to arbitrary precision in the large-sample limit.

We formalize this idea via a new object called the equicharacteristic matrix, which we denote by $[M]$. The difference between $[M]$ and the characteristic matrix $M$ is as follows: while the $(k,\ell)$-th entry of $M$ is computed from the maximal achievable mutual information using any $k$-by-$\ell$ grid, the $(k,\ell)$-th entry of $[M]$ is computed from the maximal achievable mutual information using any $k$-by-$\ell$ grid that equipartitions the dimension with more rows/columns. (See Figure 1.) Despite this difference, as the equipartition in question gets finer and finer, it becomes indistinguishable from an optimal partition of the same size. This intuition can be formalized to show that the boundary of $[M]$ equals the boundary of $M$, and therefore that $\sup [M] = \sup M = \mathrm{MIC}^*$. It will then follow that estimating $[M]$ and taking the supremum, as we did with $M$ in the case of MIC, yields a consistent estimate of MIC∗.

### 4.1 The Equicharacteristic Matrix

We now define the equicharacteristic matrix and show that its supremum is indeed MIC∗. To do so, we first define a version of $I^*$ that equipartitions the dimension with more rows/columns. Note that in the definition, brackets are used to indicate the presence of an equipartition.

**Definition 19** Let $(X,Y)$ be jointly distributed random variables. Define
$$I^*((X,Y), k, [\ell]) = \max_{G \in \mathbf{G}(k,[\ell])} I((X,Y)|_G)$$
where $\mathbf{G}(k,[\ell])$ is the set of $k$-by-$\ell$ grids whose y-axis partition is an equipartition of size $\ell$. Define $I^*((X,Y),[k],\ell)$ analogously. Define $I^{[*]}((X,Y),k,\ell)$ to equal $I^*((X,Y),k,[\ell])$ if $k < \ell$ and $I^*((X,Y),[k],\ell)$ otherwise.

We now define the equicharacteristic matrix in terms of $I^{[*]}$. In the definition below, we continue our convention of using brackets to denote the presence of equipartitions.

**Definition 20** Let $(X,Y)$ be jointly distributed random variables. The population equicharacteristic matrix of $(X,Y)$, denoted by $[M](X,Y)$, is defined by
$$[M](X,Y)_{k,\ell} = \frac{I^{[*]}((X,Y),k,\ell)}{\log\min\{k,\ell\}}$$
for $k, \ell > 1$.

Figure 1: A schematic illustrating the difference between the characteristic matrix $M$ and the equicharacteristic matrix $[M]$. [Panels depict optimal grids along with the values $M(X,Y)_{2,3}$, achieved with $I = 0.918$, versus $[M](X,Y)_{2,3}$, achieved with $I^{[*]} = 0.613$; and $M(X,Y)_{2,9}$, achieved with $I = 0.918$, versus $[M](X,Y)_{2,9}$, achieved with $I^{[*]} = 0.918$.] (Top) When restricted to 2 rows and 3 columns, the characteristic matrix $M$ is computed from the optimal 2-by-3 grid. In contrast, the equicharacteristic matrix $[M]$ still optimizes the smaller partition of size 2 but is restricted to have the larger partition be an equipartition of size 3. This results in a lower mutual information of 0.613. (Bottom) When 9 columns are allowed instead of 3, the grid found by the characteristic matrix does not change, since the grid with 3 columns was already optimal. However, now the equicharacteristic matrix uses an equipartition into columns of size 9, whose resolution is able to fully capture the dependence between $X$ and $Y$.
The boundary of the equicharacteristic matrix can be defined via a limit in the same way as the boundary of the characteristic matrix. We then have the following theorem.

**Theorem 21** Let $(X,Y)$ be jointly distributed random variables. Then $\partial[M] = \partial M$.

**Proof** See Appendix F.

Since every entry of the equicharacteristic matrix is dominated by some entry on its boundary, the equivalence of $\partial[M]$ and $\partial M$ yields the following corollary as a simple consequence.

**Corollary 22** Let $(X,Y)$ be jointly distributed random variables. Then $\sup [M](X,Y) = \mathrm{MIC}^*(X,Y)$.

### 4.2 The Estimator MICe

With the equicharacteristic matrix defined, we can now define our new estimator MICe in terms of the sample equicharacteristic matrix, analogously to the way we defined MIC in terms of the sample characteristic matrix.

**Definition 23** Let $D \subseteq \mathbb{R}^2$ be a set of ordered pairs. The sample equicharacteristic matrix $\widehat{[M]}(D)$ of $D$ is defined by
$$\widehat{[M]}(D)_{k,\ell} = \frac{I^{[*]}(D,k,\ell)}{\log\min\{k,\ell\}}.$$

**Definition 24** Let $D \subseteq \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
$$\mathrm{MIC}_{e,B}(D) = \max_{k\ell \le B(n)} \widehat{[M]}(D)_{k,\ell}.$$

With the equivalence between the boundary of the characteristic matrix and that of the equicharacteristic matrix established, it is straightforward to show that MICe is a consistent estimator of MIC∗ via arguments similar to those we applied in the case of MIC. (See Appendix G.) Specifically, we show the following theorem, an analogue of Theorem 6.

**Theorem 25** Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every pair of jointly distributed random variables $(X,Y)$, we have
$$f\left(r_{B(n)}\left(\widehat{[M]}(D_n)\right)\right) \to f([M](X,Y)) \text{ in probability}$$
where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

By setting $f([M]) = \sup [M]$, we then obtain as a corollary the consistency of MICe.

**Corollary 26** $\mathrm{MIC}_{e,B}$ is a consistent estimator of MIC∗ provided $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

As with the statistic MIC, the statistic MICe requires the user to specify a function $B(n)$. While the theory suggests that any function of the form $B(n) = n^\alpha$ suffices provided $0 < \alpha < 1$, different values of $\alpha$ may yield different finite-sample properties. We study the empirical performance of MICe for different choices of $B(n)$ in Section 4.4 and point the reader to specific recommendations for practical use in Section 4.4.1.
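A simplified sketch of Definitions 23-24 (illustrative only, under two simplifying assumptions: it equipartitions both axes rather than optimizing the smaller partition, which the true statistic does, e.g., with the dynamic program sketched in Section 3.5; and it uses a naive plug-in mutual information):

```python
import numpy as np

def plug_in_mi(joint):
    """Plug-in mutual information (bits) of a contingency table."""
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

def mic_e_simplified(x, y, alpha=0.6):
    """Max over k*l <= B(n) = n**alpha of normalized MI on grids whose
    rows and columns both mass-equipartition the sample."""
    n = len(x)
    B = int(n ** alpha)
    rx = np.argsort(np.argsort(x))   # ranks 0..n-1
    ry = np.argsort(np.argsort(y))
    best = 0.0
    for k in range(2, B // 2 + 1):        # rows
        for l in range(2, B // k + 1):    # columns, with k*l <= B(n)
            rows = (k * ry // n).astype(int)   # equipartition via ranks
            cols = (l * rx // n).astype(int)
            joint = np.zeros((l, k))
            np.add.at(joint, (cols, rows), 1)
            best = max(best, plug_in_mi(joint) / np.log2(min(k, l)))
    return best
```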
### 4.3 Computing MICe

Both MIC and MICe are consistent estimators of MIC∗. The difference between them is that while MIC can currently be computed efficiently only via a heuristic approximation, MICe can be computed exactly, and very efficiently, via an approach similar to the one used for approximating MIC∗ involving the Optimize_XAxis subroutine. We now describe the details of this approach.

Recall that, given a fixed x-axis partition $Q$ into $\ell$ columns, a set of $n$ data points, a master y-axis partition $\Pi$, and a number $k$, the Optimize_XAxis subroutine finds, for every $2 \le i \le k$, a y-axis partition $P_i \subseteq \Pi$ of size at most $i$ that maximizes the mutual information induced by the grid $(P_i, Q)$. The algorithm does this in time $O(|\Pi|^2 k\ell)$. (For more discussion of Optimize_XAxis, see Section 3.5.)

In the pair of theorems below, we show two ways that Optimize_XAxis can be used to compute MICe efficiently. In the proofs of both theorems, we neglect issues of divisibility; e.g., we often write $B/2$ rather than $\lfloor B/2 \rfloor$. This does not affect the results.

**Theorem 27** There exists an algorithm Equichar that, given a sample $D$ of size $n$ and some $B \in \mathbb{Z}^+$, computes the portion $r_{B(n)}(\widehat{[M]}(D))$ of the sample equicharacteristic matrix in time $O(n^2B^2)$, which equals $O(n^{4-2\varepsilon})$ for $B(n) = O(n^{1-\varepsilon})$ with $\varepsilon > 0$.

**Proof** We describe the algorithm and simultaneously bound its runtime. We do so only for the $(k,\ell)$-th entries of $\widehat{[M]}(D)$ satisfying $k \le \ell$ and $k\ell \le B$. This suffices, since by symmetry computing the rest of the required entries at most doubles the runtime. To compute $\widehat{[M]}(D)_{k,\ell}$ with $k \le \ell$, we must fix an equipartition into $\ell$ columns on the x-axis and then find the optimal partition of the y-axis of size at most $k$. If we set the master partition $\Pi$ of the Optimize_XAxis algorithm to be an equipartition into rows of size $n$, then it performs precisely the required optimization. Moreover, for fixed $\ell$ it can carry out the optimization simultaneously for all of the pairs $\{(2,\ell), \ldots, (B/\ell, \ell)\}$ in time $O(|\Pi|^2(B/\ell)\ell) = O(n^2B)$. For fixed $\ell$, this set contains all the pairs $(k,\ell)$ satisfying $k \le \ell$ and $k\ell \le B$. Therefore, to compute all the required entries of $\widehat{[M]}(D)$ we need only apply this algorithm for each $\ell = 2, \ldots, B/2$. Doing so gives a runtime of $O(n^2B^2)$.

The algorithm above, while polynomial-time, is nonetheless not efficient enough for use in practice. However, a simple modification solves this problem without affecting the consistency of the resulting estimates. The modification hinges on the fact that Optimize_XAxis can use master partitions $\Pi$ besides the equipartition of size $n$ that we used above. Specifically, setting $\Pi$ in the above algorithm to be an equipartition into $ck$ "clumps", where $k$ is the size of the largest optimal partition being sought, speeds up the computation significantly. This modification gives a slightly different statistic, but one that has all of the theoretical properties of MICe, namely consistent estimation of MIC∗ and efficient exact computation. These properties are formalized in the following theorem.

**Theorem 28** Let $(X,Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X,Y)$. For every $c \ge 1$, there exists a matrix $\{\hat M\}_c(D_n)$ such that

1. The function $\mathrm{MIC}_{e,B}(\cdot) = \max_{k\ell \le B(n)} \{\hat M\}_c(\cdot)_{k,\ell}$ is a consistent estimator of MIC∗ provided $\omega(1) \le B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

2. There exists an algorithm Equichar_Clump for computing $r_B(\{\hat M\}_c(D_n))$ in time $O(n + B^{5/2})$, which equals $O(n + n^{5(1-\varepsilon)/2})$ when $B(n) = O(n^{1-\varepsilon})$.

**Proof** See Appendix H.

For an analysis of the effect of the parameter $c$ in the above theorem on the results of the Equichar_Clump algorithm, see Appendix H.3. Setting $\varepsilon = 0.6$ in the above theorem, so that $B(n) = O(n^{0.4})$ and the runtime becomes $O(n + n^{5 \cdot 0.4/2}) = O(n)$, yields the following corollary.

**Corollary 29** MIC∗ can be estimated consistently in linear time.
Of course, at low sample sizes, setting $\varepsilon = 0.6$ would be undesirable. However, our companion paper (Reshef et al., 2015a) shows empirically that at large sample sizes this strategy works very well on typical relationships.

We remark that the Equichar_Clump algorithm given above is asymptotically faster even than the heuristic Approx-MIC algorithm used to calculate MIC in practice, which runs in time $O(B(n)^4)$. As demonstrated in our companion paper (Reshef et al., 2015a), this difference translates into a substantial difference in runtimes for similar performance at a range of realistic sample sizes, ranging from a 30-fold speedup at $n = 500$ to over a 350-fold speedup at $n = 10{,}000$.

For readability, in the rest of this paper we do not distinguish between the two versions of MICe computed by the Equichar and Equichar_Clump algorithms described above. Wherever we present simulation data about MICe, though, we use the version of the statistic computed by Equichar_Clump.

### 4.4 Bias/Variance Characterization of MICe

The algorithm we presented in Section 3.5 for computing MIC∗ to arbitrary precision in some cases allows us to examine the bias/variance properties of estimators of MIC∗. Here, we use it to examine the bias and variance both of MIC as computed by the heuristic Approx-MIC algorithm from Reshef et al. (2011) and of MICe as computed by the Equichar_Clump algorithm given above. To do this, we performed a simulation analysis on the following set of relationships:
$$Q = \left\{\left(x + \varepsilon_\sigma,\ f(x) + \varepsilon'_\sigma\right) : x \in X_f,\ \varepsilon_\sigma, \varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\ge 0}\right\}$$
where $\varepsilon_\sigma$ and $\varepsilon'_\sigma$ are i.i.d., $F$ is the set of 16 functions analyzed in Reshef et al. (2011), and $X_f$ is the set of $n$ x-values that result in the points $(x_i, f(x_i))$ being equally spaced along the graph of $f$.

For each relationship $Z \in Q$ that we examined, we used the algorithm from Theorem 18 with very conservative values of $k_0$ and $\ell_0$ to compute MIC∗. We then simulated 500 independent samples from $Z$, each of size $n = 500$, and computed both Approx-MIC and MICe on each one to obtain estimates of the sampling distributions of the two statistics. From each of the two sampling distributions, we estimated the bias and variance of the corresponding statistic on $Z$. We then analyzed the bias, variance, and expected squared error of the two statistics as a function of relationship strength, which we quantified using the coefficient of determination ($R^2$) with respect to the generating function.

The results, presented in Figure 2, are interesting for two reasons. First, they demonstrate that for a typical usage parameter of $B(n) = n^{0.6}$, MICe performs substantially better than Approx-MIC overall. Specifically, the median of the expected squared error of MICe across the set $F$ of functions is uniformly lower across $R^2$ values than that of Approx-MIC. When average expected squared error is used instead of the median, MICe still performs better on all but the strongest of relationships ($R^2$ above 0.9). The superior performance of MICe is consistent with the fact that we have theoretical guarantees about its statistical properties, whereas Approx-MIC is a heuristic.

Second, the results show that different values of the exponent in $B(n) = n^\alpha$ give good performance in different signal-to-noise regimes, due to a bias-variance trade-off represented by this parameter. We expand on this phenomenon and discuss its implications for choosing $\alpha$ in practice below.
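A minimal sketch of this bias/variance procedure (illustrative; `estimator` stands in for MICe or Approx-MIC and `mic_star` for the value computed via Theorem 18, neither of which is implemented here):

```python
import numpy as np

def bias_variance(sample_from_Z, estimator, mic_star, reps=500, n=500):
    """Estimate bias, variance, and expected squared error of an estimator
    of MIC* on a relationship Z with known population value mic_star."""
    estimates = np.array([estimator(*sample_from_Z(n)) for _ in range(reps)])
    bias = estimates.mean() - mic_star
    variance = estimates.var()
    mse = ((estimates - mic_star) ** 2).mean()
    return bias, variance, mse
```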
#### 4.4.1 Choosing B(n)

Large values of $\alpha$ lead to increased expected error in lower-signal regimes (low $R^2$), through both a positive bias in those regimes and a general increase in variance that predominantly affects those regimes. On the other hand, small values of $\alpha$ lead to increased expected error in higher-signal regimes (high $R^2$), by leading to a negative bias in those regimes and by shifting the variance of the estimator toward those regimes. In other words, lower values of $\alpha$ are better suited for detecting weaker signals, while higher values of $\alpha$ are better suited for distinguishing among stronger signals. This is consistent with the results seen in our companion paper (Reshef et al., 2015a), which show that low values of $\alpha$ cause MICe to yield better-powered independence tests while high values of $\alpha$ cause MICe to have better equitability.

Reshef et al. (2015a) provides simple, empirical recommendations about appropriate values of $\alpha$ for different settings. Those recommendations are formulated by choosing a set of representative relationships (e.g., a set of noisy functional relationships), as well as a ground-truth population quantity $\Phi$ (e.g., $R^2$) that can be used to quantify the strength of each of those relationships, and then assessing which values of $\alpha$ maximize the equitability of MICe with respect to $\Phi$ at a given sample size. This approach is applied to an analysis of real data from the World Health Organization in Reshef et al. (2015a), and the parameters chosen for that analysis are the ones used for all subsequent analyses in this paper.

We remark that if the goal of the user is only detection of non-trivial relationships, rather than discovery of the strongest such relationships, $\alpha$ can also be chosen in a more straightforward manner: the user can subsample a small random set of relationships on which to compare the power of MICe for different values of $\alpha$. Those relationships can then be discarded and the rest of the relationships analyzed with the optimal value of $\alpha$. However, if the user's primary goal is power against independence, the statistic TICe introduced in Section 5 of this paper should be used with this strategy rather than MICe.

### 4.5 Equitability of MICe

As mentioned previously, one of the main motivations for the introduction of MIC was equitability, the extent to which a measure of dependence usefully captures some notion of relationship strength on some set of standard relationships. We therefore carried out an empirical analysis of the equitability of MICe with respect to $R^2$ and compared its performance to distance correlation (Székely et al., 2007; Székely and Rizzo, 2009), mutual information estimation (Kraskov et al., 2004), and maximal correlation estimation (Breiman and Friedman, 1985).

Figure 2: Bias/variance characterization of Approx-MIC and MICe. Each plot shows expected squared error, bias, or variance across the set of noisy functional relationships described in Section 4.4 as a function of the $R^2$ of the relationships. The results are aggregated across the 16 function types analyzed by either the average, median, or worst result at every value of $R^2$. (a) A comparison between MICe (light purple) and MIC as computed via the heuristic Approx-MIC algorithm (black), at a typical usage parameter. (b) Performance of MICe with $B(n) = n^\alpha$ for various values of $\alpha$.
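As a concrete, simplified diagnostic of what the red intervals in Figures 3 and 4 below measure, a sketch like the following can be used (our illustration, not the formal definition of equitability, which appears in Reshef et al., 2015b):

```python
import numpy as np

def widest_interval(r2, score, num_bins=100):
    """Crude equitability diagnostic: bin the statistic's values and report
    the widest range of R^2 observed within any one bin. Smaller is more
    equitable (width 0 means the statistic determines R^2 exactly,
    regardless of relationship type)."""
    r2, score = np.asarray(r2), np.asarray(score)
    edges = np.linspace(score.min(), score.max(), num_bins + 1)
    idx = np.clip(np.digitize(score, edges) - 1, 0, num_bins - 1)
    widths = [r2[idx == b].ptp() for b in range(num_bins)
              if (idx == b).sum() > 1]
    return max(widths, default=0.0)
```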
We began by assessing equitability on the set of relationships $Q$ defined above, a set that has been analyzed in previous work (Reshef et al., 2011, 2015a; Kinney and Atwal, 2014). The results, shown in Figure 3, confirm the superior equitability of the new estimator MICe on this set of relationships.

To assess equitability more objectively, without relying on a manually curated set of functions, we then analyzed 160 random functions drawn from a Gaussian process distribution with a radial basis function kernel, with one of eight possible bandwidths in the set {0.01, 0.025, 0.05, 0.1, 0.2, 0.25, 0.5, 1}, to represent a range of possible relationship complexities. The results, shown in Figure 4, show that MICe outperforms existing methods in terms of equitability with respect to $R^2$ on these functions as well. Appendix Figure J1 shows a version of this analysis under a different noise model that yields the same conclusion. We also examined the effect of outlier relationships on our results by repeatedly subsampling random subsets of 20 functions from this large set of relationships and measuring the equitability of each method on average over the subsets; the results were similar.

One feature of the performance of MICe on these randomly chosen relationships that is demonstrated in Figure 4 is that it appears minimally sensitive to the bandwidth of the Gaussian process from which a given relationship is drawn. This puts it in contrast to, e.g., mutual information estimation, which shows a pronounced sensitivity to this parameter that prevents it from being highly equitable when relationships with different bandwidths are present in the same data set.

In our companion paper (Reshef et al., 2015a), we perform more in-depth analyses of the equitability with respect to $R^2$ of MICe, MIC, and the four measures of dependence described above, as well as the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005, 2007), the Heller-Heller-Gorfine (HHG) test (Heller et al., 2013), the data-derived partitions (DDP) test (Heller et al., 2016), and the randomized dependence coefficient (RDC) (Lopez-Paz et al., 2013). These analyses consider a range of sample sizes, noise models, marginal distributions, and parameter settings. They conclude that, in terms of equitability with respect to $R^2$ on the sets of noisy functional relationships studied, (a) MICe uniformly outperforms MIC, and (b) MICe outperforms all the methods tested in the large majority of settings examined. Appendix Figure I1 contains a reproduction of a representative equitability analysis from that paper for the reader's reference.

Figure 3: Equitability with respect to $R^2$ on a set of noisy functional relationships of (a) the Pearson correlation coefficient, (b) a hypothetical measure of dependence $\varphi$ with perfect equitability, (c) distance correlation, (d) MICe, (e) maximal correlation estimation, and (f) mutual information estimation. [Horizontal axis: $\Phi$, e.g., $R^2$.] For each relationship, a shaded region denotes estimated 5th and 95th percentile values of the sampling distribution of the statistic in question on that relationship at every $R^2$. The resulting plot shows which values of $R^2$ correspond to a given value of each statistic. The red interval on each plot indicates the widest range of $R^2$ values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. A red interval with width 0, as in (b), means that the statistic reflects only $R^2$ with no dependence on relationship type, as demonstrated by the pairs of thumbnails of relationships of different types with identical $R^2$ values.

Figure 4: Equitability of the methods examined on functions randomly drawn from a Gaussian process distribution. Each method is assessed as in Figure 3, with a red interval indicating the widest range of $R^2$ values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. Each shaded region corresponds to one relationship, and the regions are colored by the bandwidth of the Gaussian process from which they were sampled. Sample relationships for each bandwidth are shown in the top right with matching colors.

## 5. The Total Information Coefficient

So far we have presented results about estimators of the population maximal information coefficient, a quantity for which equitability is the primary motivation. We now introduce and analyze a new measure of dependence, the total information coefficient (TIC). In contrast to the maximal information coefficient, the total information coefficient is designed not for equitability but rather as a test statistic for testing a null hypothesis of independence.

We begin by giving some intuition. Recall that the maximal information coefficient is the supremum of the characteristic matrix. While estimating the supremum of this matrix has many advantages, this estimation involves taking a maximum over many estimates of individual entries of the characteristic matrix.
Since maxima of sets of random variables tend to become large as the number of variables grows, one can imagine that this procedure may lead to an undesirable positive bias in the case of statistical independence, when the population characteristic matrix equals 0. This might be detrimental for independence testing, in which the sampling distribution of a statistic under a null hypothesis of independence is crucial. The intuition behind the total information coefficient is that if we instead consider a more stable property, such as the sum of the entries in the characteristic matrix, we might expect to obtain a statistic with a smaller bias in the case of independence, and therefore better power. Stated differently, if our only goal is to distinguish any dependence at all from complete noise, then disregarding all of the sample characteristic matrix except for its maximal value may throw away useful signal; the total information coefficient avoids this by summing all the entries.

We remark that in Reshef et al. (2011) it is suggested that other properties of the characteristic matrix may allow us to measure other aspects of a given relationship besides its strength, and several such properties were defined. The total information coefficient fits within this conceptual framework.

In this section we define the total information coefficient in the case of both the characteristic matrix (TIC) and the equicharacteristic matrix (TICe). We then prove that both TIC and TICe yield independence tests that are consistent against all dependent alternatives. (As in the case of MIC and MICe, TICe is more easily computable than TIC.)
5.1 Definition and Consistency of the Total Information Coefficient

We begin by defining the two versions of the total information coefficient. In the definition below, recall that $\hat{M}$ denotes a sample characteristic matrix whereas $\widehat{[M]}$ denotes a sample equicharacteristic matrix.

Definition 30 Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define
\[ \mathrm{TIC}_B(D) = \sum_{k\ell \le B(n)} \hat{M}(D)_{k,\ell} \qquad \text{and} \qquad \mathrm{TIC}_{e,B}(D) = \sum_{k\ell \le B(n)} \widehat{[M]}(D)_{k,\ell}. \]

To show that these two statistics lead to consistent independence tests, we must take a step back and analyze the behavior of the analogous population quantities.

Definition 31 For a matrix $A$ and a positive number $B$, the $B$-partial sum of $A$, denoted by $S_B(A)$, is
\[ S_B(A) = \sum_{k\ell \le B} A_{k,\ell}. \]

When $A$ is an (equi)characteristic matrix, $S_B(A)$ is the sum over all entries corresponding to grids with at most $B$ total cells. Thus, if $\hat{M}(D)$ is a sample characteristic matrix of a sample $D$ of size $n$, then $S_{B(n)}(\hat{M}(D)) = \mathrm{TIC}_B(D)$, and the same holds for $S_{B(n)}(\widehat{[M]}(D))$ and $\mathrm{TIC}_{e,B}(D)$.

It is clear that if $X$ and $Y$ are statistically independent random variables, then both the characteristic matrix $M(X,Y)$ and the equicharacteristic matrix $[M](X,Y)$ are identically 0, so that $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B$. However, we are also interested in how these quantities behave when $X$ and $Y$ are dependent. The following pair of propositions helps us understand this. The first proposition shows a lower bound on the values of entries in both $M(X,Y)$ and $[M](X,Y)$. The second proposition translates this into an asymptotic characterization of how quickly $S_B(M)$ and $S_B([M])$ grow as functions of $B$. These two propositions are the technical heart of why the total information coefficient yields a consistent independence test.

Proposition 32 Let $(X,Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $M(X,Y) \equiv [M](X,Y) \equiv 0$. If not, then there exist some $a > 0$ and some integer $\ell_0 \ge 2$ such that
\[ M(X,Y)_{k,\ell},\ [M](X,Y)_{k,\ell} \ge \frac{a}{\log\min\{k,\ell\}} \]
either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.

Proof See Appendix K.1.

Proposition 33 Let $(X,Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B > 0$. If not, then $S_B(M(X,Y))$ and $S_B([M](X,Y))$ are both $\Omega(B\log\log B)$.

Proof See Appendix K.2.

The propositions above, together with reasoning analogous to the convergence arguments presented earlier, can be used to show the main result of this section, namely that the statistics TIC and TICe yield consistent independence tests.

Theorem 34 The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof See Appendix K.3.

In practice, we often use the Equichar Clump algorithm (see Section 4.3) to compute the equicharacteristic matrix from which we calculate TICe. This algorithm does not compute the sample equicharacteristic matrix exactly. However, as in the case of MICe, the use of the algorithm does not affect the theoretical properties of the statistic. This is proven in Appendix H.
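In applied settings, the null distribution implied by Theorem 34 is typically approximated by simulation. The following sketch (ours; `stat` stands in for a hypothetical TICe implementation) shows a standard permutation scheme for calibrating the right-tailed test:

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=1000, seed=0):
    """Right-tailed permutation p-value for an independence test based on
    the statistic `stat` (e.g., a hypothetical tic_e(x, y) function).
    Permuting y breaks any dependence while preserving both marginal
    distributions, which simulates the null hypothesis of independence."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    null = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    # The add-one correction keeps the p-value valid at finite n_perm.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```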
5.2 Power of Independence Tests Based on TICe

With the consistency of independence tests based on TIC and TICe established, we turn now to empirical evaluation of the power of independence testing based on TICe as computed using the Equichar Clump algorithm. To evaluate the power of TICe-based tests, we reproduced the analysis performed in Simon and Tibshirani (2012). Namely, we considered the set of relationships they analyzed, defined by
\[ Q = \left\{ (X, f(X) + \varepsilon) \;:\; X \sim \mathrm{Unif},\ f \in F,\ \varepsilon \sim \mathcal{N}(0, \sigma^2),\ \sigma \in \mathbb{R}_{\ge 0} \right\} \]
where $F$ is a set of functions specified in Simon and Tibshirani (2012). (NB: one of the relationships is a circle, which we treat as a union of two half-circles.) For each relationship $Z$ in this set that we examined, we simulated a null hypothesis of independence with the same marginal distributions, and generated 1,000 independent samples, each with a sample size of $n = 500$, from both $Z$ and from the null distribution. These were used to estimate the power of the size-$\alpha$ right-tailed independence test based on each statistic being evaluated.

Following Simon and Tibshirani, we compared TICe to the distance correlation (Székely et al., 2007; Székely and Rizzo, 2009), to the original maximal information coefficient (Reshef et al., 2011) as approximated using Approx-MIC, and to the Pearson correlation. (Though it is not a measure of dependence, the Pearson correlation was presumably included by Simon and Tibshirani as an intuitive benchmark for what is achievable under a linear model.) We also compared to MICe using identical parameters to those of TICe, to examine whether the summation performed by TICe is better than maximization when all other things are equal. Note that we do not compare to methods of analyzing contingency tables, such as Pearson's chi-squared test. This is because our data are real-valued rather than discrete, and so contingency-based methods are not applicable. However, when data are discrete, those methods can be very well powered.

The results of our analysis are presented in Figure 5. First, the figure shows that TICe compares quite favorably with distance correlation, a method considered to have state-of-the-art power (Simon and Tibshirani, 2012). Specifically, TICe uniformly outperforms distance correlation on five of the eight relationship types examined, and performs comparably to it on the other three. We remark that distance correlation has many advantages over TICe, including the fact that it easily generalizes to higher-dimensional relationships and comes with an elegant and comprehensive theoretical framework. The analysis also shows that TICe outperforms the original maximal information coefficient by a very large margin, and outperforms MICe as well, supporting the intuition that the summation performed by TICe can indeed lead to substantial gains in power against independence over the maximization performed by MICe. (We note that in both Simon and Tibshirani's analysis and in this one, the original maximal information coefficient was run with default parameters that were optimized for equitability rather than power against independence. When run with different parameters, its power improves substantially, though it still does not match the power of MICe. See Appendix Figure I2 and the discussion in Reshef et al., 2015a.)
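For reference, the power computation just described can be written out as follows (a sketch under our own naming; `sample_alt` and `sample_null` stand in for samplers of the relationship under study and of its independent counterpart with matching marginals):

```python
import numpy as np

def estimate_power(sample_alt, sample_null, stat, alpha=0.05, n_trials=1000):
    """Monte Carlo estimate of the power of the size-alpha right-tailed
    test based on `stat`, mirroring the Simon-Tibshirani setup: the
    rejection threshold is the empirical (1 - alpha) quantile of the
    statistic under the simulated null."""
    null_stats = np.array([stat(*sample_null()) for _ in range(n_trials)])
    threshold = np.quantile(null_stats, 1 - alpha)
    alt_stats = np.array([stat(*sample_alt()) for _ in range(n_trials)])
    return float(np.mean(alt_stats > threshold))
```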
Our companion paper (Reshef et al., 2015a) expands on this analysis, conducting an in-depth evaluation of the power against independence of the tests described above, as well as tests based on mutual information estimation (Kraskov et al., 2004), maximal correlation estimation (Breiman and Friedman, 1985), HSIC (Gretton et al., 2005, 2007), HHG (Heller et al., 2013), DDP (Heller et al., 2016), and RDC (Lopez-Paz et al., 2013). These analyses consider a range of sample sizes and parameter settings, as well as a variety of ways of quantifying power across different alternative hypothesis relationship types and noise levels. They conclude that in most settings TICe either outperforms all the methods tested or performs comparably to the best ones. Appendix Figure I2 contains a reproduction of one detailed set of power curves from the main analysis in that paper for the reader's reference.

Figure 5: Comparison of the power of independence testing based on TICe (blue) to MIC with default parameters (gray), MICe with the same parameters as TICe (black), distance correlation (purple), and the Pearson correlation coefficient (green) across several alternative hypothesis relationship types chosen by Simon and Tibshirani (2012). The relationships analyzed are described in Section 5.2.

6. Conclusion

As high-dimensional data sets become increasingly common, data exploration requires not only statistics that can accurately detect a large number of non-trivial relationships in a data set, but also ones that can identify a smaller number of the strongest relationships. The former property is achieved by measures of dependence that yield independence tests with high power; the latter is achieved by measures of dependence that are equitable with respect to some measure of relationship strength. In this paper, we introduced two related measures of dependence that achieve these two goals, respectively, through the following theoretical contributions.

- A new population measure of dependence, MIC∗, that we proved can be viewed in three different ways: as the population value of the maximal information coefficient (MIC) from Reshef et al. (2011), as a minimal smoothing of mutual information that makes it uniformly continuous, or as the supremum of an infinite sequence defined in terms of optimal partitions of one marginal at a time of a given joint distribution.
- An efficient approach for approximating the MIC∗ of a given joint distribution.
- A statistic MICe that is a consistent estimator of MIC∗, is efficiently computable, and has good equitability with respect to R2 both on a manually chosen set of noisy functional relationships as well as on a set of randomly chosen noisy functional relationships.
- The total information coefficient (TICe), a statistic that arises as a trivial side-product of the computation of MICe and yields a consistent and powerful independence test.

Though we presented here some empirical results for MIC∗, MICe, and TICe, our focus was on theoretical considerations; the performance of these methods is analyzed in detail in our companion paper (Reshef et al., 2015a). That paper shows that on a large set of noisy functional relationships with varying noise and sampling properties, the asymptotic equitability with respect to R2 of MIC∗ is quite high and the equitability with respect to R2 of MICe is state-of-the-art.
It also shows that the power of the independence test based on TICe is state-of-the-art across a wide variety of dependent alternative hypotheses. Finally, it demonstrates that the algorithms presented here allow for MICe and TICe to be computed simultaneously very quickly, enabling analysis of extremely large data sets using both statistics together.

Our contributions are of both theoretical and practical importance for several reasons. First, our characterization of MIC∗ as the large-sample limit of MIC sheds light on the latter statistic. For example, while MIC is parametrized, MIC∗ is not. Knowing that MIC converges in probability to MIC∗ tells us that this parametrization is statistical only: it controls the bias/variance properties of the statistic, but not its asymptotic behavior.

Second, the normalization in the definition of MIC, while empirically seen to yield good performance, had previously not been theoretically understood. Our result that this normalization is the minimal smoothing necessary to make mutual information uniformly continuous provides for the first time a lens through which the normalization is canonical. In doing so, it constitutes an initial step toward understanding the role of the normalization in the performance of MIC and MIC∗. The uniform continuity of MIC∗ and the lack of continuity of ordinary mutual information also suggest that estimation of the former may be easier in some sense than estimation of the latter. This is consonant with a recent result concerning the difficulty of estimating mutual information shown in Ding and Li (2013). It is also borne out empirically by the substantial finite-sample bias and variance, observed in Reshef et al. (2015a), of the Kraskov mutual information estimator (Kraskov et al., 2004) compared to MICe.

Third, our alternate characterization of MIC∗ in terms of one-dimensional optimization over partitions rather than two-dimensional optimization over grids enhances our understanding of how to efficiently compute it in the large-sample limit and estimate it from finite samples using MICe. This is a significant improvement over the previous state of affairs, in which the statistic MIC could only be approximated heuristically, with even the heuristic approximation being orders of magnitude slower than the results in this paper now allow.

Finally, the introduction of the total information coefficient provides evidence that the basic approach of considering the set of normalized mutual information values achievable by applying different grids to a joint distribution is of fundamental value in characterizing dependence. Interestingly, a statistic introduced in Heller et al. (2016) follows a similar approach by considering the (non-normalized) sum of the mutual information values achieved by all possible finite grids. Consistent with our demonstration here that an aggregative grid-based approach works well, that statistic also achieves excellent power. (TICe is compared to the statistic from Heller et al. 2016 in our companion paper, Reshef et al., 2015a.)

Taken together, our results point to joint use of the statistics MICe and TICe as a theoretically grounded, computationally efficient, and highly practical approach to data exploration. Specifically, since the two statistics can be computed simultaneously with little extra cost beyond that of computing either individually, we propose computing both of them on all variable pairs in a data set, using TICe to filter out non-significant associations, and then using MICe to rank the remaining variable pairs. Such a strategy would have the advantage of leveraging the state-of-the-art power of TICe to substantially reduce the multiple-testing burden on MICe, while utilizing the latter statistic's state-of-the-art equitability to effectively rank relationships for follow-up by the practitioner.
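The sketch below illustrates this proposal (with hypothetical `tic_e`, `mic_e`, and `tic_pvalue` helpers standing in for an actual implementation; a real analysis would also apply a multiple-testing correction, e.g., Benjamini-Hochberg, to the significance threshold):

```python
def screen_and_rank(pairs, tic_e, mic_e, tic_pvalue, alpha=0.05):
    """Two-stage exploration workflow: filter variable pairs by the
    significance of TICe, then rank the surviving pairs by MICe for
    follow-up. `pairs` is an iterable of (x, y) sample arrays; the three
    helper callables are hypothetical stand-ins."""
    survivors = [(x, y) for (x, y) in pairs
                 if tic_pvalue(tic_e(x, y)) < alpha]
    return sorted(survivors, key=lambda xy: mic_e(*xy), reverse=True)
```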
Our results, while useful, nevertheless have limitations that warrant exploration in future work. First, for a sample $D$ from the distribution of some random $(X,Y)$, all of the sample quantities we define here use the naive estimate $I(D|G)$ of the quantity $I((X,Y)|G)$ for various grids $G$. There is a long and fruitful line of work on more sophisticated estimators of the discrete mutual information (Paninski, 2003) whose use in place of $I(D|G)$ could improve the statistics introduced here. Second, our approach to approximating the MIC∗ of a given joint density consists of computing a finite subset of an infinite set whose supremum we seek to calculate. However, the choice of how large a finite set we should compute in order to approximate the supremum to a given precision remains heuristic. Finally, though empirical characterization of the equitability of MICe on representative sets of relationships is important and promising, we are still missing a theoretical characterization of its equitability in the large-sample limit. A clear theoretical demarcation of the set of relationships on which MIC∗ achieves good equitability with respect to R2, and an understanding of why that is, would greatly advance our understanding of both MIC∗ and equitability.

Acknowledgments

We would like to acknowledge R Adams, E Airoldi, T Broderick, A Gelman, M Gorfine, R Heller, J Huggins, T Jaakkola, J Mueller, J Tenenbaum, and R Tibshirani for constructive conversations and useful feedback. HKF was supported by the Fannie and John Hertz Foundation. MM was supported in part by NSF grants CCF-1563710, CCF-1535795, CCF-1320231, and CNS-1228598. DNR and YAR were supported by the Paul and Daisy Soros Foundation. YAR was supported by award number T32GM007753 from the National Institute of General Medical Sciences, as well as the National Defense Science and Engineering Graduate Fellowship. PCS was supported by the Howard Hughes Medical Institute. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.

Appendix A. Proof of Theorem 6

This appendix is devoted to proving Theorem 6, restated below.

Theorem Let $f : \mathcal{M}^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every random variable $(X,Y)$, we have
\[ f\left(r_{B(n)}\left(\hat{M}(D_n)\right)\right) \to f(M(X,Y)) \text{ in probability} \]
where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We prove the theorem by a sequence of lemmas that build on each other to bound the bias of $I^*(D, k, \ell)$. The general strategy is to capture the dependencies between different $k$-by-$\ell$ grids $G$ by considering a master grid $\Gamma$ that contains many more than $k\ell$ cells. Given this master grid, we first bound the difference between $I(D|G)$ and $I((X,Y)|G)$ only for sub-grids $G$ of $\Gamma$. The bound is in terms of the difference between $D|\Gamma$ and $(X,Y)|\Gamma$. We then show that this bound can be extended without too much loss to all $k$-by-$\ell$ grids. This gives what we seek, because then the difference between $I(D|G)$ and $I((X,Y)|G)$ is uniformly bounded for all grids $G$ in terms of the same random variable: $D|\Gamma$. Once this is done, standard arguments give the consistency we seek.
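In symbols, the strategy is the triangle-inequality decomposition that Lemma 36 below makes precise: for an arbitrary $k$-by-$\ell$ grid $G$ with closest sub-grid $G'$ of $\Gamma$,
\[ |I(D|G) - I((X,Y)|G)| \le |I(D|G) - I(D|G')| + |I(D|G') - I((X,Y)|G')| + |I((X,Y)|G') - I((X,Y)|G)|, \]
where the middle term is bounded uniformly over all sub-grids of $\Gamma$ (Lemma 35) and the outer terms are bounded by the probability mass that moves when $G$ is snapped to $\Gamma$ (Proposition 40).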
In our argument we occasionally require technical facts about entropy and mutual information that are self-contained and unrelated to the central ideas. These lemmas are consolidated in Appendix L. We begin by using one of these technical lemmas to prove a bound on the difference between $I(D|G)$ and $I((X,Y)|G)$ that is uniform over all grids $G$ that are sub-grids of a much denser grid $\Gamma$. The common structure imposed by $\Gamma$ will allow us to capture the dependence between the quantities $|I(D|G) - I((X,Y)|G)|$ for different grids $G$.

Lemma 35 Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables distributed over the cells of a grid $\Gamma$, and let $(\pi_{i,j})$ and $(\psi_{i,j})$ be their respective distributions. Define
\[ \varepsilon_{i,j} = \frac{\psi_{i,j} - \pi_{i,j}}{\pi_{i,j}}. \]
Let $G$ be a sub-grid of $\Gamma$ with $B$ cells. Then for every fixed $0 < a < 1$ we have
\[ |I(\Psi|G) - I(\Pi|G)| \le O\Bigl( (\log B) \sum_{i,j} |\varepsilon_{i,j}| \Bigr) \]
when $|\varepsilon_{i,j}| \le 1 - a$ for all $i$ and $j$.

Proof Let $P = \Pi|G$ and $Q = \Psi|G$ be the random variables induced by $\Pi$ and $\Psi$ respectively on the cells of $G$. Using the fact that $I(X, Y) = H(X) + H(Y) - H(X,Y)$, we write
\[ |I(Q) - I(P)| \le |H(Q_X) - H(P_X)| + |H(Q_Y) - H(P_Y)| + |H(Q) - H(P)| \]
where $Q_X$ and $P_X$ denote the marginal distributions on the columns of $G$ and $Q_Y$ and $P_Y$ denote the marginal distributions on the rows. We can bound each of the terms on the right-hand side of the equation above using a Taylor expansion argument given in Lemma 51, whose proof is found in Appendix L. Doing so gives
\[ |I(Q) - I(P)| \le (\ln B)\Bigl( \sum_i O(|\varepsilon_{i,\cdot}|) + \sum_j O(|\varepsilon_{\cdot,j}|) + \sum_{i,j} O(|\varepsilon_{i,j}|) \Bigr), \]
where
\[ \varepsilon_{i,\cdot} = \frac{\sum_j (\psi_{i,j} - \pi_{i,j})}{\sum_j \pi_{i,j}} \]
and $\varepsilon_{\cdot,j}$ is defined analogously. To obtain the result, we observe that
\[ |\varepsilon_{i,\cdot}| = \frac{\left|\sum_j \pi_{i,j}\,\varepsilon_{i,j}\right|}{\sum_j \pi_{i,j}} \le \sum_j \frac{\pi_{i,j}}{\sum_{j'} \pi_{i,j'}}\, |\varepsilon_{i,j}| \le \sum_j |\varepsilon_{i,j}| \]
since $\pi_{i,j}/\sum_{j'} \pi_{i,j'} \le 1$, and the analogous bound holds for $|\varepsilon_{\cdot,j}|$.

We now extend Lemma 35 to all grids with $B$ cells rather than just those that are sub-grids of the master grid $\Gamma$. The proof of this lemma relies on an information-theoretic result, proven in Appendix B, that bounds the difference in mutual information between two distributions that can be obtained from each other by moving a small amount of probability mass.

Lemma 36 Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables, and let $\Gamma$ be a grid. Define $\varepsilon_{i,j}$ on $\Pi|\Gamma$ and $\Psi|\Gamma$ as in Lemma 35. Let $G$ be any grid with $B$ cells, and let $\delta$ (resp. $d$) represent the total probability mass of $\Pi|\Gamma$ (resp. $\Psi|\Gamma$) falling in cells of $\Gamma$ that are not contained in individual cells of $G$. We have that
\[ |I(\Psi|G) - I(\Pi|G)| \le O\Bigl( \bigl( \textstyle\sum_{i,j} |\varepsilon_{i,j}| + \delta + d \bigr)\log B + \delta\log(1/\delta) + d\log(1/d) \Bigr) \]
provided that the $|\varepsilon_{i,j}|$ are bounded away from 1 and that $d, \delta \le 1/2$.

Proof In the proof below, we use the convention that for any two grids $G$ and $G'$ and any random variable $Z$, the expression $Z(G, G')$ denotes $|I(Z|G) - I(Z|G')|$. Consider the grid $G'$ obtained by replacing every horizontal or vertical line in $G$ that is not in $\Gamma$ with a closest line in $\Gamma$. The grid $G'$ is clearly a sub-grid of $\Gamma$. Moreover, $\Pi|G'$ (resp. $\Psi|G'$) can be obtained from $\Pi|G$ (resp. $\Psi|G$) by moving at most $\delta$ (resp. $d$) probability mass.
This can be shown to imply that
\[ \Pi(G, G') \le O\bigl(\delta\log(1/\delta) + \delta\log B\bigr) \qquad\text{and}\qquad \Psi(G', G) \le O\bigl(d\log(1/d) + d\log B\bigr). \]
The proof of this information-theoretic fact is self-contained and so we defer it to Proposition 40 in Appendix B, as it is more central to the arguments presented there. With $\Pi(G, G')$ and $\Psi(G', G)$ bounded in terms of $\delta$ and $d$, we can bound $|I(\Psi|G) - I(\Pi|G)|$ using the triangle inequality by comparing it with
\[ \Pi(G, G') + |I(\Pi|G') - I(\Psi|G')| + \Psi(G', G) \]
and bounding the middle term using Lemma 35, since $G' \subseteq \Gamma$.

We now use the fact that the variables $\varepsilon_{i,j}$ defined in Lemma 35 are small with high probability to give a concrete bound on the bias of $I(D|G)$ that is uniform over all $k$-by-$\ell$ grids $G$ and that holds with high probability. It is useful at this point to recall that, given a distribution $(X,Y)$, an equipartition of $(X,Y)$ is a grid $G$ such that all the rows of $(X,Y)|G$ have the same probability mass, and all the columns do as well.

Lemma 37 Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X,Y)$ of jointly distributed random variables. For any $\alpha \ge 0$, any $\varepsilon > 0$, and any integers $k, \ell > 1$, we have that for all $n$
\[ |I(D_n|G) - I((X,Y)|G)| \le O\left( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \right) \]
for every $k$-by-$\ell$ grid $G$ with probability at least $1 - C(n)e^{-\Omega(n/C(n)^{1+2\alpha})}$, where $C(n) = k\ell n^{\varepsilon/2}$.

Proof Fix $n$, and let $\Gamma$ be an equipartition of $(X,Y)$ into $kn^{\varepsilon/4}$ rows and $\ell n^{\varepsilon/4}$ columns. $C(n)$ is now the number of cells in $\Gamma$. Lemma 36, with $\Pi = (X,Y)$ and $\Psi = D$, shows that $|I(D|G) - I((X,Y)|G)|$ is at most
\[ O\Bigl( \bigl( \textstyle\sum_{i,j} |\varepsilon_{i,j}| + \delta + d \bigr)\log(k\ell) + \delta\log(1/\delta) + d\log(1/d) \Bigr) \]
provided the $\varepsilon_{i,j}$ have absolute value bounded away from 1, and provided that $d, \delta \le 1/2$. The remainder of the proof proceeds as follows. We first show that the $\varepsilon_{i,j}$ are small with high probability. This will both show that the lemma's requirement on the $\varepsilon_{i,j}$ holds and allow us to bound the sum in the inequality above. We will then use our bound on the $\varepsilon_{i,j}$ to bound $d$ in terms of $\delta$. Finally, we will bound $\delta$ using the fact that the number of rows and columns in $\Gamma$ increases with $n$. This will give us that $d, \delta \le 1/2$ and allow us to bound the rest of the terms in the expression above.

Bounding the $\varepsilon_{i,j}$: We bound the $\varepsilon_{i,j}$ using a multiplicative Chernoff bound. Let $\pi_{i,j}$ and $\psi_{i,j}$ represent the probability mass functions of $(X,Y)|\Gamma$ and $D|\Gamma$ respectively. We write
\[ P\left(|\varepsilon_{i,j}| \ge \delta\right) = P\bigl(\psi_{i,j} \notin [\pi_{i,j}(1-\delta),\ \pi_{i,j}(1+\delta)]\bigr) \le e^{-\Omega(n\pi_{i,j}\delta^2)} \]
since $n\psi_{i,j}$ is a sum of $n$ i.i.d. Bernoulli random variables and $E(n\psi_{i,j}) = n\pi_{i,j}$. (See, e.g., Mitzenmacher and Upfal 2005.) Setting $\delta = \sqrt{\pi_{i,j}}/C(n)^{1/2+\alpha}$ yields
\[ P\left( |\varepsilon_{i,j}| \ge \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}} \right) \le e^{-\Omega(n/C(n)^{1+2\alpha})}. \]
A union bound over the pairs $(i,j)$ then gives that, with the desired probability, the above bound on $|\varepsilon_{i,j}|$ holds for all $i, j$.

Bounding $\sum |\varepsilon_{i,j}|$: The bound on the $\varepsilon_{i,j}$ implies that
\[ \sum_{i,j} |\varepsilon_{i,j}| \le \frac{1}{C(n)^{1/2+\alpha}} \sum_{i,j} \sqrt{\pi_{i,j}} \le \frac{\sqrt{C(n)}}{C(n)^{1/2+\alpha}} = \frac{1}{C(n)^{\alpha}}, \]
where the second inequality follows from the fact that the function $\sum \sqrt{\pi_{i,j}}$ is symmetric and concave and therefore, when restricted to the hyperplane $\sum \pi_{i,j} = 1$, must achieve its maximum when $\pi_{i,j} = 1/C(n)$ for all $i, j$.

Bounding $d$ in terms of $\delta$: We use our bound on the $\varepsilon_{i,j}$ to bound $d$. We do so by observing that it implies
\[ \psi_{i,j} \le \pi_{i,j}\left(1 + \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}}\right) = \pi_{i,j} + \frac{\pi_{i,j}^{3/2}}{C(n)^{1/2+\alpha}} \le \pi_{i,j} + \frac{\pi_{i,j}}{C(n)^{1/2+\alpha}} \le 2\pi_{i,j} \]
since $\pi_{i,j} \le 1$ and $C(n) \ge 1$. The connection to $d$ comes from the fact that for any column $j$ of $\Gamma$, this means that $\sum_i \psi_{i,j} \le 2\pi_{\cdot,j}$. This also applies to the sums across rows.
Since $d$ is a sum of terms of the form $\psi_{\cdot,j}$ and $\psi_{i,\cdot}$ for $j$ in some index set $J$ and $i$ in an index set $I$, and $\delta$ is a sum of terms of the form $\pi_{\cdot,j}$ and $\pi_{i,\cdot}$ with the same index sets, we therefore get that $d \le 2\delta$.

Bounding $\delta$ and obtaining the result: To bound $\delta$, we observe that because $G$ has at most $\ell - 1$ vertical lines and $k - 1$ horizontal lines, we have
\[ \delta \le \frac{\ell}{\ell n^{\varepsilon/4}} + \frac{k}{k n^{\varepsilon/4}} = \frac{2}{n^{\varepsilon/4}}. \]
This bound on $\delta$ allows us to bound the terms involving $d$ and $\delta$ by $O\bigl(\log(k\ell n)/n^{\varepsilon/4}\bigr)$. Combining all of the bounds gives the desired result.

Our final lemma shows that as long as $B(n)$ doesn't grow too fast, the bound from the previous lemma yields a uniform bound on the entire sample characteristic matrix. This is done by specifying an error threshold for which Lemma 37 yields a bound that holds with high probability, and then invoking a union bound.

Lemma 38 Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X,Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,
\[ \left| \hat{M}(D_n)_{k,\ell} - M(X,Y)_{k,\ell} \right| \le O\left(\frac{1}{n^a}\right) \]
holds for all $k\ell \le B(n)$ with probability $P(n) = 1 - o(1)$, where $\hat{M}(D_n)_{k,\ell}$ is the $(k,\ell)$-th entry of the sample characteristic matrix and $M(X,Y)_{k,\ell}$ is the $(k,\ell)$-th entry of the population characteristic matrix of $(X,Y)$.

Proof Fix $k$, $\ell$, and any $\alpha$ satisfying $0 < \alpha < \varepsilon/(4 - 2\varepsilon)$. Lemma 37 implies that with high probability the difference $|\hat{M}(D_n)_{k,\ell} - M_{k,\ell}|$ is at most
\[ O\left( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \right) \le O\left( \frac{\log n}{n^{\alpha\varepsilon/2}} + \frac{\log n}{n^{\varepsilon/4}} \right), \]
where the inequality holds because $k\ell \le B(n)$ and because $C(n) = k\ell n^{\varepsilon/2} \ge n^{\varepsilon/2}$. This bound is at most $O(1/n^a)$ for every $a < \min\{\alpha\varepsilon/2, \varepsilon/4\}$, as desired. It remains only to show that the bound holds with high probability across all $k\ell \le B(n)$.

Lemma 37 states that the probability our bound holds for one fixed pair $(k,\ell)$ is at least
\[ 1 - C(n)e^{-\Omega(n/C(n)^{1+2\alpha})} \ge 1 - O(n)\,e^{-\Omega(n^u)} \]
for some positive $u$. This is because $C(n) \le B(n)n^{\varepsilon/2} \le O(n^{1-\varepsilon/2})$ for large $n$, and so our choice of $\alpha$ ensures that $C(n)^{1+2\alpha} = O(n^{1-u})$ for some $u > 0$. We can then perform a union bound over all pairs $k\ell \le B(n)$: since the number of such pairs can be bounded by a polynomial in $n$, we have that the desired condition is satisfied for all $k\ell \le B(n)$ with probability approaching 1.

We are now ready to prove the main result.

Theorem Let $f : \mathcal{M}^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every random variable $(X,Y)$, we have
\[ f\left(r_{B(n)}\left(\hat{M}(D_n)\right)\right) \to f(M(X,Y)) \text{ in probability} \]
where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof Let $N$ denote $B(n)$, let $M_N = r_N(M)$, and let $\hat{M}_N(D_n) = r_N(\hat{M}(D_n))$. We begin by writing
\[ \left|f\bigl(\hat{M}_N(D_n)\bigr) - f(M)\right| \le \left|f\bigl(\hat{M}_N(D_n)\bigr) - f(M_N)\right| + |f(M_N) - f(M)| = \left|f\bigl(\hat{M}_N(D_n)\bigr) - f(M_N)\right| + |(f \circ r_N)(M) - f(M)| \]
and observing that as $n \to \infty$, the second term vanishes by the pointwise convergence of $f \circ r_i$ and the fact that $B(n) > \omega(1)$. It therefore suffices to show that the first term converges to zero in probability. Since $f$ is uniformly continuous, we can establish this via a simple adaptation of the continuous mapping theorem, which says that if the sequence of random variables $R_n \to R$ in probability, and $g$ is continuous, then $g(R_n) \to g(R)$ in probability. We replace $R$ with a second sequence, and replace continuity with uniform continuity.

Let $\|\cdot\|_\infty$ denote the supremum norm on $\mathcal{M}^\infty$, and fix any $z > 0$. Then, for any $\delta > 0$, define
\[ C_\delta = \left\{ A \in \mathcal{M}^\infty \;:\; \exists A' \in \mathcal{M}^\infty \text{ s.t. } \|A - A'\|_\infty < \delta,\ |f(A) - f(A')| > z \right\}. \]
This is the set of matrices $A \in \mathcal{M}^\infty$ for which it is possible to find, within a $\delta$-neighborhood of $A$, a second matrix that $f$ maps to more than $z$ away from $f(A)$. Because $f$ is uniformly continuous, there exists a $\delta'$ sufficiently small so that $C_{\delta'} = \emptyset$. Suppose that $|f(\hat{M}_N(D_n)) - f(M_N)| > z$. This means that either $\|\hat{M}_N(D_n) - M_N\|_\infty > \delta'$, or $M_N \in C_{\delta'}$. The latter option is impossible since $C_{\delta'} = \emptyset$, and Lemma 38 tells us that $P\left(\|\hat{M}_N(D_n) - M_N\|_\infty > \delta'\right) \to 0$ as $n$ grows. We therefore have that $|f(\hat{M}_N(D_n)) - f(M_N)| \to 0$ in probability, as desired.

Appendix B. Proof of Theorem 8

In this appendix we prove Theorem 8, reproduced below.

Theorem Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$ equipped with the metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $\mathcal{M}^\infty$ defined by $(X,Y) \mapsto M(X,Y)$ is uniformly continuous.

The proposition below begins our argument with the simple observation that the family of maps consisting of applying any finite grid to some $(X,Y) \in \mathcal{P}(\mathbb{R}^2)$ is uniformly equicontinuous. The reason this holds is that $(X,Y)|G$ is a deterministic function of $(X,Y)$, and deterministic functions cannot increase statistical distance.

Proposition 39 Let $\mathcal{G}$ be the set of all finite grids. The family $\{(X,Y) \mapsto (X,Y)|G : G \in \mathcal{G}\}$ is uniformly equicontinuous on $\mathcal{P}(\mathbb{R}^2)$.

Proof To establish uniform equicontinuity, we need to show that, given some $(X,Y) \in \mathcal{P}(\mathbb{R}^2)$ and some $\varepsilon > 0$, we can choose $\delta$ to satisfy the continuity condition in a way that does not depend on $G$ or on $(X,Y)$. But because deterministic functions cannot increase statistical distance, we have that if $(X,Y), (X',Y') \in \mathcal{P}(\mathbb{R}^2)$ are at most $\varepsilon$ apart then
\[ d\bigl((X,Y)|G,\ (X',Y')|G\bigr) \le d\bigl((X,Y),\ (X',Y')\bigr) = \varepsilon \]
where $d(\cdot,\cdot)$ denotes statistical distance. Choosing $\delta = \varepsilon$ therefore gives the result.

At this point it is tempting to try to use continuity properties of discrete mutual information to obtain uniform continuity of the characteristic matrix. And indeed, this strategy does yield that each individual entry of the characteristic matrix is a uniformly continuous function. However, to obtain continuity of the entire (infinite) characteristic matrix we need to make a statement about all grid resolutions simultaneously. This is not straightforward because mutual information is only uniformly continuous for a fixed grid resolution, and the family $\{(X,Y) \mapsto I((X,Y)|G) : G \in \mathcal{G}\}$ is in fact not even equicontinuous.

The normalization in the definition of MIC is what allows us to establish the uniform continuity of the characteristic matrix despite this problem. To see why, suppose we have a distribution over a $k$-by-$\ell$ grid and we are allowed to move at most $\delta$ away in statistical distance for some small $\delta$. The largest change in discrete mutual information that this can cause indeed increases as we increase $k$ and $\ell$. However, it turns out that we can bound the extent of this "non-uniformity": the proposition below shows that as we move away from a distribution, the discrete mutual information can change only proportionally to the amount of mass we move, with the proportionality constant bounded by $\log\min\{k,\ell\}$. Because $\log\min\{k,\ell\}$ is the quantity by which we regularize the entries of the characteristic matrix, this is exactly enough to make the normalized matrix continuous. This proposition is the technical heart of our continuity result. And as we show in Corollary 11, where we demonstrate the non-continuity of the non-normalized characteristic matrix, our bound is tight.
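The tightness claim can be checked numerically with a discretized version of the $Z_\varepsilon$ construction used in Appendix C below (the `discrete_mi` helper is ours, not from the paper):

```python
import numpy as np

def discrete_mi(p):
    """Mutual information (in nats) of a joint pmf given as a 2-D array."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

# Mass 1 - eps in a single "independent" cell plus eps spread along the
# diagonal of the remaining (k-1)-by-(k-1) block. The mutual information
# grows like eps * log(k - 1), so moving an arbitrarily small amount of
# mass changes the non-normalized matrix by an arbitrarily large amount;
# dividing by log min{k, l} is exactly what tames this growth.
eps = 0.01
for k in (4, 16, 64, 256):
    p = np.zeros((k, k))
    p[0, 0] = 1 - eps
    idx = np.arange(1, k)
    p[idx, idx] = eps / (k - 1)
    print(k, discrete_mi(p), eps * np.log(k - 1))
```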
Proposition 40 Let $I_{k,\ell} : \mathcal{P}(\{1,\dots,k\}\times\{1,\dots,\ell\}) \to \mathbb{R}$ denote the discrete mutual information function on $k$-by-$\ell$ grids. For $0 < \delta \le 1/4$, the maximal change in $I_{k,\ell}$ over any subset of $\mathcal{P}(\{1,\dots,k\}\times\{1,\dots,\ell\})$ of diameter $\delta$ (in statistical distance) is
\[ O\bigl( \delta\log(1/\delta) + \delta\log\min\{k,\ell\} \bigr). \]

Proof Without loss of generality, assume $k \le \ell$, so that $\log\min\{k,\ell\} = \log k$. Let $(X,Y)$ and $(X',Y')$ be two random variables distributed over $\{1,\dots,k\}\times\{1,\dots,\ell\}$ that are at most $\delta$ apart in statistical distance. Using $I(X,Y) = H(Y) - H(Y|X)$, we can express the difference between the mutual information of these two pairs of random variables as
\[ |I(X,Y) - I(X',Y')| \le |H(Y) - H(Y')| + |H(Y|X) - H(Y'|X')|. \]
We now use Lemma 55, which relates movement of probability mass to changes in entropy and is proven in Appendix L, to separately bound each of the terms on the right-hand side. Straightforward application of the lemma to $|H(Y) - H(Y')|$ shows that it is at most $2H_b(2\delta) + 3\delta\log k$, where $H_b(\cdot)$ is the binary entropy function. Since $H_b(x) \le O(x\log(1/x))$ for $x$ small, this is $O(\delta\log(1/\delta) + \delta\log k)$.

Bounding the term with the conditional entropies is more involved. Let $p_x = P(X = x)$, and let $p'_x = P(X' = x)$. We have
\[
\begin{aligned}
|H(Y|X) - H(Y'|X')| &= \Bigl| \sum_x p_x H(Y|X = x) - p'_x H(Y'|X' = x) \Bigr| \\
&\le \sum_x p_x \bigl| H(Y|X = x) - H(Y'|X' = x) \bigr| + \sum_x |p'_x - p_x|\, H(Y'|X' = x) && (1) \\
&\le \sum_x p_x \bigl| H(Y|X = x) - H(Y'|X' = x) \bigr| + \delta\log k && (2)
\end{aligned}
\]
where the last line is because $\sum_x |p_x - p'_x| \le \delta$ and $H(Y'|X' = x) \le \log k$.

Now let $\delta_{x+}$ be the magnitude of all the probability mass entering any cell in column $x$, let $\delta_{x-}$ be the magnitude of all the probability mass leaving any cell in column $x$, and let $\delta_x = \delta_{x+} + \delta_{x-}$. Using this notation, we can again apply Lemma 55 to obtain
\[ \sum_x p_x \bigl| H(Y|X = x) - H(Y'|X' = x) \bigr| \le 2H_b(2\delta) + 3\delta\log k, \]
where the last step is by application of Lemma 52 from the appendix, which bounds weighted sums of binary entropies. Combining this with Line (2) gives that
\[ |H(Y|X) - H(Y'|X')| \le 2H_b(2\delta) + 4\delta\log k \]
which, together with the bound on $|H(Y) - H(Y')|$ and the fact that $H_b(x) \le O(x\log(1/x))$ for $x$ small, gives the result.

Having bounded the extent to which variation in mutual information depends on grid resolution, we are now ready to show the uniform continuity of the characteristic matrix.

Theorem Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$ equipped with the metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $\mathcal{M}^\infty$ defined by $(X,Y) \mapsto M(X,Y)$ is uniformly continuous.

Proof We complete the proof in three steps. First, we show that a certain family of functions $F$ is uniformly equicontinuous. Second, we use this to show that a different family $F'$ consisting of functions of the form $\sup_{g \in A} g$ with $A \subseteq F$ is uniformly equicontinuous. Finally, we argue that since the entries of $M(X,Y)$ consist of the functions in $F'$, this is sufficient to establish the result.

Define
\[ F = \left\{ (X,Y) \mapsto \frac{I_{k,\ell}((X,Y)|G)}{\log\min\{k,\ell\}} \;:\; k,\ell \in \mathbb{Z}_{>1},\ G \in \mathcal{G}(k,\ell) \right\}. \]
$F$ is uniformly equicontinuous by the following argument. Given some $\varepsilon > 0$, we know (Proposition 39) that for any $(X',Y')$ in an $\varepsilon$-ball around $(X,Y)$, $(X',Y')|G$ will remain within $\varepsilon$ of $(X,Y)|G$ for any $G$. Proposition 40 then tells us that if $\varepsilon$ is sufficiently small then the distance between $I_{k,\ell}((X',Y')|G)$ and $I_{k,\ell}((X,Y)|G)$ will be at most
\[ O\bigl( \varepsilon\log(1/\varepsilon) + \varepsilon\log\min\{k,\ell\} \bigr). \]
After the normalization, this becomes at most $O(\varepsilon(\log(1/\varepsilon) + 1))$, which goes to zero (uniformly with respect to $(X,Y)$) as $\varepsilon$ approaches zero, as desired.
Next, define
\[ F' = \left\{ (X,Y) \mapsto M(X,Y)_{k,\ell} \;:\; k,\ell \in \mathbb{Z}_{>1} \right\}. \]
Each map in $F'$ is of the form $\sup_{g \in A} g$ for some $A \subseteq F$. Therefore, for a given $\varepsilon > 0$, whatever $\delta$ establishes the uniform equicontinuity of $F$ can be used to establish continuity of all the functions in $F'$. (To see this: $\sup_{g \in A} g$ can't increase by more than $\varepsilon$ if no $g$ increases by more than $\varepsilon$, and $\sup_{g \in A} g$ is also lower-bounded by each of the $g$'s, so it can't decrease by more than $\varepsilon$ either.) Since we can use the same $\delta$ for all of the maps in $F'$, they therefore form a uniformly equicontinuous family. Finally, the $\delta$ provided by the uniform equicontinuity of $F'$ also ensures that $M(X',Y')$ is within $\varepsilon$ of $M(X,Y)$ in the supremum norm, thus giving the uniform continuity of $(X,Y) \mapsto M(X,Y)$.

Appendix C. Proof of Proposition 10

Theorem For some function $N(k,\ell)$, let $M^N$ be the characteristic matrix with normalization $N$, i.e.,
\[ M^N(X,Y)_{k,\ell} = \frac{I^*((X,Y), k, \ell)}{N(k,\ell)}. \]
If $N(k,\ell) = o(\log\min\{k,\ell\})$ along some infinite path $P$ in $\mathbb{N}\times\mathbb{N}$, then $M^N$ and $\sup M^N$ are not continuous as functions on $\mathcal{P}([0,1]\times[0,1]) \subset \mathcal{P}(\mathbb{R}^2)$.

Proof Consider a random variable $Z$ uniformly distributed on $[0,1/2]^2$. Because $Z$ exhibits statistical independence, $I^*(Z, k, \ell)$ is zero for all $k, \ell$. Now define $Z_\varepsilon$ to be uniformly distributed on $[0,1/2]^2$ with probability $1-\varepsilon$ and uniformly distributed on the line from $(1/2,1/2)$ to $(1,1)$ with probability $\varepsilon$. We lower-bound $I^*(Z_\varepsilon, k, \ell)$. Without loss of generality suppose that $k \le \ell$, and consider a grid that places all of $[0,1/2]^2$ into one cell and uniformly partitions the set $[1/2,1]^2$ into $k-1$ rows and $k-1$ columns. By considering just the rows/columns in the set $[1/2,1]^2$ we see that this grid gives a mutual information of at least $\varepsilon\log(k-1)$. Thus, we have that for all $k, \ell$,
\[ I^*(Z_\varepsilon, k, \ell) \ge \varepsilon\log\min\{k-1, \ell-1\}. \]
This implies that the limit of $M^N(Z_\varepsilon)$ along $P$ is $\infty$, and so the distance between $M^N(Z)$ and $M^N(Z_\varepsilon)$ in the supremum norm is infinite.

Appendix D. Proof of Theorem 16

Theorem Let $M$ be a population characteristic matrix. Then $M_{k,\infty}$ equals
\[ \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|P)}{\log k} \]
where $\mathbf{P}(k)$ denotes the set of all partitions of size at most $k$.

Proof Define
\[ M'_{k,\infty} = \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|P)}{\log k}. \]
We wish to show that $M'_{k,\infty}$ is in fact equal to $M_{k,\infty}$. To show that $M_{k,\infty} \le M'_{k,\infty}$, we observe that for every $k$-by-$\ell$ grid $G = (P, Q)$, where $P$ is a partition into rows and $Q$ is a partition into columns, the data processing inequality gives $I((X,Y)|G) \le I(X, Y|P)$. Thus $M_{k,\ell} \le M'_{k,\infty}$ for $\ell \ge k$, implying that $M_{k,\infty} = \lim_{\ell\to\infty} M_{k,\ell} \le M'_{k,\infty}$.

It remains to show that $M'_{k,\infty} \le M_{k,\infty}$. To do this, we let $P$ be any partition into $k$ rows, and we define $Q_\ell$ to be an equipartition into $\ell$ columns. We let
\[ M'_{k,\ell,P} = \frac{I(X|Q_\ell,\ Y|P)}{\log k}. \]
Since $M'_{k,\ell,P} \le M_{k,\ell}$ when $\ell \ge k$, we have that for all $P$
\[ \frac{I(X, Y|P)}{\log k} = \lim_{\ell\to\infty} M'_{k,\ell,P} \le \lim_{\ell\to\infty} M_{k,\ell} = M_{k,\infty}, \]
which gives that $M'_{k,\infty} = \sup_P I(X, Y|P)/\log k \le M_{k,\infty}$, as desired.

Appendix E. Proof of Theorem 18

Theorem Given a random variable $(X,Y)$, $M_{k,\infty}$ (resp. $M_{\infty,\ell}$) is computable to within an additive error of $O(k\varepsilon\log(1/(k\varepsilon))) + E$ (resp. $O(\ell\varepsilon\log(1/(\ell\varepsilon))) + E$) in time $O(k\,T(E)/\varepsilon)$ (resp. $O(\ell\,T(E)/\varepsilon)$), where $T(E)$ is the time required to numerically compute the mutual information of a continuous distribution to within an additive error of $E$.

Proof Without loss of generality we prove the claim only for $M_{k,\infty}$. Given $0 < \varepsilon < 1$, we would like a partition into rows $P$ of size at most $k$ such that $I(X, Y|P)$ is maximized.
We would like to use Optimize XAxis for this purpose, but while our search problem is continuous, Optimize XAxis can only perform a discrete search over sub-partitions of some master partition $\Pi$. We therefore set $\Pi$ to be an equipartition into $1/\varepsilon$ rows and show that this gets us close enough to achieve the desired result.

With $\Pi$ as described, Optimize XAxis provides in time $O(k\,T(E)/\varepsilon)$ a partition $P_0$ into at most $k$ rows such that $I(X, Y|P_0)$ is maximized, subject to $P_0 \subseteq \Pi$, to within an additive error of $E$. To prove the claim then, we must show that the loss we incur by restricting to sub-partitions of $\Pi$ costs us at most $O(k\varepsilon\log(1/(k\varepsilon)))$. In other words, we must show that
\[ I(X, Y|P^*) - I(X, Y|P_0) \le O(k\varepsilon\log(1/(k\varepsilon))) \]
where $P^*$ is an optimal partition into rows. Note that we have omitted the absolute value above, since by the optimality of $P^*$, $I(X, Y|P^*) \ge I(X, Y|P_0)$ always.

We prove the desired bound by showing that there exists some $P' \subseteq \Pi$ such that the mutual information of $(X, Y|P')$ is $O(k\varepsilon\log(1/(k\varepsilon)))$-close to that achieved with $(X, Y|P^*)$. Since $P' \subseteq \Pi$ gives us that $I(X, Y|P_0) \ge I(X, Y|P')$, we may then conclude that $I(X, Y|P^*) - I(X, Y|P_0)$ is at most $O(k\varepsilon\log(1/(k\varepsilon)))$.

We construct $P'$ by simply replacing every horizontal line in $P^*$ with a horizontal line in $\Pi$ closest to it. Since there are at most $k-1$ horizontal lines in $P^*$, and each such line is contained in a row of $\Pi$ containing $\varepsilon$ probability mass, performing this operation moves at most $(k-1)\varepsilon$ probability mass. In other words, the statistical distance between $(X, Y|P^*)$ and $(X, Y|P')$ is at most $(k-1)\varepsilon \le k\varepsilon$. Thus, for sufficiently small $\varepsilon$, Proposition 40, proven in Appendix B, can be used to show that
\[ |I(X, Y|P^*) - I(X, Y|P')| \le O\left(k\varepsilon\log\frac{1}{k\varepsilon}\right), \]
which yields the desired result.

Remark 41 We do not explore here the details of the numerical integration associated with the above theorem, since the error introduced by the numerical integration is independent of the algorithm being proposed. However, standard numerical integration methods can be used to make this error arbitrarily small with an understood complexity tradeoff (see, e.g., Stoer and Bulirsch 1980).

Appendix F. Proof of Theorem 21

Theorem Let $(X,Y)$ be jointly distributed random variables. Then $[M] = M$.

Proof Without loss of generality, we show that $[M]_{k,\infty} = M_{k,\infty}$. Fix any partition into rows $P$. If $Q_\ell$ is an equipartition into $\ell$ columns then
\[ \lim_{\ell\to\infty} I(X|Q_\ell,\ Y|P) = I(X, Y|P), \]
because the continuous mutual information equals the limit of the discrete mutual information with increasingly fine partitions. (See, e.g., Chapter 8 of Cover and Thomas 2006 for a proof of this.) This means that, letting $\mathbf{P}(k)$ denote the set of all partitions of size at most $k$, we have
\[ [M]_{k,\infty} = \max_{P \in \mathbf{P}(k)} \frac{I(X, Y|P)}{\log k} = M_{k,\infty} \]
where the second equality follows from Proposition 16.

Appendix G. Consistency of MICe in Estimating MIC∗

The consistency of MICe for estimating MIC∗ can be established using the same technical lemmas that we used to show that MIC converges to MIC∗. Specifically, we can use Lemma 37, which bounds the difference, for all $k$-by-$\ell$ grids $G$, between the sample quantity $I(D_n|G)$ and the population quantity $I((X,Y)|G)$ with high probability, where $D_n$ is a sample of size $n$ from $(X,Y)$. That lemma yields the following fact about the sample equicharacteristic matrix, whose proof is similar to that of Lemma 38.

Lemma 42 Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X,Y)$ of jointly distributed random variables.
For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,
\[ \left| \widehat{[M]}(D_n)_{k,\ell} - [M](X,Y)_{k,\ell} \right| \le O\left(\frac{1}{n^a}\right) \]
holds for all $k\ell \le B(n)$ with probability $P(n) = 1 - o(1)$, where $\widehat{[M]}(D_n)_{k,\ell}$ is the $(k,\ell)$-th entry of the sample equicharacteristic matrix and $[M](X,Y)_{k,\ell}$ is the $(k,\ell)$-th entry of the population equicharacteristic matrix of $(X,Y)$.

In the case of MIC, we proceeded to apply abstract continuity considerations to obtain our consistency theorem (Theorem 6) from a result analogous to the above lemma. A similar argument shows us that, in the case of the equicharacteristic matrix as well, we can estimate a large class of functions of the matrix in the same way. This is stated formally in the theorem below. As before, we let $\mathcal{M}^\infty$ be the space of infinite matrices equipped with the supremum norm, and given a matrix $A$ the projection $r_i$ zeros out all the entries $A_{k,\ell}$ for which $k\ell > i$.

Theorem Let $f : \mathcal{M}^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every random variable $(X,Y)$, we have
\[ f\left(r_{B(n)}\left(\widehat{[M]}(D_n)\right)\right) \to f([M](X,Y)) \text{ in probability} \]
where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Appendix H. The Equichar Clump Algorithm

In Theorem 28, we sketched an algorithm called Equichar Clump for approximating the sample equicharacteristic matrix that is more efficient than the naive computation. In this appendix, we describe the algorithm in detail, bound its runtime, and show that it indeed yields a consistent estimator of MIC∗ from finite samples as well as a consistent independence test when used to compute the total information coefficient. We then present some empirical results characterizing the sensitivity of the algorithm to its speed-versus-optimality parameter $c$.

The results in this section can be summarized as follows: let $(X,Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X,Y)$. For every $c \ge 1$, there exists a matrix $\{\hat{M}\}_c(D_n)$ such that

1. There exists an algorithm Equichar Clump for computing $r_B(\{\hat{M}\}_c(D_n))$ in time $O(n + B^{5/2})$, which equals $O(n + n^{5(1-\varepsilon)/2})$ when $B(n) = O(n^{1-\varepsilon})$.

2. The function $\mathrm{MIC}_{e,B}(\cdot) = \max_{k\ell \le B(n)} \{\hat{M}\}_c(\cdot)_{k,\ell}$ is a consistent estimator of MIC∗ provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

3. The function $\widetilde{\mathrm{TIC}}_{e,B}(\cdot) = \sum_{k\ell \le B(n)} \{\hat{M}\}_c(\cdot)_{k,\ell}$ yields a consistent right-tailed test of independence provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We will prove these results in order.

H.1 Algorithm Description and Analysis of Runtime

We begin by describing the algorithm and bounding its runtime simultaneously. As in the proof of Theorem 27, we bound the runtime required to approximately compute only the $(k,\ell)$-th entries of $\{\hat{M}\}_c(D_n)$ satisfying $k \le \ell$ and $k\ell \le B$. To do this, we analyze two portions of $\{\hat{M}\}_c(D_n)$ separately: we first consider the case $\ell \ge \sqrt{B}$, in which we must compute the entries corresponding to all the pairs $\{(2,\ell), \dots, (B/\ell, \ell)\}$. We then consider $\ell < \sqrt{B}$, in which case we need only compute the entries $\{(2,\ell), \dots, (\ell,\ell)\}$ since the additional pairs would all have $k > \ell$.

For the case of $\ell \ge \sqrt{B}$, as in the previous theorem we can simultaneously compute using Optimize XAxis the entries corresponding to all the pairs $\{(2,\ell), \dots, (B/\ell, \ell)\}$ in time $O(|\Pi|^2(B/\ell)\ell) = O(|\Pi|^2 B)$, which equals $O(c^2B^3/\ell^2)$ when we set $\Pi$ to be an equipartition of size $cB/\ell$.
Doing this for $\ell = \sqrt{B}, \dots, B/2$ gives a contribution of the following order to the runtime:
\[ \sum_{\ell=\sqrt{B}}^{B/2} O\left(\frac{c^2B^3}{\ell^2}\right) = O(c^2B^3) \cdot O\left(\frac{1}{\sqrt{B}}\right) = O(c^2B^{5/2}). \]
For the case of $\ell < \sqrt{B}$, we can simultaneously compute using Optimize XAxis the entries corresponding to all the pairs $\{(2,\ell), \dots, (\ell,\ell)\}$ in time $O(|\Pi|^2\ell^2)$, which equals $O(c^2\ell^4) \le O(c^2B^2)$ when we set $\Pi$ to be an equipartition of size $c\ell$. Summing over the $O(\sqrt{B})$ possible values of $\ell$ with $\ell < \sqrt{B}$ gives an upper bound of $O(c^2B^{5/2})$.

H.2 Consistency

Let $(X,Y)$ be a pair of jointly distributed random variables. For a sample $D_n$ of size $n$ from the distribution of $(X,Y)$ and a speed-versus-optimality parameter $c \ge 1$, let $\{\hat{M}\}_c(D_n)$ denote the matrix computed by Equichar Clump. (Notice the use of curly braces to differentiate this from the sample equicharacteristic matrix $\widehat{[M]}$.) We show here that $\max_{k\ell \le B(n)} \{\hat{M}\}_c(D_n)_{k,\ell}$ is a consistent estimator of $\mathrm{MIC}_*(X,Y)$, and correspondingly that $\sum_{k\ell \le B(n)} \{\hat{M}\}_c(D_n)_{k,\ell}$ yields a consistent independence test.

The key to both consistency results is that, though in calculating the $(k,\ell)$-th entry of $\{\hat{M}\}_c(D_n)$ the algorithm only searches for optimal partitions that are sub-partitions of some equipartition, the size of the equipartition used always grows as $n$, $k$, and $\ell$ grow large. Therefore, in the limit this additional restriction does not hinder the optimization. We present this argument by introducing a population object called the clumped equicharacteristic matrix. We observe that this matrix is the limit of the Equichar Clump procedure as sample size grows, and then show that the supremum and partial sums of this matrix have the necessary properties.

Definition 43 Let $(X,Y)$ be jointly distributed random variables and fix some $c \ge 1$. Let
\[ I_{\{c\}}((X,Y), k, \ell) = \max_G I((X,Y)|G) \]
For every B(n) > ω(1), we have that {c M}c(Dn)k,ℓ {M}c(X, Y )k,ℓ in probability as n grows, where {c M}c(Dn) denotes the matrix computed by the Equichar Clump algorithm with parameter c on the sample Dn. H.2.1 Consistency for Estimating MIC The consistency of {c M}c(Dn) for estimating MIC follows from the following property of the clumped equicharacteristic matrix {M}c, for which we state a proof sketch. Proposition 46 Let (X, Y ) be a pair of jointly distributed random variables. Then we have sup{M}c(X, Y ) = MIC (X, Y ). Reshef, Reshef, Finucane, Sabeti, and Mitzenmacher Proof (Sketch) Let {M}c = {M}c(X, Y ), and let M = M(X, Y ) be the characteristic matrix. Fix k, and consider the limit {M}c k,ℓas ℓgrows. The grid chosen for the k, ℓ-th entry when ℓ> k will contain an equipartition Pℓof size ℓon the x-axis, and a partition Qℓ of size k on the y-axis that is optimal subject to the restriction that Qℓbe contained in an equipartition of size cℓ. As ℓgrows large, the equipartition Pℓon the first axis will become finer and finer until in the limit X|Pℓ X. And the partition Qℓwill be chosen from a finer and finer equipartition, so that in the limit it approaches an unconditionally optimal partition Q of size k. The convergence of Qℓto the optimal partition Q of size k can be shown to be uniform using Proposition 40. This implies that {M}c k, = lim ℓ {M}c k,ℓ= max P P(k) I(X, Y |P ) where P(k) denotes the set of all partitions of size at most k. Therefore, the boundary {M}c of {M}c equals the boundary M of M. Since MIC (X, Y ) = sup M (Theorem 15), this implies that sup{M}c sup {M}c = sup M = MIC (X, Y ). On the other hand, {M}c M element-wise since the optimization for the k, ℓ-th entry of {M}c is performed over a subset of the grids searched for the k, ℓ-th entry of M. This means that sup{M}c sup M = MIC (X, Y ). This fact, together with the pointwise convergence of {c M}c(Dn) to {M}c, suffices to establish the consistency we seek via standard continuity arguments, which we give in the abstract lemma below. The lemma applies to a double-indexed sequence indexed by i and j; in our argument, the index i corresponds to position in the equicharacteristic matrix, and the index j corresponds to sample size. The sequence A corresponds to the output of the Equichar Clump algorithm, the sequence a corresponds to the clumped equicharacteristic matrix, and the sequence B corresponds to the sample equicharacteristic matrix. Lemma 47 Let {Aij} i,j=1 and {Bij} i,j=1 be sequences of random variables, and let {ai} i=1 be a non-stochastic sequence. Assume that the following conditions hold. 1. Aij Bij almost surely 2. For every i, Aij ai in probability 3. B j = maxi j Bij satisfies B j sup{ai} in probability Then A j = maxi j Aij converges in probability to sup{ai} as well. Proof Let a = sup{ai}. We give the proof for the case that a < . However, it is easily adapted to the infinite case. We must show that for every ε > 0 and every 0 < p 1, there exists some N such that P(|A j a| < ε) > p for all j N. By the definition of a, we know that there exists some k such that |ak a| < ε/2. Also, by the convergence of Akj to ak, there exists some m such that P(|Akj ak| < ε/2) > 1 p for all j m. Thus, with probability at least 1 p, we have |Akj a| |Akj ak| + |ak a| Measuring Dependence Powerfully and Equitably for all j m. Next, we observe that since A j Akj for j k, the above inequality implies that for j max{m, k} we have P(A j > a ε) > 1 p. 
It remains only to show that A j doesn t get too large, but this follows from the fact that A j B j and B j a in probability. Specifically, we are guaranteed some N max{m, k} such that P(B j < a + ε) > 1 p for j N. Since B j < a+ε implies A j < a+ε, we have that P(|A j a| < ε) > 1 p for j N, as desired. Proposition 48 The function MICe,B( ) = max kℓ B(n){c M}c( )k,ℓ is a consistent estimator of MIC provided ω(1) < B(n) O(n1 ε) for some ε > 0, where {c M}c( ) is the output of the the Equichar Clump algorithm. Proof Let (X, Y ) be a pair of jointly distributed random variables, and let Dn be a sample of size n from the distribution of (X, Y ). Let {(ki, ℓi)} i=1 Z+ Z+ be a sequence of coordinates with the property that for every number B there exists an index q(B) such that {(ki, ℓi) : i q(B)} = {(k, ℓ) : kℓ B} . We define Bij = d [M](Dj)ki,ℓi, i.e., Bij is the ki, ℓi-th entry of the sample characteristic matrix evaluated on a sample of size j. We analogously define Aij = {c M}c(Dj)ki,ℓi, and we define ai = {M}c(X, Y )ki,ℓi. We observe that by Proposition 46, sup ai = sup{M}c(X, Y ) = MIC . It is straightforward to see that Aij Bij. Additionally, Lemma 45 shows that Aij ai in probability, and Corollary 26, which states that MICe is a consistent estimator of MIC , shows that B j = maxi j Bij MIC (X, Y ). In the notation of the lemma, it therefore follows that A j = maxi j Aij converges in probability to MIC (X, Y ) as well. But this means that the sub-sequence A q(B(n)) = max i q(B(n)){c M}c(Dq(B(n)))ki,ℓi = max kℓ B(n){c M}c(Dq(B(n)))k,ℓ converges in probability to MIC (X, Y ), which implies the result since the sequence A j is monotone. H.2.2 Consistency for Total Information Coefficient Similarly to the consistency argument for MIC , we begin by exhibiting the relevant property of the population clumped equicharacteristic matrix. Proposition 49 Let (X, Y ) be a pair of jointly distributed random variables. If X and Y are statistically independent, then {M}c(X, Y ) 0. If not, then there exists some a > 0 and some integer ℓ0 2 such that {M}c(X, Y )k,ℓ a log min{k, ℓ} either for all k ℓ ℓ0, or for all ℓ k ℓ0. Reshef, Reshef, Finucane, Sabeti, and Mitzenmacher Proof (Sketch) Let {M}c = {c M}c(X, Y ). Under independence, every entry of {M}c is zero since I((X, Y )|G) = 0 for any grid G. For the case of dependence, the argument is identical to that given in the proof of Proposition 32. Specifically, it can be shown that there exists some index ℓ0, taken without loss of generality to be a column index, and some r > 0 such that all but finitely many of the entries in the ℓ0-column are at least r. It can then be shown that for large k, the entries (k, ℓ0), (k, ℓ0 + 1), . . . , (k, k) have non-decreasing values of I[c ]. This establishes the claim for a = r log ℓ0. We now show that the above result, together with the uniform convergence of {c M}c(Dn) to {M}c(X, Y ), implies the consistency we seek. Proposition 50 The function ] TICe,B( ) = X kℓ B(n) {c M}c( )k,ℓ yields a consistent right-tailed test of independence provided ω(1) < B(n) O(n1 ε) for some ε > 0, where {c M}c( ) is the output of the the Equichar Clump algorithm. Proof Let (X, Y ) a pair of jointly distributed random variables, and let Dn be a sample of size n from the distribution of (X, Y ). It suffices to show consistency for any deterministic monotonic function of the statistic in question. We therefore choose to analyze ] TICe,B(Dn) log(B(n))/B(n). 
For the null hypothesis in which X and Y are independent, we observe that since {c M}c(Dn) d [M](Dn) element-wise, 0 ] TICe,B(Dn) TICe,B(Dn) as well. Moreover, the argument given in Appendix K, which shows that TICe,B(Dn)/B(n) converges to 0 in probability under the null hypothesis, can be adapted to show that TICe,B(Dn) log(B(n))/B(n) 0 as well. Thus, ] TICe,B(Dn) log(B(n))/B(n) converges to zero in probability, as required. For the case that X and Y are dependent, the proof is analogous to the argument given in Appendix K for TICe. The only difference is that Lemma 44, which guarantees the uniform convergence of {c M}c(Dn) to {M}c(X, Y ), applies only to the k, ℓ-th entries for which k, ℓ p B(n), rather than the entries over which we are summing, which are those for which kℓ B(n). However, since we require only a lower bound on ] TICe,B(Dn), we may neglect these entries because ] TICe,B(Dn) = X kℓ B(n) {c M}c(Dn)k,ℓ X B(n) {c M}c(Dn)k,ℓ. It can then be shown, following the argument from Appendix K, that there exists some a > 0 depending only on B such that, with probability 1 o(1), B(n) {c M}c(X, Y )k,ℓ ] TICe,B(Dn) O #n log B(n) = O log B(n) Measuring Dependence Powerfully and Equitably where #n = B(n) represents the number of pairs (k, ℓ) such that k, ℓ p B(n). To obtain the result, we note that this means that B(n) ] TICe,B(Dn) log B(n) B(n) {c M}c(X, Y )k,ℓ O log B(n) and then invoke Proposition 49, which implies that for large n B(n) {M}c(X, Y ) Ω B(n) log B(n) H.3 Empirical Characterization of the Performance of Equichar Clump The Equichar Clump algorithm has a parameter c that controls the fineness of the equipartition whose sub-partitions are searched over by the algorithm. To gain an empirical understanding of the effect of c on performance, we computed MICe on the set of relationships described in Section 4.4 using Equichar Clump with different values of c. For each relationship, we compared the average MICe across all 500 independent samples from that relationship with different values of c. We performed this analysis at sample sizes of n = 250 (Figure H1), n = 500 (Figure H2), and 5, 000 (Figure H3). We summarize our findings as follows. At low (n = 250) and medium (n = 500) sample sizes, using c = 1 introduces a downward bias for more complex relationships when B(n) = n0.6 is used but not when B(n) = n0.8 is used. This makes sense since the low sample size and low setting of B(n) mean that the algorithm is searching over grids with relatively few cells, and so setting c = 1 hinders its ability to find good grids in this limited search space. This bias is almost entirely alleviated by setting c 2. At high sample size (n = 5, 000), this effect is still observable but much reduced. This makes sense since when n is large, B(n) is large as well, and so the number of cells allowed in the grids being searched over is already large regardless of the exponent α used in B(n) = nα. Thus, there is less need for the robustness provided by searching for an optimal grid. Appendix I. Equitability and Power Analyses from Reshef et al. (2015a) Figure I1 contains a representative equitability analysis from Reshef et al. (2015a). Figure I2 contains power curves from Reshef et al. (2015a) for a large set of leading methods. Reshef, Reshef, Finucane, Sabeti, and Mitzenmacher Figure H1: The effect of the parameter c on the performance of Equichar Clump, at n = 250. See Section H.3 for details. 
Figure H2: The effect of the parameter $c$ on the performance of EquicharClump, at $n = 500$. See Section H.3 for details.

Figure H3: The effect of the parameter $c$ on the performance of EquicharClump, at $n = 5{,}000$. See Section H.3 for details.

Figure I1: (Reproduced from Reshef et al., 2015a.) The equitability of measures of dependence on a set of noisy functional relationships. [Narrower is more equitable.] The plots were constructed as in Figure 3. Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation.

Figure I2: (Reproduced from Reshef et al., 2015a.) Power of independence testing using several leading measures of dependence, on the relationships chosen by Simon and Tibshirani (2012), at 50 noise levels with linearly increasing magnitude for each relationship and $n = 500$. To enable comparison of power regimes across relationships, the x-axis of each plot lists $R^2$ rather than noise magnitude.

Appendix J. Equitability Analysis of Randomly Chosen Functions with Additional Noise Model

Figure J1 contains a version of the main text Figure 4, but where noise has been added only to the dependent variable in each functional relationship, rather than to both the independent and dependent variables.

Figure J1: Equitability of methods examined on functions randomly drawn from a Gaussian process distribution, using a different noise model. This figure is identical to Figure 4, but with noise added only to the dependent variable in each relationship. Each method is assessed as in Figure 4, with a red interval indicating the widest range of $R^2$ values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. Sample relationships for each Gaussian process bandwidth are shown in the top right with matching colors.

Appendix K. Consistency of Independence Testing Based on TICe

Here we prove Propositions 32 and 33 and then use those propositions to prove Theorem 34, which shows that $\mathrm{TIC}_e$ can be used for independence testing. (A sketch of how such a test might be carried out in practice follows; the proofs begin in Section K.1.)
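Concretely, a right-tailed test requires a null distribution for the statistic. One standard way to obtain it in practice, shown in the minimal sketch below, is by permutation; the estimator `tic_e(x, y)` is assumed rather than implemented here, and the permutation device is a generic approach, not a procedure prescribed by this paper.

```python
import numpy as np

def tic_independence_test(x, y, tic_e, n_perms=1000, seed=0):
    """Right-tailed permutation test of independence based on a supplied
    statistic tic_e: large observed values are evidence against the null.
    Permuting y breaks any dependence while preserving the marginals."""
    rng = np.random.default_rng(seed)
    observed = tic_e(x, y)
    null = [tic_e(x, rng.permutation(y)) for _ in range(n_perms)]
    # Add-one correction keeps the p-value valid and strictly positive.
    p_value = (1 + sum(v >= observed for v in null)) / (1 + n_perms)
    return observed, p_value
```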
K.1 Proof of Proposition 32

Proposition Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $M(X, Y) \equiv M_{[]}(X, Y) \equiv 0$. If not, then there exists some $a > 0$ and some integer $\ell_0 \ge 2$ such that $M(X, Y)_{k,\ell},\, M_{[]}(X, Y)_{k,\ell} \ge a / \log \min\{k, \ell\}$ either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.

Proof We give the proof only for $M_{[]} = M_{[]}(X, Y)$, with the understanding that all parts of the argument are either identical or similar for $M(X, Y)$.

When $X$ and $Y$ are independent, then for any grid $G$, $(X, Y)|_G$ exhibits independence as well. Therefore $I((X, Y)|_G) = 0$ for all grids $G$, and so every entry of $M_{[]}$, being a supremum over such quantities, is 0.

For the case that $X$ and $Y$ are dependent, our strategy is to first find, without loss of generality, a column of $M_{[]}$ almost all of whose values are bounded away from zero, and then argue that this suffices.

The dependence of $X$ and $Y$ implies that $\mathrm{MIC}_*(X, Y) > 0$. By Corollary 22, which states that $\sup M_{[]} = \mathrm{MIC}_*(X, Y)$, we therefore know that there is at least one non-zero element of the boundary of $M_{[]}$, as defined in Definition 14. Without loss of generality, suppose that this element is $(M_{[]})_{\infty,\ell_0} = \lim_{k \to \infty} (M_{[]})_{k,\ell_0}$. The fact that this limit is strictly positive implies that there exists some $k_0 \ge \ell_0$ and some $r > 0$ such that $(M_{[]})_{k,\ell_0} \ge r$ for all $k \ge k_0$. That is, all but finitely many of the entries in the $\ell_0$-th column of $M_{[]}$ are at least $r$.

We now show that the existence of such a column suffices to prove the claim. Fix some $k > k_0$ and note that this implies that $k > \ell_0$. We argue that for all $\ell \in \{\ell_0, \ldots, k\}$, the desired condition holds.

Since $k > \ell_0$, the term $I_{[]}((X, Y), k, \ell_0)$ in the definition of $(M_{[]})_{k,\ell_0}$ is a maximization over grids that have an equipartition of size $k$ on one axis and an optimal partition of size $\ell_0$ on the other. Since we allow empty rows/columns in the maximization, substituting any $\ell$ satisfying $\ell_0 \le \ell \le k$ therefore does not constrain the maximization in any way and so it cannot decrease $I_{[]}$. In other words, for $\ell$ satisfying $\ell_0 \le \ell \le k$, we have $I_{[]}((X, Y), k, \ell) \ge I_{[]}((X, Y), k, \ell_0)$.

Since $k \ge \ell, \ell_0$, the normalizations in the definitions of $(M_{[]})_{k,\ell}$ and $(M_{[]})_{k,\ell_0}$ are $\log \ell$ and $\log \ell_0$ respectively. Therefore, we have that
$$(M_{[]})_{k,\ell} \ge (M_{[]})_{k,\ell_0} \frac{\log \ell_0}{\log \ell} \ge \frac{r \log \ell_0}{\log \ell},$$
where the last inequality is because $k > k_0$. Setting $a = r \log \ell_0$ then gives the result.

K.2 Proof of Proposition 33

Proposition Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $S_B(M(X, Y)) = S_B(M_{[]}(X, Y)) = 0$ for all $B > 0$. If not, then $S_B(M(X, Y))$ and $S_B(M_{[]}(X, Y))$ are both $\Omega(B \log \log B)$.

Proof We give the argument for $M = M(X, Y)$ only, but the argument holds as stated for $M_{[]}(X, Y)$ as well. The result follows from the guarantee given by Proposition 32 above. In the case of independence, the proposition tells us that $M \equiv 0$, which immediately gives that $S_B(M) = 0$ for all $B > 0$.

For the case of dependence, the proposition implies that there is some $a > 0$ and some integer $\ell_0 \ge 2$ such that, without loss of generality, $M_{k,\ell} \ge a/\log \ell$ for all $k \ge \ell \ge \ell_0$. We convert this into a lower bound on $S_B(M)$. The key is to write the sum one column at a time, counting how many entries in each column both satisfy $k \ge \ell \ge \ell_0$ and $k\ell \le B$. For any $\ell$ satisfying $\ell_0 \le \ell \le \sqrt{B}$, the entries $(\ell, \ell), \ldots, (\lfloor B/\ell \rfloor, \ell)$ meet this criterion, and there are $\lfloor B/\ell \rfloor - (\ell - 1)$ of them. Moreover, since the guarantee of Proposition 32 tells us that all of these entries are at least $a/\log \ell$, we can lower-bound $S_B(M)$ as follows:
$$S_B(M) \ge \sum_{\ell=\ell_0}^{\sqrt{B}} \left(\frac{B}{\ell} - (\ell - 1)\right) \frac{a}{\log \ell} \ge aB \sum_{\ell=\ell_0}^{\sqrt{B}} \frac{1}{\ell \log \ell} - O(B) = \Omega(B \log \log B),$$
where the second-to-last inequality is because $(\ell - 1)/\log \ell \le \ell$, and the last equality is because $\sum_{i=i_0}^{n} 1/(i \log i)$ grows like $\log \log n$.
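The counting in the last display is easy to check numerically. The following short script, a sanity check written for this presentation rather than part of the proof, sums $a/\log \ell$ over the qualifying entries column by column and confirms that the total grows on the order of $B \log \log B$.

```python
import math

def lower_bound_sum(B, l0=2, a=1.0):
    """Sum a/log(l) over the entries (l, l), ..., (B//l, l) for each
    column l0 <= l <= sqrt(B), mirroring the proof of Proposition 33."""
    total = 0.0
    for l in range(l0, math.isqrt(B) + 1):
        count = B // l - (l - 1)   # entries with k >= l and k*l <= B
        total += max(count, 0) * a / math.log(l)
    return total

for B in (10**3, 10**4, 10**5, 10**6):
    print(B, round(lower_bound_sum(B) / (B * math.log(math.log(B))), 3))
# The printed ratios vary slowly and stay bounded away from zero,
# consistent with growth of order B log log B.
```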
K.3 Proof of Theorem 34

Theorem The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof We give the proof for $\mathrm{TIC}$ only; however, the argument holds as stated for $\mathrm{TIC}_e$ as well. Let $(X, Y)$ be jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. Let $M = M(X, Y)$ be the characteristic matrix of $(X, Y)$ and let $\hat{M}(D_n)$ be the sample characteristic matrix. It suffices to establish the result for a deterministic monotonic function of $\mathrm{TIC}_B(D_n)$. We therefore show convergence of $\mathrm{TIC}_B(D_n)/B(n)$ to zero under the null hypothesis of independence and to infinity under any alternative.

Our general strategy for doing so is to translate known bounds on our error at estimating entries of $M$ into bounds on the difference between $\mathrm{TIC}_B(D_n)/B(n) = S_{B(n)}(\hat{M}(D_n))/B(n)$ and $S_{B(n)}(M)/B(n)$. We then obtain the result by invoking Proposition 33, which implies that $S_{B(n)}(M)/B(n)$ is zero under the null hypothesis but grows without bound under the alternative.

We know from Lemma 38 (Lemma 42 for the equicharacteristic matrix) that there exists some $a > 0$ depending only on $B$ such that $|\hat{M}(D_n)_{k,\ell} - M_{k,\ell}| \le O(1/n^a)$ for all $k\ell \le B(n)$ with probability $1 - o(1)$. This means that with probability $1 - o(1)$ we have
$$\left|\frac{1}{B(n)}\mathrm{TIC}_B(D_n) - \frac{S_{B(n)}(M)}{B(n)}\right| \le O\!\left(\frac{\#_n}{B(n)\, n^a}\right),$$
where $\#_n$ is the number of pairs $(k, \ell)$ such that $k\ell \le B(n)$. It can be shown by taking the integral of $B/x$ with respect to $x$ that $\#_n = O(B(n) \log B(n))$. Therefore, the error in the above bound is at most $O(\log B(n)/n^a) = O(1/\mathrm{poly}(n))$ for our choice of $B(n)$.

We now use Proposition 33 to show that this bound gives the desired result. Under the null hypothesis of independence, the proposition says that $S_{B(n)}(M) = 0$ always, and so since $B$ is a growing function the bound implies that $\mathrm{TIC}_B(D_n)/B(n) \to 0$ in probability. Under the alternative hypothesis in which $(X, Y)$ exhibit a dependence, the proposition implies that $S_{B(n)}(M)/B(n)$ is $\omega(1)$. Since $B$ is a growing function of $n$, this means that for any $r > 0$, the probability that $\mathrm{TIC}_B(D_n)/B(n) > r$ goes to 1 as $n$ grows. In other words, $\mathrm{TIC}_B(D_n)/B(n) \to \infty$ in probability.

Appendix L. Information-Theoretic Lemmas

Lemma 51 Let $\Pi$ and $\Psi$ be random variables distributed over a discrete set of states $\Gamma$, and let $(\pi_i)$ and $(\psi_i)$ be their respective distributions. Let $P = f(\Pi)$ and $Q = f(\Psi)$ for some function $f$ whose image is of size $B$. Define
$$\varepsilon_i = \frac{\psi_i - \pi_i}{\pi_i}.$$
Then for every $0 < a < 1$ there exists some $A > 0$ such that
$$|H(Q) - H(P)| \le (\log B)\, A \sum_i |\varepsilon_i|$$
when $|\varepsilon_i| \le 1 - a$ for all $i$.

Proof We prove the claim with entropy measured in nats. A rescaling then gives the general result. Let $(p_i)$ and $(q_i)$ be the distributions of $P$ and $Q$ respectively, and define $e_i = (q_i - p_i)/p_i$ analogously to $\varepsilon_i$. Before proceeding, we observe that
$$e_i = \frac{q_i - p_i}{p_i} = \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\, \varepsilon_j,$$
so that each $e_i$ is a weighted average of the $\varepsilon_j$ with $j \in f^{-1}(i)$.

We now proceed with the argument. We have from Roulston (1999) that
$$|H(Q) - H(P)| \le \left|\sum_i e_i p_i (1 + \ln p_i)\right| + \frac{1}{2} \sum_i e_i^2 p_i + \sum_i O(e_i^3) \tag{3}$$
$$\le \left|\sum_i e_i p_i \ln p_i\right| + \left|\sum_i e_i p_i\right| + \frac{1}{2} \sum_i e_i^2 p_i + \sum_i O(e_i^3) \tag{4}$$
$$= \left|\sum_i e_i p_i \ln p_i\right| + \frac{1}{2} \sum_i e_i^2 p_i + \sum_i O(e_i^3), \tag{5}$$
where the final equality is because $\sum_i e_i p_i = \sum_i q_i - \sum_i p_i = 0$. We proceed by bounding each of the terms in Equation 5 separately.

To bound the first term, we write
$$\left|\sum_i e_i p_i \ln p_i\right| \le \sum_i |e_i|\, p_i\, |\ln p_i|.$$
We then note that $-\sum_i p_i \ln p_i \le \ln B$, and since each of the summands has the same sign this means that $-p_i \ln p_i \le \ln B$ for each $i$. We also observe that $|e_i| \le \sum_{j \in f^{-1}(i)} |\varepsilon_j|$, since each $\pi_j/p_i \le 1$. Together, these two facts give
$$\sum_i |e_i|\, p_i\, |\ln p_i| \le (\ln B) \sum_i |e_i| \le (\ln B) \sum_i |\varepsilon_i|.$$
The second inequality is because each $e_i$ is a weighted average of a set of $\varepsilon_i$ and each $\varepsilon_i$ enters into the expression of exactly one $e_i$.

To bound the second term, we use the fact that $p_i \le 1$ for all $i$, and so
$$\sum_i e_i^2 p_i \le \sum_i e_i^2.$$
We then write
$$\sum_i e_i^2 = \sum_i \left(\sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\, \varepsilon_j\right)^{2} \le \sum_i \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\, \varepsilon_j^2 \le \sum_i \varepsilon_i^2,$$
where the second line is a consequence of the convexity of $f(x) = x^2$ and the third line is because the sets $f^{-1}(i)$ partition $\Gamma$. Since $|\varepsilon_i| \le 1 - a < 1$, we also have $\varepsilon_i^2 \le |\varepsilon_i|$.

To bound the third term, we write
$$\sum_i O(|e_i|^3) \le \sum_i O\!\left(\sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i}\, |\varepsilon_j|^3\right)$$
and then proceed as we did with the second term, using the fact that $f(x) = x^3$ is convex for $x \ge 0$. This gives
$$\sum_i O(|e_i|^3) \le \sum_i O(|\varepsilon_i|^3) = \sum_i O(|\varepsilon_i|),$$
completing the proof.
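As a quick illustration, the script below empirically checks an instance of the bound in Lemma 51 with $A = 1$, which suffices for small perturbations. The distribution $\pi$, the coarsening $f$, and the perturbation are randomly generated for this sketch and are not from the paper; note also that the renormalization step slightly distorts the nominal $\varepsilon_i$, which is harmless here since the check has ample slack.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability states."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

n, B = 100, 10
pi = rng.dirichlet(np.ones(n))            # distribution of Pi over Gamma
f = rng.integers(0, B, size=n)            # coarsening f: Gamma -> B states
eps = 0.05 * rng.uniform(-1, 1, size=n)   # small relative perturbation
psi = pi * (1 + eps)
psi /= psi.sum()                          # renormalize to a distribution

P = np.bincount(f, weights=pi, minlength=B)   # distribution of f(Pi)
Q = np.bincount(f, weights=psi, minlength=B)  # distribution of f(Psi)
lhs = abs(entropy(Q) - entropy(P))
rhs = np.log(B) * np.sum(np.abs(eps))
print(lhs <= rhs)   # expect True, with lhs much smaller than rhs
```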
Lemma 52 Let $\{w_i\} \subseteq [0, 1]$ be a set of size $n$ with $\sum_i w_i \le 1$, and let $\{u_i\}$ be a set of $n$ non-negative numbers satisfying $\sum_i u_i = a$ and $u_i \le w_i$. Then
$$\sum_i w_i H_b\!\left(\frac{u_i}{w_i}\right) \le H_b(a),$$
where $H_b$ is the binary entropy function.

Proof Consider the random variable $X$ taking values in $\{0, \ldots, n\}$ that equals zero with probability $1 - \sum_i w_i$ and equals $i$ with probability $w_i$ for $0 < i \le n$. Define the random variable $Y$ taking values in $\{0, 1\}$ by
$$P(Y = 0 \mid X = i) = \begin{cases} 0 & i = 0 \\ u_i/w_i & 0 < i \le n. \end{cases}$$
The function we wish to bound equals $H(Y \mid X)$, and $H(Y \mid X) \le H(Y)$. We therefore observe that $H(Y) = H_b(P(Y = 0))$. The result follows from the observation that
$$P(Y = 0) = \sum_i P(X = i)\, \frac{u_i}{w_i} = \sum_i u_i = a.$$

Lemma 53 Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x \ge 0$ be such that $\sum_x \alpha_x = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 + \delta)$. We have
$$|H(X') - H(X)| \le H_b(\delta) + \delta \log k,$$
where $H_b$ is the binary entropy function.

Proof Define a new random variable $Z$ by
$$P(Z = 0 \mid X' = x) = \frac{p_x}{p_x + \alpha_x}, \qquad P(Z = 1 \mid X' = x) = \frac{\alpha_x}{p_x + \alpha_x}.$$
We will use the fact that $H(X' \mid Z = 0) = H(X)$ to obtain the required bound.

To upper bound $H(X') - H(X)$, we write
$$\begin{aligned} H(X') - H(X) &\le H(X', Z) - H(X) \\ &= H(Z) + P(Z = 0)\, H(X' \mid Z = 0) + P(Z = 1)\, H(X' \mid Z = 1) - H(X) \\ &\le H_b(\delta) + (1 - \delta) H(X) + \delta H(X' \mid Z = 1) - H(X) \\ &\le H_b(\delta) - \delta H(X) + \delta \log k \\ &\le H_b(\delta) + \delta \log k, \end{aligned}$$
where in the fourth line we have used that $H(X' \mid Z = 1) \le \log k$.

To upper bound $H(X) - H(X')$, we write
$$H(X') + H(Z) \ge H(X', Z) \ge P(Z = 0)\, H(X' \mid Z = 0) \ge (1 - \delta) H(X),$$
which yields $H(X') \ge (1 - \delta) H(X) - H_b(\delta)$ since $H(Z) = H_b(\delta)$. Thus, we have
$$H(X) - H(X') \le \delta H(X) + H_b(\delta) \le \delta \log k + H_b(\delta).$$

Lemma 54 Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x \le 0$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 - \delta)$. We have
$$|H(X') - H(X)| \le H_b\!\left(\frac{\delta}{1 - \delta}\right) + \frac{\delta}{1 - \delta} \log k,$$
where $H_b$ is the binary entropy function. In particular, when $\delta \le 1/3$ we have $|H(X') - H(X)| \le H_b(2\delta) + 2\delta \log k$.

Proof We observe that we can get from $X'$ to $X$ by adding $\delta/(1 - \delta)$ probability mass and rescaling. The previous lemma then gives the result.

Lemma 55 Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$. Let $\alpha_x$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x + \alpha_x)/(1 + \sum_x \alpha_x)$. That is, $X'$ is the result of changing the probability of state $x$ by $\alpha_x$ and then re-normalizing to obtain a valid distribution. If $\delta \le 1/4$, we have
$$|H(X') - H(X)| \le 2 H_b(2\delta) + 3\delta \log k,$$
where $H_b$ is the binary entropy function.

Proof Let $\delta_+$ be the total magnitude of all the positive $\alpha_x$, and let $\delta_-$ be the total magnitude of all the negative $\alpha_x$. We first add all the mass we're going to add, and apply the first of the previous two lemmas. Then we remove all the mass we are going to remove, and apply the second of the two previous lemmas. This yields a bound of
$$\begin{aligned} H_b(\delta_+) + \delta_+ \log k + H_b\!\left(\frac{2\delta_-}{1 + \delta_+}\right) + \frac{2\delta_-}{1 + \delta_+} \log k &\le H_b(\delta_+) + \delta_+ \log k + H_b(2\delta_-) + 2\delta_- \log k \\ &\le H_b(2\delta) + \delta \log k + H_b(2\delta) + 2\delta \log k \\ &= 2 H_b(2\delta) + 3\delta \log k, \end{aligned}$$
where the first inequality is because $1 \le 1 + \delta_+ \le 1 + \delta < 2$ and $2\delta_- \le 2\delta \le 1/2$, and the second inequality is because $\delta_+ \le \delta < 2\delta \le 1/2$.
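The bounds in Lemmas 53 through 55 are easy to exercise numerically. The snippet below, a sanity check written for this presentation with randomly generated inputs, instantiates Lemma 53: it adds non-negative mass with total $\delta$, renormalizes, and verifies that the entropy changes by at most $H_b(\delta) + \delta \log k$ (in nats).

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def binary_entropy(d):
    return entropy(np.array([d, 1.0 - d]))

k, delta = 50, 0.1
p = rng.dirichlet(np.ones(k))                 # distribution of X
alpha = delta * rng.dirichlet(np.ones(k))     # alpha_x >= 0, sum = delta
q = (p + alpha) / (1.0 + delta)               # distribution of X'
bound = binary_entropy(delta) + delta * np.log(k)
print(abs(entropy(q) - entropy(p)) <= bound)  # expect True
```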
References

Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.

William S. Cleveland and Susan J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.

Thomas Cover and Joy Thomas. Elements of Information Theory. John Wiley & Sons, New York, 2006.

Imre Csiszár. Axiomatic characterizations of information measures. Entropy, 10(3):261–273, 2008.

Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

A. Adam Ding and Yi Li. Copula correlation: An equitable dependence measure and extension of Pearson's correlation. arXiv preprint arXiv:1312.7214, 2013.

Valur Emilsson, Gudmar Thorleifsson, Bin Zhang, Amy S. Leonardson, Florian Zink, Jun Zhu, Sonia Carlson, Agnar Helgason, G. Bragi Walters, Steinunn Gunnarsdottir, et al. Genetics of gene expression and its effect on disease. Nature, 452(7186):423–428, 2008.

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer, 2005.

Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, and Alex J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2007.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.

Ruth Heller, Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine. Consistent distribution-free k-sample and independence tests for univariate random variables. Journal of Machine Learning Research, 17(29):1–54, 2016.

Wassily Hoeffding. A non-parametric test of independence. The Annals of Mathematical Statistics, pages 546–557, 1948.

Bo Jiang, Chao Ye, and Jun S. Liu. Nonparametric k-sample tests via dynamic slicing. Journal of the American Statistical Association, 110(510):642–653, 2015.

Justin B. Kinney and Gurinder S. Atwal. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1309933111.

Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69, 2004.

Edward H. Linfoot. An informational measure of correlation. Information and Control, 1(1):85–89, 1957.

David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9, 2013.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

Ben Murrell, Daniel Murrell, and Hugh Murrell. R2-equitability is satisfiable. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1403623111. URL http://www.pnas.org/content/early/2014/04/29/1403623111.short.

Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

Alfréd Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3):441–451, 1959.

David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.

David N. Reshef, Yakir A. Reshef, Michael Mitzenmacher, and Pardis C. Sabeti. Cleaning up the record on the maximal information coefficient and equitability. Proceedings of the National Academy of Sciences, 2014. doi: 10.1073/pnas.1408920111. URL http://www.pnas.org/content/early/2014/08/07/1408920111.short.
David N. Reshef, Yakir A. Reshef, Pardis C. Sabeti, and Michael Mitzenmacher. An empirical study of leading measures of dependence. arXiv preprint arXiv:1505.02214, 2015a.

Yakir A. Reshef, David N. Reshef, Pardis C. Sabeti, and Michael Mitzenmacher. Equitability, interval estimation, and statistical power. arXiv preprint arXiv:1505.02212, 2015b.

Mark S. Roulston. Estimating the errors on measured entropy and mutual information. Physica D: Nonlinear Phenomena, 125(3):285–294, 1999.

Noah Simon and Robert Tibshirani. Comment on "Detecting novel associations in large data sets". Unpublished (available at http://www-stat.stanford.edu/~tibs/reshef/comment.pdf on 11 Nov. 2012), 2012.

Terry Speed. A correlation for the 21st century. Science, 334(6062):1502–1503, 2011.

Josef Stoer and Roland Bulirsch. Introduction to Numerical Analysis. Springer-Verlag, 1980.

Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.

John D. Storey and Robert Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

Gábor J. Székely and Maria L. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.