# Differentially Private Maximal Information Coefficients

John Lazarsfeld¹, Aaron Johnson², Emmanuel Adeniran¹

¹Department of Computer Science, Yale University. ²U.S. Naval Research Laboratory. Correspondence to: John Lazarsfeld. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The Maximal Information Coefficient (MIC) is a powerful statistic to identify dependencies between variables. However, it may be applied to sensitive data, and publishing it could leak private information. As a solution, we present algorithms to approximate MIC in a way that provides differential privacy. We show that the natural application of the classic Laplace mechanism yields insufficient accuracy. We therefore introduce the MICr statistic, which is a new MIC approximation that is more compatible with differential privacy. We prove MICr is a consistent estimator for MIC, and we provide two differentially private versions of it. We perform experiments on a variety of real and synthetic datasets. The results show that the private MICr statistics significantly outperform direct application of the Laplace mechanism. Moreover, experiments on real-world datasets show accuracy that is usable when the sample size is at least moderately large.

1. Introduction

The Maximal Information Coefficient (MIC) is a powerful and relatively new tool to detect correlations in data (Reshef et al., 2011; 2016). MIC uses mutual information to detect general dependencies between numeric attributes, in contrast to a more common statistic such as Pearson's correlation coefficient, which is only designed to detect linear relationships. MIC is thus particularly suited to identify novel relationships in complex data, as in that setting it is unknown which properties might be related and how. However, many datasets that are valuable for such data mining (such as medical or economic data) contain sensitive personal information. Moreover, even publishing just the statistics that result from a correlation analysis can reveal private details of the individuals comprising the data (Homer et al., 2008; Wang et al., 2009). A scientist or government analyst must therefore consider not only the effectiveness of their statistical techniques, but also whether the resulting statistics can be published in a way that protects privacy.

Differential privacy (DP) (Dwork & Roth, 2014) has emerged as the leading method for privacy-preserving data publishing. Major companies and institutions use differential privacy to publish statistics about their sensitive data (Cormode et al., 2018; Dwork, 2019). MIC appears at first to be well-suited to being published in a differentially private way. At a high level, for a pair of numeric variables, it measures the maximum mutual information over possible grids partitioning their joint range. For any given grid, changing just one data point, which is the main criterion for differential privacy, changes the distribution of points very little. This fact suggests that a differentially private MIC could be designed with high accuracy. However, MIC is difficult to compute, and instead it is suggested to approximate it using the MICe statistic (Reshef et al., 2016).
MICe is efficiently computable because it restricts its optimization to subgrids of a master mass equipartition of the dataset, that is, a grid in which each row and column contains the same number of data points. This master grid depends on the data, though, and we obtain a bound on the change in MICe from altering one data point that is significantly higher than would be expected for MIC.

We therefore investigate a new method to approximate MIC that is less sensitive to small changes in the dataset. We propose the MICr statistic, which optimizes over subgrids of a master range equipartition, that is, a grid in which the rows and columns divide the range equally. MICr is efficiently computable, and we obtain a bound on its sensitivity to input perturbations that is lower than our bound for MICe, both asymptotically and concretely. We prove that MICr converges in probability to MIC (or, more properly, to the analogous statistic defined over distributions). We then present two differentially private versions of MICr, representing fundamentally different approaches to adding DP noise. MICr-Lap uses the classic Laplace mechanism, which adds random noise to the function output (in our case, the non-private MICr value). In contrast, MICr-Geom perturbs the input, accomplished by adding a random count sampled from the geometric distribution to each grid cell, before then computing MICr. We prove that the error added to MICr by each of these mechanisms goes to zero as the size of the dataset increases.

Finally, we implement and experimentally analyze the two MICr mechanisms and the naive application of the Laplace mechanism to MICe (MICe-Lap). Our experiments use the synthetic and real datasets used to evaluate the original MIC statistics. The results show that the MICr mechanisms both significantly outperform MICe-Lap. Comparing the MICr mechanisms, we observe that MICr-Lap has lower bias but higher variance than MICr-Geom. Moreover, the experiments on real-world datasets show usable accuracy for a reasonable level of privacy when the sample size is at least moderately large. For example, with ϵ = 1 we obtain average errors as low as 0.068 and 0.016 on datasets with 337 and 4381 datapoints, respectively, where MIC itself can range from 0 to 1 (low to high correlation).

2. Preliminaries

We begin by introducing notation and concepts that are used to define MIC and its related approximations.

2.1. Datasets, Grids, and Distributions

We define a dataset D as a sequence of n points in R². To distinguish between x and y coordinates, we write D = (Dx, Dy), where Dx, Dy are sequences of n points in R. For integers k, ℓ ≥ 2, a k × ℓ grid G = (P, Q) on R² is comprised of a size-k partition P = {P1, . . . , Pk} of the y-axis, and a size-ℓ partition Q = {Q1, . . . , Qℓ} of the x-axis. We say that a k × ℓ grid G has kℓ total cells, and we let G(k, ℓ) denote the set of all grids with kℓ total cells. For k ≤ j, a single-axis partition P = {P1, . . . , Pk} is a subpartition of C = {C1, . . . , Cj} (denoted by P ⊑ C) if every Pi is the union of adjacent intervals in C. For example, let P = {[0, 0.2), [0.2, 0.5), [0.5, 0.8), [0.8, 1]} be a size-4 partition of the interval [0, 1]. Then Q = {[0, 0.5), [0.5, 1]} is a size-2 subpartition of P because each element of Q is the union of two elements of P that are adjacent on [0, 1].
We call P a size-k range equipartition of an interval I = [x0, x1] when P is a size-k partition of I and all parts of P are intervals of length (x1 − x0)/k. We call a grid G = (P, Q) a subgrid of Γ = (Cx, Cy) (denoted by G ⊑ Γ) if P and Q are subpartitions of Cx and Cy respectively. For a k × ℓ grid G = (P, Q), and for a point d = (dx, dy) ∈ R², we define the point-mapping function φ, where φ(d, G) = (i, j) iff dx ∈ Qj and dy ∈ Pi. Then for a dataset D of n points and a k × ℓ grid G, we define A_{D,G} ∈ Z^{k×ℓ} as the count matrix for D and G. For all (i, j) ∈ [k] × [ℓ], A_{D,G} has entries

a(i, j) = |{d ∈ D : φ(d, G) = (i, j)}| ,  so that Σ_{i,j} a(i, j) = n.

For a k × ℓ grid G = (P, Q), we call P (resp. Q) a mass equipartition if all row sums (resp. column sums) of A_{D,G} are equal. If G is a subgrid of a grid Γ, observe that each entry of A_{D,G} can be generated by summing adjacent entries in A_{D,Γ}, and we write ψ(A_{D,Γ}, Γ, G) = A_{D,G} to denote the matrix whose entries are generated via this process. Given A_{D,G}, the matrix P_{D,G} ∈ R^{k×ℓ} with entries p(i, j) is the normalized count matrix for D and G, where P_{D,G} = (1/n) A_{D,G}. We use p(i, ·) = Σ_j p(i, j) and p(·, j) = Σ_i p(i, j) to denote the row and column sums of P_{D,G} respectively.

Figure 1. An example dataset of n=8 points (black dots). The pink vertical lines represent a size-4 mass equipartition, and the blue vertical lines represent a size-4 range equipartition of the interval [0, 1]. The first matrix shows the count matrix for the grid comprised of the pink column partition and green row partition. The second matrix shows the count matrix for the grid comprised of the blue column partition and green row partition. Notice how the count matrix can vary depending on whether the column partition is a mass (pink) or range (blue) equipartition. The final matrix is the normalized count matrix for the blue/green grid.

Figure 1 shows an example dataset to illustrate the differences between mass and range-based equipartitions and their resulting count matrices. For a fixed D and G of size k × ℓ, we write D|G (read as "D partitioned by G") to denote the joint distribution over the space [k] × [ℓ], with a probability mass function (PMF) given by the entries of P_{D,G}. Then the discrete mutual information of D|G (equivalently, of P_{D,G}) is computed by

I(D|G) = I(P_{D,G}) = Σ_{i,j} p(i, j) log₂ ( p(i, j) / (p(i, ·) p(·, j)) ) .

We then define the normalized I*(D|G) by

I*(D|G) := I(D|G) / log₂ min{k, ℓ} ,

which ensures I*(·) is always in the range [0, 1]. For a jointly distributed pair of random variables Π = (X, Y) and a k × ℓ grid G, we define Π|G and P_{Π,G} similarly, where P_{Π,G} has entries p(i, j) = Pr(X ∈ Qj, Y ∈ Pi). So Π|G is a discrete probability distribution over [k] × [ℓ], and we compute I*(Π|G) as above.

2.2. The MIC, MIC*, and MICe Statistics

Given a dataset D, the MIC statistic measures correlation by finding the grid G where I*(D|G) is maximized. If the variables represented by D have some relationship, MIC will identify the grid containing the intervals where each variable gives the most information about the other. For any D with n points, it is easy to see that (assuming no points with the same x or y value) there exists an n × n grid G that separates each point into its own cell, which means I*(D|G) = 1. On the other hand, grids with too few total cells may not be able to capture more complex relationships.
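To make the preceding definitions concrete before turning to MIC's formal definition, here is a minimal Python sketch (all function names are our own, not from the paper's released code) that builds a count matrix for explicit bin edges and evaluates I* from it:

```python
import numpy as np

def count_matrix(xs, ys, x_edges, y_edges):
    # A_{D,G}: entry (i, j) counts the points of D falling in row cell i
    # (a y-interval) and column cell j (an x-interval).
    A, _, _ = np.histogram2d(ys, xs, bins=[y_edges, x_edges])
    return A.astype(int)

def normalized_mi(A):
    # I*(D|G) = I(P_{D,G}) / log2(min(k, l)), where P = (1/n) A.
    P = A / A.sum()
    r = P.sum(axis=1, keepdims=True)  # row sums p(i, .)
    c = P.sum(axis=0, keepdims=True)  # column sums p(., j)
    nz = P > 0
    I = (P[nz] * np.log2(P[nz] / (r @ c)[nz])).sum()
    return I / np.log2(min(A.shape))

# Example: 8 points on [0,1]^2 and a 2 x 4 grid given by explicit edges.
rng = np.random.default_rng(0)
xs = rng.random(8)
ys = np.clip(xs + 0.05 * rng.standard_normal(8), 0, 1)
A = count_matrix(xs, ys, np.linspace(0, 1, 5), np.linspace(0, 1, 3))
print(A)
print(normalized_mi(A))
```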
To negotiate this tradeoff, the MIC statistic requires a maximum grid size parameter B := B(n), where only grids with at most B(n) cells are considered, but where B(n) is expected to grow with n. Defined formally:

Definition 2.1 (MIC statistic (Reshef et al., 2011)). For a dataset D of size n and B := B(n), let M^G_D denote its characteristic matrix with (k, ℓ) entries max_{G ∈ G(k,ℓ)} I*(D|G). Then

MIC(D, B) = max_{k,ℓ: kℓ ≤ B(n)} (M^G_D)_{k,ℓ} .

When the dataset D can be modeled as a sample of points drawn from a joint distribution Π, then MIC can be viewed as an estimator of the analogous MIC* statistic defined for the distributional setting:

Definition 2.2 (MIC* statistic (Reshef et al., 2016)). For a jointly-distributed pair of random variables Π = (X, Y), let M^G_Π denote its characteristic matrix with (k, ℓ) entries max_{G ∈ G(k,ℓ)} I*(Π|G). Then MIC*(Π) = sup_{k,ℓ} (M^G_Π)_{k,ℓ}.

When B(n) = O(n^α) for some α ∈ (0, 0.5), MIC is also a consistent estimator of MIC*, meaning that for a dataset Dn of n points drawn i.i.d. from Π, MIC(Dn, B) converges in probability to MIC*(Π) as n → ∞ (Reshef et al., 2016; Lazarsfeld & Johnson, 2021). However, for every k, ℓ with kℓ ≤ B(n), MIC is a maximization over the set G(k, ℓ) of all grids with kℓ cells, which for datasets of size n makes computing MIC infeasible in practice.

To this end, Reshef et al. (2016) introduced an efficiently-computable approximation of MIC called MICe. This newer statistic approximates MIC by defining its characteristic matrix entries as maximizations over dataset-dependent subsets of G(k, ℓ). Specifically, for a dataset D = (Dx, Dy), a constant c > 0, and for all 2 ≤ k ≤ ℓ, the set E(D, c, k, ℓ) contains all grids G = (P, Q) where Q is a size-ℓ mass equipartition of Dx and P is a size-k subpartition of a size-cℓ mass equipartition of Dy. The set is defined symmetrically for k > ℓ. Then MICe is defined similarly to MIC but replaces the maximizations over G(k, ℓ) with E(D, c, k, ℓ):

Definition 2.3 (MICe statistic, Reshef et al. (2016)). For any dataset D of n points, constant c > 0, and B := B(n), let M^{E(D,c)}_D be the equicharacteristic matrix with (k, ℓ) entries max_{G ∈ E(D,c,k,ℓ)} I*(D|G). Then

MICe(D, B, c) = max_{k,ℓ: kℓ ≤ B(n)} (M^{E(D,c)}_D)_{k,ℓ} .

Each (k, ℓ) entry of the equicharacteristic matrix can be reduced to a maximization defined only over subpartitions on a single axis. This maximization can be computed efficiently using the dynamic-programming algorithm OPTIMIZEAXIS of Reshef et al. (2011). Then in total, MICe(D, B, c) can be computed in O(c²B⁴) time for any B and c, and this becomes O(c²n^{4α}) when B(n) = O(n^α). Here, the constant c can be viewed as an approximation parameter, where a larger value leads to a better approximation of MIC, but at the expense of slower computation¹. More details of computing MICe are given in Appendix A.1.

In addition to its efficiency, MICe is still a consistent estimator of MIC* when B(n) = O(n^α) for α ∈ (0, 0.5) (Reshef et al., 2016; Lazarsfeld & Johnson, 2021). In practice, Reshef et al. (2018; 2016) suggested using α=0.6 and c=15 after evaluating the statistic on synthetic and real data, and so we treat these as its default settings. Given its computational efficiency and consistency, we consider MICe as a starting point for developing private approximations of MIC, but we first briefly recall a few definitions and tools related to differential privacy.

¹Note that Reshef et al. (2016) originally defined MICe with a more complicated dependence on the parameter B to constrain the maximization space. The result is faster computation but with lower accuracy for a given n, though in practice the two variants are similar. We discuss this further in Appendix E.1.
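To illustrate the single-axis reduction described above, the following sketch brute-forces one (k, ℓ) entry of the equicharacteristic matrix over all size-k row subpartitions of the master grid, standing in for the dynamic-programming OPTIMIZEAXIS routine. It reuses normalized_mi from the earlier sketch, assumes distinct coordinate values, and is exponential-time demonstration code rather than the authors' algorithm:

```python
import numpy as np
from itertools import combinations

def equichar_entry(xs, ys, k, l, c=2):
    # One (k, l) entry of the equicharacteristic matrix for k <= l:
    # columns form a size-l mass equipartition of the x-values, and rows
    # range over all size-k subpartitions of a master size-(c*l) mass
    # equipartition of the y-values.
    x_edges = np.quantile(xs, np.linspace(0, 1, l + 1))
    master = np.quantile(ys, np.linspace(0, 1, c * l + 1))
    best = 0.0
    for cuts in combinations(master[1:-1], k - 1):  # choose k-1 row dividers
        y_edges = np.concatenate(([master[0]], list(cuts), [master[-1]]))
        A, _, _ = np.histogram2d(ys, xs, bins=[y_edges, x_edges])
        best = max(best, normalized_mi(A))  # normalized_mi from the sketch above
    return best

# MICe then takes the maximum of such entries over all k*l <= B(n),
# with the k > l entries computed symmetrically.
```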
2.3. Differential Privacy

For two datasets D = (d1, . . . , dn) and D′ = (d′1, . . . , d′m), we say D and D′ are neighboring (denoted D ∼ D′) if n = m and there exists at most one index i where di ≠ d′i. Differential privacy ensures that the output distributions of a (randomized) algorithm are similar when run on D ∼ D′:

Definition 2.4 (Dwork & Roth (2014)). For ϵ > 0, a randomized algorithm A : (R²)ⁿ → R is ϵ-differentially private (ϵ-DP) if, for every D ∼ D′ and every I ⊆ R,

Pr(A(D) ∈ I) ≤ exp(ϵ) · Pr(A(D′) ∈ I) .

In our setting, using an ϵ-DP mechanism to estimate MIC implies that the output leaks little information about any di ∈ D (see Section 7 for more on the privacy semantics). One common approach for designing ϵ-DP mechanisms is to add random noise to the output of a non-private function. When doing so, an important consideration is the sensitivity of the function, which is the maximum possible change in function value over neighboring D ∼ D′:

Definition 2.5 (Sensitivity). The (ℓ1) sensitivity of a function f : (R²)ⁿ → R is Δ := max_{D∼D′} |f(D) − f(D′)|.

For a function f with sensitivity Δ, the Laplace mechanism of Dwork et al. (2006) adds noise from a zero-mean Laplace distribution with parameter Δ/ϵ. The result is ϵ-DP. For convenience, we will write Lap(b) to denote a random variable with a zero-mean Laplace distribution with parameter b, which has density f(x) = e^{−|x|/b}/(2b). Also, for x ∈ R, [x]_{0,1} denotes that x is truncated to the range [0, 1].

3. MICe-Lap Mechanism

We begin by considering the compatibility of MICe with the standard Laplace mechanism. Doing so requires obtaining a bound on the sensitivity of MICe, which for datasets of size n, B(n), and c > 0 we denote by Δn(MICe, B, c). Our first result gives an upper bound on this quantity.

Theorem 3.1. For any B := B(n), c > 0, and n ≥ 6,

Δn(MICe, B, c) ≤ B · ((2 log₂ n)/n + 4.8/n) .

Using the suggested setting B(n) = n^α (Reshef et al., 2011; 2016), Δn(MICe, B, c) = O((log₂ n)/n^{1−α}), which is asymptotically larger than the O((log₂ n)/n) sensitivity of MIC for α ∈ (0, 1) (this MIC sensitivity bound can be obtained from the proof of Theorem 4.3). Theorem 3.1 is proved in Appendix A.2, and the main intuition is that because MICe maximizes over a dataset-dependent set of grids, for D ∼ D′, a worst-case change in I* can occur when the grids for D have no cells in common with those of D′, which changes I* a bit for each of the at most B cells. This sensitivity bound then yields the ϵ-DP MICe-Lap mechanism by adding Laplace noise and truncating:

Mechanism 1 (MICe-Lap). For any dataset D of size n ≥ 6, B := B(n), c > 0, and ϵ > 0, let Δ := B · ((2 log₂ n)/n + 4.8/n). Then

MICe-Lap(D, B, c, ϵ) := [MICe(D, B, c) + Lap(Δ/ϵ)]_{0,1} .

Theorem 3.2. MICe-Lap(·, B, c, ϵ) is ϵ-DP.

Theorem 3.2 is proved in Appendix A. Given that the Laplace sampling can be done in constant time, the runtime of computing MICe-Lap is asymptotically equivalent to that of MICe because only output noise is added. Additionally, due to the [0, 1] truncation, the standard deviation of the mechanism is bounded by that of Lap(Δ/ϵ). So for B = n^α, the mechanism's standard deviation is at most (√2/ϵ) · ((2 log₂ n)/n^{1−α} + 4.8/n^{1−α}). While for a fixed ϵ this value decreases with n, using the suggested α=0.6, this quantity is 1.38 when ϵ=1 and n=5000.
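A minimal sketch of Mechanism 1, assuming a non-private MICe value has already been computed by some external routine; the noise scale follows the sensitivity bound of Theorem 3.1:

```python
import math
import numpy as np

def mic_e_lap(mic_e_value, n, B, eps, rng=np.random.default_rng()):
    # Mechanism 1: Laplace noise scaled to the Theorem 3.1 sensitivity
    # bound, followed by truncation to [0, 1]. mic_e_value is the
    # non-private MICe(D, B, c), computed elsewhere.
    delta = B * ((2 * math.log2(n)) / n + 4.8 / n)  # sensitivity upper bound
    noisy = mic_e_value + rng.laplace(scale=delta / eps)
    return min(1.0, max(0.0, noisy))                # the [x]_{0,1} truncation

# Example with the suggested B(n) = n^0.6: the noise scale is comparable
# to the entire [0, 1] range of the statistic, matching the discussion above.
n = 5000
print(mic_e_lap(0.5, n, n ** 0.6, eps=1.0))
```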
Given that the MICe value is in [0, 1], the outputs of MICe-Lap have intolerably high error, even for moderately large values of ϵ and n. This result motivates designing an alternative MIC approximation with lower-error private variants.

4. MICr Mechanisms

4.1. Non-private MICr Statistic

Despite being efficiently computable and a consistent estimator, the dataset dependence and resulting high sensitivity of MICe preclude a straightforward, low-error private variant. To that end, we introduce the MICr approximation for MIC. It remains both efficiently computable and a consistent estimator of MIC*, but it optimizes over dataset-independent sets of grids, yielding a lower sensitivity. Indeed, its sensitivity matches that of MIC, making it more compatible with designing differentially private variants. We begin by describing the (non-private) MICr statistic before introducing two differentially private variants.

The key difference between MICe and MICr is that the former maximizes over mass equipartitions while the latter maximizes over range equipartitions. Therefore, in addition to a maximum grid size parameter B and a finite c > 0, MICr also takes as a parameter a range L ⊆ R² of the form L = [x0, x1] × [y0, y1]. When computing MICr, we require that the coordinates of all points of D lie within L. This requirement can be satisfied by most types of data, for example those with a defined range or using conservative maxima and minima estimates. L must be dataset-independent.

For any D restricted to range L = Lx × Ly and c > 0, the dataset-independent sets of grids used to define the entries of the analogous equicharacteristic matrix for MICr are defined as follows. First, for an interval I ⊆ R, let R_{I,ℓ} denote the size-ℓ range equipartition of I, and let P(I, k, [j]) denote the set of all size-k subpartitions of a size-j range equipartition of I. Then for k ≤ ℓ, we define R(L, c, k, ℓ) as the set of all grids G = (P, Q) where Q = R_{Lx,ℓ} and P ∈ P(Ly, k, [cℓ]). When k ≤ ℓ, we call the grid Γ_{L,c,ℓ} = (R_{Ly,cℓ}, R_{Lx,ℓ}) the master range equipartition for R(L, c, k, ℓ). When k > ℓ, the set R(L, c, k, ℓ) and the master grid Γ_{L,c,k} are defined symmetrically. The full definition of MICr then follows similarly to MICe:

Definition 4.1 (MICr statistic). For any dataset D of n points with range restricted to L, c > 0, and B := B(n), let M^{R(L,c)}_D denote the range equicharacteristic matrix with (k, ℓ)-th entry max_{G ∈ R(L,c,k,ℓ)} I*(D|G). Then

MICr(D, L, B, c) = max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_D)_{k,ℓ} .

As with MICe, the MICr statistic can be computed in O(c²B⁴) time using the OPTIMIZEAXIS routine, and c can again be viewed as an approximation parameter. We also prove that MICr is still a consistent estimator of MIC* when B(n) = O(n^α) for α ∈ (0, 0.5).

Theorem 4.2. MICr is a consistent estimator of MIC*.

Details of the computation and runtime of MICr appear in Appendix B.1, and a more formal statement of the consistency result is given by Theorem B.5 in Appendix B.2. In addition to consistency, we leverage the fact that the sets R(L, c, k, ℓ) are dataset-independent to prove an upper bound on the ℓ1 sensitivity of MICr(·, L, B, c), which for datasets of size n we denote by Δn(MICr, L, B, c).

Theorem 4.3. For any L, c > 0, B := B(n), and n ≥ 4,

Δn(MICr, L, B, c) ≤ (4 log₂ n)/n + 6/n .

Compared to the sensitivity bound for MICe, the bound here loses the multiplicative B factor.
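Because range equipartitions depend only on L, ℓ, and c (never on the data), the master count matrix A_{D,Γ} is an ordinary fixed-bin 2D histogram. A sketch, with our own function name:

```python
import numpy as np

def master_count_matrix(xs, ys, L, l, c):
    # Count matrix A_{D,Gamma} for the master range equipartition
    # Gamma_{L,c,l} (case k <= l): c*l equal-width rows over [y0, y1] and
    # l equal-width columns over [x0, x1]. The bin edges depend only on
    # L, l, and c, never on the data, so neighboring datasets change at
    # most two entries, each by at most 1.
    (x0, x1), (y0, y1) = L
    y_edges = np.linspace(y0, y1, c * l + 1)
    x_edges = np.linspace(x0, x1, l + 1)
    A, _, _ = np.histogram2d(ys, xs, bins=[y_edges, x_edges])
    return A.astype(int)

# Example: n=8 points in L = [0,1] x [0,1] with l=4 columns and c=2.
rng = np.random.default_rng(1)
xs, ys = rng.random(8), rng.random(8)
print(master_count_matrix(xs, ys, ((0, 1), (0, 1)), l=4, c=2))
```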
This is because the sets of grids considered by MICr are fixed for all datasets of the same size, and for any G and D ∼ D′, the count matrices A_{D,G} and A_{D′,G} can differ by at most 1 at exactly 2 entries. The proof of the theorem is developed in Appendix B.3.

Equipped with the fact that MICr is both a consistent estimator of MIC* and has low sensitivity, MICr is a more suitable base statistic for designing high-utility private variants. To this end, because two classical approaches to designing differentially private mechanisms (output perturbation, as with the Laplace mechanism, and input perturbation, as with private histograms (Dwork & Roth, 2014)) appear to have equal potential for achieving this task, we designed two private variants, each one following a different approach.

4.2. MICr-Lap Mechanism

The first private variant is analogous to MICe-Lap, but using the smaller MICr sensitivity bound from Theorem 4.3.

Mechanism 2 (MICr-Lap). For any dataset D of size n ≥ 4 restricted to L, any B := B(n), any c > 0, and any ϵ > 0, let Δ := (4 log₂ n)/n + 6/n. Then

MICr-Lap(D, L, B, c, ϵ) := [MICr(D, L, B, c) + Lap(Δ/ϵ)]_{0,1} .

Theorem 4.4. MICr-Lap(·, L, B, c, ϵ) is ϵ-DP.

Again using the standard deviation of the Laplace mechanism, for any B, the standard deviation of MICr-Lap is at most (√2/ϵ) · ((4 log₂ n)/n + 6/n). When n=5000 and ϵ=1 this value is 0.02, which means (as we show in Section 5) that the error of the private output is likely small enough for the mechanism to be usable in practice. Also, because for fixed ϵ the standard deviation is decreasing with n, and using the consistency of MICr, it is straightforward to show that MICr-Lap is still a consistent estimator of MIC*, and we prove this in Theorem C.1 in Appendix C.

4.3. MICr-Geom Mechanism

Our second private variant, MICr-Geom, adds noise during the computation of each entry in the MICr range-equicharacteristic matrix. Recall for subgrids G ⊑ Γ that the count matrix A_{D,G} (and thus P_{D,G}) can be generated via the function ψ(A_{D,Γ}, Γ, G), which doesn't depend directly on the coordinates of D. Then the (k, ℓ) entry of the range-equicharacteristic matrix for MICr-Geom is a maximization of I* over grids G ∈ R(L, c, k, ℓ), but where the count matrix for each grid is generated via ψ(Â, Γ, G), where Â is some noisy approximation of A_{D,Γ} and Γ := Γ_{L,c,ℓ} is the master range equipartition for R(L, c, k, ℓ).

Specifically, as established earlier, for any neighboring D ∼ D′ of size n, at most two corresponding entries of A_{D,Γ} and A_{D′,Γ} can differ, and they differ by at most 1. Thus we generate a noisy version of A_{D,Γ} by producing independent, noisy estimates for each of its entries using the ϵ-DP Truncated Geometric mechanism of Ghosh et al. (2012), which applies to counts with sensitivity at most 1. Intuitively, Trunc-Geom(ϵ, n, f) is a doubly-geometric distribution centered at f with parameter e^{−ϵ}, with the tails truncated at 0 and n. We summarize the distribution here:

Definition 4.5 (Trunc-Geom, (Ghosh et al., 2012)). For any ϵ > 0, n, and 0 ≤ f ≤ n, let Trunc-Geom(ϵ, n, f) be a discrete distribution over {0, . . . , n}. Set ρ := e^{−ϵ}. Then:
- Trunc-Geom(ϵ, n, f) = 0 w.p. ρ^f/(1 + ρ).
- Trunc-Geom(ϵ, n, f) = i w.p. ((1 − ρ)/(1 + ρ)) · ρ^{|f−i|} for all 1 ≤ i ≤ n − 1.
- Trunc-Geom(ϵ, n, f) = n w.p. ρ^{(n−f)}/(1 + ρ).
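A direct sampler for Definition 4.5, written from the stated PMF by sampling over the finite support; this is an illustrative sketch, not the constant-time sampler referenced in the runtime analysis below:

```python
import numpy as np

def trunc_geom_sample(eps, n, f, rng=np.random.default_rng()):
    # Trunc-Geom(eps, n, f): two-sided geometric distribution centered at
    # the true count f with parameter rho = e^{-eps}, with the tail mass
    # below 0 collapsed onto 0 and the tail mass above n collapsed onto n.
    rho = np.exp(-eps)
    i = np.arange(n + 1)
    pmf = ((1 - rho) / (1 + rho)) * rho ** np.abs(f - i)
    pmf[0] = rho ** f / (1 + rho)            # truncated lower tail
    pmf[n] = rho ** (n - f) / (1 + rho)      # truncated upper tail
    return rng.choice(i, p=pmf / pmf.sum())  # renormalize away float error

# Example: privatize a single cell count f=3 out of n=8 points with
# parameter eps/2, as in step 1 of Mechanism 3 below.
print(trunc_geom_sample(0.5, 8, 3))
```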
We then define the MICr-Geom mechanism formally and give its privacy guarantee as follows:

Mechanism 3 (MICr-Geom). Fix any dataset D of size n restricted to L, any c > 0, any B := B(n), and any ϵ > 0.

1. For every k, ℓ ≥ 2, let Γ denote the master range-equipartition grid for R(L, c, k, ℓ). Let Â := Â^ϵ_{D,Γ} be the noisy count matrix whose (i, j) entry is given by Trunc-Geom(ϵ/2, n, a(i, j)) (where a(i, j) is the corresponding entry of A_{D,Γ}). Let n̂ be the sum of entries in Â, and let P̂ = (1/n̂) Â.
2. For every G ∈ R(L, c, k, ℓ), let P̂_G := ψ(P̂, Γ, G).
3. Let M̂^{R(L,c)}_{D,ϵ} be the noisy range-equicharacteristic matrix with (k, ℓ) entry max_{G ∈ R(L,c,k,ℓ)} I*(P̂_G). Then:

MICr-Geom(D, L, B, c, ϵ) = max_{k,ℓ: kℓ ≤ B(n)} (M̂^{R(L,c)}_{D,ϵ})_{k,ℓ} .

Theorem 4.6. MICr-Geom(D, L, B, c, ϵ) is ϵ-DP.

With a linear-time additive preprocessing step, samples from the Trunc-Geom distribution can be drawn in constant time, and thus the running time of computing MICr-Geom is asymptotically equivalent to MICr and MICr-Lap. The details of both the privacy statement and the runtime analysis are given in Appendices D.1 and D.2 respectively. Additionally, we prove that for a fixed ϵ and c, the error introduced from using the noisy count matrices Â^ϵ_{D,Γ} goes to 0 as n grows:

Theorem 4.7 (Added error of MICr-Geom). Fix any α ∈ (0, 0.5), finite c > 0, ϵ > 0, and dataset D of size n. For sufficiently large n, there exists some a > 0 such that

| (M̂^{R(L,c)}_{D,ϵ})_{k,ℓ} − (M^{R(L,c)}_D)_{k,ℓ} | = O((c/ϵ) · n^{−a})

for all kℓ ≤ B(n) = O(n^α) simultaneously with probability at least 1 − O(n^{−2}).

Intuitively, the dependence on c in the error bound is a result of having a master range-equipartition size of cℓ² (wlog when k ≤ ℓ). So choosing larger c yields a better approximation of MIC but results in a noisier Â (and thus a slower convergence to the non-private MICr). We explore this tradeoff more experimentally in Section 5. Together with the fact that MICr is a consistent estimator of MIC* (Theorem B.5), it is straightforward to prove that MICr-Geom is also a consistent estimator of MIC*. We prove this in Theorem D.4 in Appendix D.3.

5. Experimental Evaluation

To investigate their utility, we evaluated the three private mechanisms on both synthetic and real datasets. We used synthetic data (following the methodology of Reshef et al. (2011; 2016)) to help better understand the error of our mechanisms at growing sample sizes over a variety of distributions designed to capture different types of relationships. Using real data helps to additionally verify the utility of our mechanisms in practice at specific, fixed sample sizes. The code and data used to obtain our experimental results can be accessed at https://github.com/jlazarsfeld/dp-mic, and more implementation details are given in Appendix E.1.

5.1. Synthetic Data

We considered the family of 21 functional relationships introduced by Reshef et al. (2011). Similar to their later work (Reshef et al., 2016), for every relationship we defined 9 joint distributions, each generated by placing k=100 independent bivariate Gaussian distributions (with zero correlation and identical variances) centered at points evenly-spaced along the function graph. The 9 distributions were parameterized by an R² value in {0.1, 0.2, . . . , 0.9} that determined the variances of each Gaussian. The result is a set Q of 189 joint distributions bounded in range by [0, 1] × [0, 1], each representing a functional relationship with varying levels of noise. For each Π ∈ Q, we computed an approximation of MIC*(Π) using the (provably convergent) method of Reshef et al. (2016). For n ranging from 25 to 10,000 and ϵ ∈ {0.1, 1.0}, we measured the accuracy of each mechanism (wrt MIC*(Π)) on datasets of n points sampled i.i.d. from Π.
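A sketch of this sampling scheme for a single relationship, using a sine curve as a stand-in for one of the 21 functions; the mixture construction follows the description above, while the mapping from the R² parameter to the Gaussian variance (here a fixed sigma) is a simplification of the calibration used in the paper:

```python
import numpy as np

def sample_mixture(func, n, sigma, k=100, rng=np.random.default_rng()):
    # n i.i.d. draws from a mixture of k bivariate Gaussians (zero
    # correlation, common variance sigma^2) centered at evenly-spaced
    # points on the graph of func over [0, 1].
    centers_x = np.linspace(0, 1, k)
    centers = np.column_stack([centers_x, func(centers_x)])
    picks = rng.integers(0, k, size=n)
    pts = centers[picks] + sigma * rng.standard_normal((n, 2))
    return np.clip(pts, 0, 1)  # keep the range bounded by [0,1] x [0,1]

# Example: n=500 points around a noisy sine relationship.
D = sample_mixture(lambda x: 0.5 + 0.4 * np.sin(4 * np.pi * x), 500, sigma=0.05)
print(D.shape)
```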
The full details of this dataset generation process are included in Appendix E.2. Given that MICe-Lap simply adds noise to MICe, we set its parameters to B(n) = n^0.6 and c = 15, matching the suggested settings for MICe. However, because MICr-Lap and MICr-Geom use a different base statistic, we evaluated these mechanisms with varying B and c to better determine optimal parameter settings for a fixed n and ϵ.

Parameter Tuning for MICr-Geom, MICr-Lap: We considered sample sizes of n ∈ {25, 250, 500, 1000, 5000, 10000}, ϵ ∈ {0.1, 1.0}, and various values of B between 4 and 150. Although the consistency guarantees for MICr-Geom and MICr-Lap are phrased in terms of B := B(n) = O(n^α), defining B in absolute terms helps us better determine optimal values of B for each mechanism. For computational considerations, we fixed c = 5 for the MICr-Lap mechanism (we found the mechanism to be insensitive to larger values of c), and for the MICr-Geom mechanism we considered c ∈ {1, 2} (under the consideration that larger c could worsen accuracy). For each combination of (n, ϵ, B, c), distribution Π ∈ Q, and mechanism, we ran 50 iterations of the following process: (1) construct a dataset D by sampling n points i.i.d. from Π, and (2) run the private mechanism on D. For each mechanism, n, and ϵ, we minimized an objective function over the (B, c) parameters that involved a weighted sum of the mechanism's average absolute error (wrt MIC*) across all distributions in Q. The objective function was designed to choose parameters that could ensure parity in error across both low and high-correlation distributions, and the exact description is given in Appendix E.2.

The optimized B and c parameters for MICr-Lap and MICr-Geom are summarized in Table 2 in Appendix E.2. For both mechanisms and each ϵ, the optimal B values are generally increasing with n, which aligns with the intuition that the mechanisms converge toward MIC* with larger n. The optimal values ranged from 12 to 150 for B and 1 to 5 for c.

Figure 2. Boxplots of the bias and variance of each mechanism (over 50 iterations) over all Π ∈ Q for ϵ=1 and varying n.

Bias/Variance Evaluation: Using the parameters from Table 2 for MICr-Lap and MICr-Geom and (B(n), c) = (n^0.6, 15) for MICe-Lap, for each mechanism we compared the bias (average signed error wrt MIC*) and variance (of the 50 private iterations for a fixed distribution) over all 189 distributions in Q as n grows. The results for ϵ=1 are summarized in Figure 2, and analogous plots for ϵ=0.1 are given in Figure 4 of Appendix E.2. In both subplots in Figure 2, for every value of n, we show boxplots of the bias (resp. variance) for each mechanism over all Π ∈ Q. Recall each box represents the interquartile range (IQR, 25th to 75th quantiles) of the data, the whiskers appear above and below these quantiles by an additional 1.5x of the IQR, and outlier points beyond the whiskers are plotted individually. For ϵ=1, the IQR of the bias of each mechanism generally drops with n, and this is most prominent for MICr-Lap and MICr-Geom. In particular, when n=5000, the median bias of MICr-Lap is 0.01 with min and max biases of -0.1 and 0.02, which we consider tolerably low. On the other hand, while the IQR of the bias of MICr-Geom at n=5000 is also small, the outlier points indicate that the mechanism has large negative bias on a subset of distributions in Q.
Specifically, while the median and max bias of MICr-Geom at n=5000 are 0.01 and 0.04 respectively, the min bias is -0.37. Because the MICr-Geom mechanism generates noisy counts for every cell of a master grid, we found this negative bias to occur on datasets for which the non-noisy master count matrix contains a large submatrix with mostly zero entries, and this corresponds to datasets restricted to L that have large sub-regions with no points. When ϵ=0.1, Figure 4 (Appendix E.2) shows similar trends for the bias of each mechanism, but where decreases with n are slower. Additionally, the variances of MICr-Lap and MICr-Geom are significantly smaller than for MICe-Lap. For MICr-Geom this variance is particularly low, even for small n. For example, for n=250 and ϵ=1, the median variance of MICr-Geom is 0.003, and these medians decrease for larger n. In general, we observe (especially for smaller n) a bias/variance tradeoff between MICr-Lap (less bias, more variance) and MICr-Geom.

5.2. Real Data

We also evaluated the utility of the three private mechanisms on two sets of data used in the experiments of Reshef et al. (2011): the Spellman data and the Baseball data. Note that these datasets do not necessarily contain sensitive information, and these sources were chosen mainly because of their previous use by Reshef et al. (2011). Because both sets contain multiple columns, we constructed for each source a collection of datasets corresponding to different pairs of columns, and we first describe this process in more detail.

Dataset Description: The Spellman data (Spellman et al., 1998; Reshef et al., 2011) contains gene expression measurements for 4381 genes in the yeast organism, where each gene has a timeseries (at common, fixed time points) of n=23 measurements. Modeling the methodology of Reshef et al. (2011), for each gene, we consider the dataset D = ([23], Ti), where Ti is the timeseries for the i-th gene. For each i, we let li = |max Ti − min Ti|, y0 = (min Ti) − li/100, y1 = (max Ti) + li/100, and we use the range bounds Li = [0, 24] × [y0, y1]. This results in a collection of m = 4381 datasets D, each of size n = 23, and we refer to this as the Spellman23 collection.

Because the size of each dataset in Spellman23 is small (n=23), and given that we expect the error of our mechanisms to decrease with larger n, we also constructed a collection of higher-dimensional datasets from the Spellman source as follows: for each fixed time index t ∈ [23], let Ct denote the set of measurements for all 4381 genes at time index t. Then for each unique pair t, v ∈ [23], we constructed the dataset D = (Ct, Cv). We set global range bounds by considering the maximum and minimum values across all Ti, denoted by ℓ0 and ℓ1 respectively, and by setting x0 = ℓ0 − |ℓ1 − ℓ0|/100, x1 = ℓ1 + |ℓ1 − ℓ0|/100, and L = [x0, x1] × [x0, x1] for all i. The result is a collection of m = 253 datasets, each of size n = 4381. We refer to this as the Spellman4381 collection.
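The per-gene range bounds above follow a simple padding rule (1% of the observed spread on each side); a sketch of that computation, with a hypothetical helper name:

```python
def padded_bounds(values, pad_frac=0.01):
    # Range bounds as in the Spellman23 construction: pad [min, max] by
    # 1% of the observed spread on each side.
    lo, hi = min(values), max(values)
    spread = abs(hi - lo)
    return (lo - pad_frac * spread, hi + pad_frac * spread)

# Example: bounds L_i = [0, 24] x [y0, y1] for one gene's timeseries T_i.
T_i = [1.2, 0.8, 1.5, 0.9, 1.1]
y0, y1 = padded_bounds(T_i)
L_i = ((0, 24), (y0, y1))
print(L_i)
```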
Finally, the Baseball dataset (Reshef et al., 2011; Prospectus, Accessed March 2020) contains values of 129 performance-related statistics for 337 players from the 2008 MLB season. Again following Reshef et al. (2011), for the i-th statistic, we let Ci denote the set of corresponding values across all 337 players, and for each unique pair i, j ∈ [129], we constructed the dataset D = (Ci, Cj). We generated range bounds [x0, x1] for Ci and [y0, y1] for Cj using the same process as for each Ti in Spellman23. We then set L = [x0, x1] × [y0, y1]. The result is a collection of m = 8256 datasets of size n=337, which we refer to as the Baseball collection.

For the purposes of these experiments, the range bounds L for each collection of datasets were constructed manually. However, we assume in general that practitioners with domain-specific knowledge can set reasonable bounds for L without observing a particular dataset. For example, a practitioner with baseball acumen could set the range for the statistic Number of Games Played to [−1, 183], since there are 182 games in an MLB season.

Evaluation of Private Error: For each dataset in the three collections described above, we measured the error of all private mechanisms with respect to the dataset's non-private MICe score using parameters B(n) = n^0.6 and c = 15. Although MICr shares similar theoretical properties as MICe, we primarily view MICr as a conduit for private approximations of MIC. Thus measuring private error with respect to the existing MICe statistic allows for a more realistic evaluation of our mechanisms' utility. For every dataset, we ran 100 computations of each private mechanism for ϵ=1 and ϵ=0.1 (we mainly describe here the results for ϵ=1, but the corresponding tables and figures for ϵ=0.1 can be found in Appendix E.3). For MICr-Geom and MICr-Lap, we set the B and c parameters by linearly interpolating values from Table 2. For MICe-Lap, we set B(n) = n^0.6 and c = 15 (matching the settings from the synthetic data evaluation).

Table 1. The median bias (average signed error wrt MICe over 100 runs) and median variance (over 100 iterations) of each private mechanism across all datasets of each collection for ϵ=1.

                MICe-Lap       MICr-Lap       MICr-Geom
Spellman23      0.18 (0.23)    0.14 (0.17)    0.24 (0.01)
Baseball        0.30 (0.21)    0.02 (0.02)    0.06 (9e-4)
Spellman4381    0.26 (0.17)    -0.01 (4e-4)   0.02 (1e-4)

For each collection of datasets, Table 1 lists the median bias (average signed error wrt MICe) and median variance (over 100 iterations) of each private mechanism over all datasets in the collection. For both MICr-Lap and MICr-Geom, the median bias drops significantly for the Baseball and Spellman4381 datasets compared to Spellman23. Similar to our evaluation on synthetic data, this supports the intuition that the error incurred due to the privatization of MICr (in both mechanisms) decreases with n. Additionally, although MICe-Lap has relatively small median bias for Spellman23, its median average unsigned error was the largest (0.46 compared to 0.39 and 0.25 for MICr-Lap and MICr-Geom, respectively). While the median average unsigned errors for MICr-Lap and MICr-Geom on Spellman4381 (0.016 and 0.019) and on Baseball (0.097 and 0.068) are low, Table 1 shows these mechanisms still incur large error on Spellman23, which indicates that none of the private mechanisms are accurate enough for datasets as small as n=23. On the other hand, to better understand the practicality of the mechanisms in the higher-dimensional regime, we further analyze their error on datasets in the Baseball and Spellman4381 collections, with a particular focus on how error varies for low and high-correlation datasets (using low and high MICe as a proxy).

Figure 3. Bias and variance boxplots for each mechanism over datasets (pairs) in the Spellman4381 collection (top) and Baseball collection (bottom) binned by non-private MICe score for ϵ=1.
To this end, each dataset of the Spellman4381 collection is binned according to its non-private MICe score using interval endpoints [0, 0.1, 0.2, 0.3, 0.4, 1]. Similarly, each dataset in the Baseball collection is binned by its MICe score using interval endpoints [0, 0.2, 0.4, 0.6, 0.8, 1]. For both collections, we analyzed the bias and variance of each mechanism across all pairs in a fixed bin. Figure 3 displays bias and variance boxplots for each mechanism using ϵ=1 for both collections, and similar plots for ϵ=0.1 are provided in Appendix E.3.

For both collections, the IQRs for bias for both MICr-Lap and MICr-Geom in every bin are smaller, and with medians closer to zero, than those of MICe-Lap. In addition, in both collections the variances for MICr-Lap and MICr-Geom are significantly smaller across all bins compared to MICe-Lap, and in general we see a bias/variance tradeoff between MICr-Lap and MICr-Geom similar to the synthetic data. For MICr-Geom, we also notice on both collections that for datasets with the lowest MICe scores, the mechanism tends to have a slightly more positive bias (medians of 0.03 and 0.09 for Spellman4381 and Baseball) compared to MICr-Lap (medians of -0.01 and 0.03). This positive bias of MICr-Geom likely occurs because generating private counts using Trunc-Geom on datasets with low MICe scores (which have more uniform scatterplots) can reduce some of the noise in the dataset. In contrast, the slight negative bias of MICr-Geom and MICr-Lap on datasets with the largest MICe values (in both collections, but particularly for Baseball) is likely due to underlying approximation differences between (non-private) MICe and MICr. Also, while some outlier datasets with magnitudes of bias as large as 0.5 exist in the Baseball collection for both MICr-Lap and MICr-Geom, they are most prevalent for datasets with the smallest and largest MICe scores (left and right-most bins). This means it is less likely that the output of either mechanism on one of these datasets would suggest a highly-correlated pair (say, output above 0.5) when the true MICe score is low (or vice versa). Moreover, for the Spellman4381 collection, these slight positive and negative biases are less apparent, and the min and max biases of -0.04 and -0.002 for MICr-Lap and -0.04 and 0.06 for MICr-Geom across all bins indicate that for n as small as 4381, the error of these two mechanisms is sufficiently small to be usable in practice.

6. Related Work

MIC is one of several recently proposed statistics for detecting non-linear dependencies between variables in the non-private setting. Other works that propose measures like the Randomized Dependence Coefficient (Lopez-Paz et al., 2013) and the ξn coefficient (Chatterjee, 2021) compare results directly with MIC and claim slightly improved computational properties and more intuitive dependence measurements for certain classes of distributions. However, MIC and MICe have been more thoroughly investigated (Reshef et al., 2016; 2018; 2020) and analyzed (Kinney & Atwal, 2014; Reshef et al., 2014) and have found wide use in practice, particularly in bioinformatics settings (Albanese et al., 2018; Cao et al., 2021).
While no private variants of competing measures of dependence have been introduced, much work has been devoted to differentially private estimation of other statistical properties, for example mean estimation (Kamath et al., 2020), hypothesis selection (Bun et al., 2019), quantiles (Gillenwater et al., 2021), and false discovery rate control (Zhang et al., 2021). We contribute to both lines of work by introducing low-error, private variants for measuring non-linear dependencies.

7. Discussion and Future Work

In practice, publishing MIC values privately requires a broad consideration of the analysis process. If multiple statistics are published, then each must be allocated some fraction of an overall privacy budget. Moreover, choosing which statistics to publish based on the data may leak information, and thus the selection process itself should be differentially private. To optimize accuracy, we recommend that the parameters B and c be optimized for the size n of the dataset of interest and the desired privacy level ϵ, though linear interpolation across precomputed values can be used for speed. The privacy semantics of the results depend on what each data point represents. For example, we obtain user-level privacy if distinct individuals contribute each point and event-level privacy if points represent different events.

There are several promising directions to improve and apply this work. We have not yet considered how to use the local sensitivity (Nissim et al., 2007) of a given dataset to improve the accuracy of a DP MIC statistic. Also, it may be possible to remove the requirement for MICr that the range of the data be known in advance by estimating it in a differentially private way. Moreover, we note that the noisy master grid of MICr-Geom is already differentially private, and so more-accurate methods to estimate MIC from this object seem possible, potentially by exploiting the regular pattern of bias we observed experimentally (where bias decreases with non-private MICe).

Acknowledgments

This work was supported by the Office of Naval Research. The authors would like to thank Joan Feigenbaum for advising and supporting the early stages of this work, Argyris Oikonomou and Grigoris Velegkas for their contributions in an initial analysis of the sensitivity of MIC, and Yakir Reshef for helpful feedback on the consistency of MICe.

References

Albanese, D., Riccadonna, S., Donati, C., and Franceschi, P. A practical tool for maximal information coefficient analysis. GigaScience, 2018.

Bun, M., Kamath, G., Steinke, T., and Wu, Z. S. Private hypothesis selection. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Cao, D., Chen, Y., Chen, J., Zhang, H., and Yuan, Z. An improved algorithm for the maximal information coefficient and its application. Royal Society Open Science, 2021.

Chatterjee, S. A new coefficient of correlation. Journal of the American Statistical Association, 2021.

Cormode, G., Jha, S., Kulkarni, T., Li, N., Srivastava, D., and Wang, T. Privacy at scale: Local differential privacy in practice. In Proceedings of the International Conference on Management of Data, 2018.

Dwork, C. Differential privacy and the US census. In Proceedings of the Symposium on Principles of Database Systems (PODS), 2019.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A.
Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.

Ghosh, A., Roughgarden, T., and Sundararajan, M. Universally utility-maximizing privacy mechanisms. SIAM J. Comput., 2012.

Gillenwater, J., Joseph, M., and Kulesza, A. Differentially private quantiles. In Proceedings of the International Conference on Machine Learning (ICML), 2021.

Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J. V., Stephan, D. A., Nelson, S. F., and Craig, D. W. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008.

Kamath, G., Singhal, V., and Ullman, J. Private mean estimation of heavy-tailed distributions. In Proceedings of the Conference on Learning Theory (COLT), 2020.

Kinney, J. B. and Atwal, G. S. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences (PNAS), 2014.

Lazarsfeld, J. and Johnson, A. Consistency of the maximal information coefficient estimator. CoRR, abs/2107.03836, 2021.

Lopez-Paz, D., Hennig, P., and Schölkopf, B. The randomized dependence coefficient. Advances in Neural Information Processing Systems (NeurIPS), 2013.

Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2007.

Prospectus, B. Baseball Prospectus statistics reports (2009), Accessed March 2020. URL www.baseballprospectus.com/sortable/.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. Detecting novel associations in large data sets. Science, 2011.

Reshef, D. N., Reshef, Y. A., Mitzenmacher, M., and Sabeti, P. C. Cleaning up the record on the maximal information coefficient and equitability. Proceedings of the National Academy of Sciences (PNAS), 2014.

Reshef, D. N., Reshef, Y. A., Sabeti, P. C., and Mitzenmacher, M. An empirical study of the maximal and total information coefficients and leading measures of dependence. The Annals of Applied Statistics, 2018.

Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. Measuring dependence powerfully and equitably. The Journal of Machine Learning Research (JMLR), 2016.

Reshef, Y. A., Reshef, D. N., Sabeti, P. C., and Mitzenmacher, M. Equitability, interval estimation, and statistical power. Statistical Science, 2020.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 1998.

Wang, R., Li, Y. F., Wang, X., Tang, H., and Zhou, X. Learning your identity and disease from research papers: information leaks in genome wide association study. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2009.

Zhang, W., Kamath, G., and Cummings, R. PAPRIKA: Private online false discovery rate control. In Proceedings of the International Conference on Machine Learning (ICML), 2021.

A. MICe and MICe-Lap Details

This section provides more details on the MICe and MICe-Lap statistics and develops the proofs of Theorem 3.1 (MICe sensitivity) and Theorem 3.2 (MICe-Lap privacy).

A.1. Computing MICe
We first give more details on computing MICe (which also provides the foundation for computing MICr). For any c > 0 and B := B(n), MICe can be computed efficiently using the OPTIMIZEAXIS routine of Reshef et al. (2011): for a fixed dataset D of size n and a fixed (wlog) k ≤ ℓ, OPTIMIZEAXIS takes as input a fixed column partition Γx of size ℓ and a fixed master row partition Γy of size k̂, and simultaneously outputs the size-k subpartition Pk ⊑ Γy (where k ≤ k̂) that maximizes I*(D|(Pk, Γx)) for all k ≤ min{ℓ, B(n)/ℓ} in O((k̂)²kℓ) time. We refer the reader to Appendix 3 of Reshef et al. (2011) for exact implementation details of OPTIMIZEAXIS, which uses a dynamic-programming-based approach to perform the maximization. Using this subroutine, the runtime of computing MICe from Definition 2.3 can be stated as follows:

Theorem A.1 (Reshef et al. (2016)). For any dataset D of size n, B := B(n), and c > 0, MICe(D, B, c) can be computed using OPTIMIZEAXIS in O(c²B⁴) time.

Proof. As stated in Definition 2.3, for a fixed ℓ, for every k ≤ ℓ, the (k, ℓ)-th entry of the equicharacteristic matrix uses a master row mass equipartition of size k̂ = c·ℓ, for some fixed c > 0. Then for each ℓ = 2, . . . , B/2, the OPTIMIZEAXIS routine runs in O(c²ℓ² · kℓ) = O(c²ℓ²B) time, since kℓ ≤ B. In sum, running this routine for each ℓ = 2, . . . , B/2 (to compute all equicharacteristic matrix entries where ℓ ≥ k) takes O(c²B⁴) time. By symmetry, computing all equicharacteristic matrix entries where k > ℓ also takes O(c²B⁴) time, and thus the theorem statement follows.

A.2. MICe Sensitivity Upper Bound

We now develop the proof of Theorem 3.1, which bounds the ℓ1-sensitivity of the MICe statistic. First, recall from Section 2 that for a fixed k ≤ ℓ and for a fixed D of n points, E(D, c, k, [ℓ]) is the set of all grids G with k rows and ℓ columns where the corresponding distribution D|G has mass-equipartitioned columns. Equivalently, for G ∈ E(D, c, k, [ℓ]), the count matrix A_{D,G} is of size k × ℓ, and its column sums are all equal². For k > ℓ, the set E(D, c, [k], ℓ) is analogously defined as the set of all k × ℓ grids G where D|G has equipartitioned rows.

Recall also from Section 2 that for a dataset D of n points, a maximum grid size parameter B := B(n), and a constant c > 0, we define

MICe(D, B, c) = max_{k,ℓ: kℓ ≤ B} (M^{E(D,c)}_D)_{k,ℓ} ,  (1)

where M^{E(D,c)}_D is the equicharacteristic matrix of D. Then for any (k, ℓ) pair, the (k, ℓ) entry of M^{E(D,c)}_D is computed by

(M^{E(D,c)}_D)_{k,ℓ} = max_{G ∈ E(D,c,k,[ℓ])} I(D|G) / log₂ k  if k ≤ ℓ , and
(M^{E(D,c)}_D)_{k,ℓ} = max_{G ∈ E(D,c,[k],ℓ)} I(D|G) / log₂ ℓ  if k > ℓ .  (2)

Fix k ≤ ℓ (which we assume without loss of generality, since the analogous arguments when k > ℓ follow by symmetry).

²In the case that n/ℓ is not whole, let the first ℓ − 1 columns of A_{D,G} contain ⌈n/ℓ⌉ points each, and let the ℓ-th column contain the remainder of the points. Additionally, we assume that the coordinates of every point in D are unique, since we can perturb the position of each point by a small amount without affecting the output of the MICe statistic.

Because the set E(D, c, k, [ℓ]) depends on D, it follows that E(D, c, k, [ℓ]) ≠ E(D′, c, k, [ℓ]) for datasets D ≠ D′. This means that in (2), the values (M^{E(D,c)}_D)_{k,ℓ} and (M^{E(D′,c)}_{D′})_{k,ℓ} are defined as maximizations over nonequal constraint sets. To derive an upper bound on the sensitivity of MICe(·, B, c), it is helpful to define these two maximizations using a common constraint set, one that depends on n, but not on D or D′.
To this end, we will proceed by developing a new, equivalent formulation for entries in M^{E(D,c)}_D, where the constraint set depends only on n.

A.2.1. ALTERNATE FORMULATION OF EQUICHARACTERISTIC MATRIX

For a k × ℓ grid G and a fixed dataset D of size n, recall that D|G is a joint probability distribution over the discrete space {1, . . . , k} × {1, . . . , ℓ}. So the distribution can be represented by the normalized count matrix P_{D,G} ∈ R^{k×ℓ}, where P_{D,G} = (1/n) A_{D,G}. Since by definition Σ_{i,j} (A_{D,G})_{i,j} = n, we must have Σ_{i,j} (P_{D,G})_{i,j} = 1. For readability, when D and G are fixed and clear from context, we will drop the subscripts and write A and P.

The row and column marginal distributions of D|G can also be represented in vector form. First, letting (a_{D,G})_{i,·} and (a_{D,G})_{·,j} denote the i-th row vector and j-th column vector of A_{D,G} respectively, we define r_{D,G} ∈ Z^k and c_{D,G} ∈ Z^ℓ as

r_{D,G} = ( ‖(a_{D,G})_{1,·}‖₁ , . . . , ‖(a_{D,G})_{k,·}‖₁ ) and c_{D,G} = ( ‖(a_{D,G})_{·,1}‖₁ , . . . , ‖(a_{D,G})_{·,ℓ}‖₁ ) ,

where ‖r_{D,G}‖₁ = ‖c_{D,G}‖₁ = n. Then the row and column marginal distributions of D|G are given by (r_{D,G}/n) and (c_{D,G}/n) respectively.

Now observe that by definition, the mutual information I(D|G) depends only on the joint distribution D|G, and not on the individual coordinates of points in D. Additionally, because we can represent D|G using the matrix P = (1/n) A, the joint distribution D|G depends only on the entries of A, and not on the absolute positions of the column and row dividers of G. It follows trivially that for a fixed dataset D of size n and two different k × ℓ grids G and G′, if A_{D,G} = A_{D,G′}, then r_{D,G} = r_{D,G′}, c_{D,G} = c_{D,G′}, and P_{D,G} = P_{D,G′}. Moreover, for a fixed D, if r_{D,G} = r_{D,G′} and c_{D,G} = c_{D,G′}, then A_{D,G} = A_{D,G′} and P_{D,G} = P_{D,G′} (i.e., if D|G and D|G′ have the same marginal row and column distributions, then D|G = D|G′). Together, these two observations imply that for a fixed D and any two grids G and G′, the two distributions D|G and D|G′ are equal if and only if r_{D,G} = r_{D,G′} and c_{D,G} = c_{D,G′}. We summarize these observations and equivalences with the following lemma.

Lemma A.2. Fix a dataset D of n points, and fix any k and ℓ. For any grids G and G′ of size k × ℓ, the following statements are all equivalent:

(i) D|G = D|G′
(ii) A_{D,G} = A_{D,G′}
(iii) P_{D,G} = P_{D,G′}
(iv) r_{D,G} = r_{D,G′} and c_{D,G} = c_{D,G′} .

Proof. Statements (i) and (ii) are equivalent by the definition of a count matrix A, and (iii) is equivalent to (ii) by the definition of P. Statement (ii) also implies (iv) by the definition of the vectors r and c. To complete the proof, we show that (iv) implies (ii). For this, notice that for a fixed D, the entries of A_{D,G} can be computed given the vectors r_{D,G} and c_{D,G}. For example, the entry (A_{D,G})_{1,1} is the number of points in D whose y-coordinate has descending rank at most r_{D,G}(1), and whose x-coordinate has ascending rank at most c_{D,G}(1); the entry (A_{D,G})_{2,1} is the number of points in D whose y-coordinate has descending rank at least r_{D,G}(1) + 1 and at most r_{D,G}(1) + r_{D,G}(2), and whose x-coordinate has ascending rank at most c_{D,G}(1); etc. Thus for a fixed D, the entries of A_{D,G} can be computed as a function of D and the vectors r_{D,G} and c_{D,G}, but without a direct dependence on the positions of the column and row dividers of G. It then follows that for a fixed D and any two grids G and G′ of the same size, if r_{D,G} = r_{D,G′} and c_{D,G} = c_{D,G′}, then A_{D,G} = A_{D,G′}, which completes the proof.
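The (iv) ⟹ (ii) direction of Lemma A.2 can be made concrete: given only the rank vectors r and c, the count matrix is recoverable without the divider positions. A small sketch (our own helper, assuming distinct coordinates):

```python
import numpy as np

def counts_from_marginals(xs, ys, r, c):
    # Recover A_{D,G} from the marginal vectors alone: rows take r[0],
    # r[1], ... points by descending y-rank, and columns take c[0],
    # c[1], ... points by ascending x-rank. The divider positions of G
    # are never used.
    x_rank = np.argsort(np.argsort(xs))               # ascending x-ranks
    y_rank = np.argsort(np.argsort(-np.asarray(ys)))  # descending y-ranks
    row = np.searchsorted(np.cumsum(r), y_rank, side="right")
    col = np.searchsorted(np.cumsum(c), x_rank, side="right")
    A = np.zeros((len(r), len(c)), dtype=int)
    np.add.at(A, (row, col), 1)
    return A

# Example: n=6 points with row sums (2, 4) and mass-equipartitioned
# columns (3, 3), mirroring the vectors r_{D,G} and c_{D,G}.
rng = np.random.default_rng(2)
xs, ys = rng.random(6), rng.random(6)
print(counts_from_marginals(xs, ys, r=[2, 4], c=[3, 3]))
```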
Observe that for a fixed n, k, and ℓ, for every dataset D of size n, and for every G ∈ E(D, c, k, [ℓ]), the vectors c_{D,G} ∈ Z^ℓ are all identical. This follows from the fact that for any D of size n, the count matrices A_{D,G} for any G ∈ E(D, c, k, [ℓ]) have column sums that depend only on n and ℓ, and not on D. Let c_{n,[ℓ]} ∈ Z^ℓ denote this unique vector. Additionally, observe that there are only finitely many vectors r ∈ Z^k satisfying ‖r‖₁ = n. Let R(n, k) denote the set of all such vectors, where T(n, k) = |R(n, k)|, and fix some enumeration R(n, k) = {r1, r2, . . . , r_{T(n,k)}}. So for a fixed dataset D of size n,

∀ G ∈ E(D, c, k, [ℓ]), ∃ a unique i ∈ [T(n, k)] such that r_{D,G} = r_i ,  (3)

which follows directly from the definition of r_{D,G}. So for any i ∈ [T(n, k)], for any two k × ℓ grids G and G′ such that r_{D,G} = r_{D,G′} = r_i, it follows by Lemma A.2 that D|G = D|G′ (since we also have c_{D,G} = c_{D,G′} = c_{n,[ℓ]}), and thus I(D|G) = I(D|G′). In the other direction, it is easy to see that

∀ i ∈ [T(n, k)], ∃ G ∈ E(D, c, k, [ℓ]) such that r_{D,G} = r_i ,  (4)

since for any i, such a grid G can be constructed by sorting the points of D by their y-coordinate and drawing a row divider between certain numbers of consecutive points as specified by r_i.

For a fixed dataset D of size n, and for any r_i ∈ R(n, k), let P(D, r_i, c_{n,[ℓ]}) ∈ R^{k×ℓ} be the probability matrix derived from the vectors r_i and c_{n,[ℓ]} (i.e., as described in the proof of Lemma A.2). Then by (4), for every i ∈ [T(n, k)], there exists some G ∈ E(D, c, k, [ℓ]) such that P(D, r_i, c_{n,[ℓ]}) = P_{D,G}, and thus I(P(D, r_i, c_{n,[ℓ]})) = I(D|G). Finally, by combining the preceding arguments, we state the following reformulation of entries (M^{E(D,c)}_D)_{k,ℓ}.

Lemma A.3. For any fixed dataset D of size n, and for any k ≤ ℓ:

(M^{E(D,c)}_D)_{k,ℓ} = max_{G ∈ E(D,c,k,[ℓ])} I(D|G) / log₂ k = max_{i ≤ T(n,k)} I(P(D, r_i, c_{n,[ℓ]})) / log₂ k .  (5)

Crucially, for any dataset D of size n and a fixed k ≤ ℓ, the constraint set in the new formulation of Lemma A.3 depends only on n and k, and not on D. This yields the following useful observation, which we state as a corollary:

Corollary A.4. For any n, for any 2 ≤ k ≤ ℓ, and for any two datasets D, D′, both of size n:

| (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | ≤ max_{i ≤ T(n,k)} | I(P(D, r_i, c_{n,[ℓ]})) − I(P(D′, r_i, c_{n,[ℓ]})) | .  (6)

Proof. By Lemma A.3, we have

| (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | = | max_{i ≤ T(n,k)} I(P(D, r_i, c_{n,[ℓ]})) / log₂ k − max_{i ≤ T(n,k)} I(P(D′, r_i, c_{n,[ℓ]})) / log₂ k |
≤ | max_{i ≤ T(n,k)} I(P(D, r_i, c_{n,[ℓ]})) − max_{i ≤ T(n,k)} I(P(D′, r_i, c_{n,[ℓ]})) | ,

where the inequality follows from factoring out the common log₂ k term and noting that log₂ k ≥ log₂ 2 = 1 by assumption. Given that the constraint sets in each maximization term are equal, we then have

| max_{i ≤ T(n,k)} I(P(D, r_i, c_{n,[ℓ]})) − max_{i ≤ T(n,k)} I(P(D′, r_i, c_{n,[ℓ]})) | ≤ max_{i ≤ T(n,k)} | I(P(D, r_i, c_{n,[ℓ]})) − I(P(D′, r_i, c_{n,[ℓ]})) | ,

which completes the proof.

A.2.2. SENSITIVITY OF MICe VIA SENSITIVITY OF MUTUAL INFORMATION

Using the new formulation from Lemma A.3, we will derive an upper bound on the sensitivity of MICe by deriving an upper bound on the sensitivity of mutual information with respect to a fixed i ≤ T(n, k). For this, note that for any pair of neighboring datasets D ∼ D′ of size n, and for any B := B(n) and c > 0:

| MICe(D, B, c) − MICe(D′, B, c) | = | max_{k,ℓ: kℓ ≤ B} (M^{E(D,c)}_D)_{k,ℓ} − max_{k,ℓ: kℓ ≤ B} (M^{E(D′,c)}_{D′})_{k,ℓ} |
≤ max_{k,ℓ: kℓ ≤ B} | (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | .  (7)

Now by Corollary A.4, we know (assuming wlog that k ≤ ℓ) that

| (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | ≤ max_{i ≤ T(n,k)} | I(P(D, r_i, c_{n,[ℓ]})) − I(P(D′, r_i, c_{n,[ℓ]})) | .  (8)

We further claim the following bound:

Lemma A.5.
A.2.2. SENSITIVITY OF MICe VIA SENSITIVITY OF MUTUAL INFORMATION

Using the new formulation from Lemma A.3, we will derive an upper bound on the sensitivity of MICe by deriving an upper bound on the sensitivity of mutual information with respect to a fixed i ∈ [T(n, k)]. For this, note that for any pair of neighboring datasets D ~ D′ of size n, and for any B := B(n) and c > 0:

| MICe(D, B, c) − MICe(D′, B, c) | = | max_{k,ℓ: kℓ ≤ B} (M^{E(D,c)}_D)_{k,ℓ} − max_{k,ℓ: kℓ ≤ B} (M^{E(D′,c)}_{D′})_{k,ℓ} | ≤ max_{k,ℓ: kℓ ≤ B} | (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | .  (7)

Now by Corollary A.4, we know (assuming wlog that k ≤ ℓ) that

| (M^{E(D,c)}_D)_{k,ℓ} − (M^{E(D′,c)}_{D′})_{k,ℓ} | ≤ max_{i ∈ [T(n,k)]} | I(P(D, rᵢ, c_{n,[ℓ]})) − I(P(D′, rᵢ, c_{n,[ℓ]})) | .  (8)

We further claim the following bound:

Lemma A.5. For any n ≥ 6, any 2 ≤ k ≤ ℓ, and any neighboring datasets D ~ D′, both of size n:

max_{i ∈ [T(n,k)]} | I(P(D, rᵢ, c_{n,[ℓ]})) − I(P(D′, rᵢ, c_{n,[ℓ]})) | ≤ kℓ ( (2 log₂ n)/n + 4.8/n ) .

Let us grant Lemma A.5 as true for now. The proof of Theorem 3.1 then follows easily:

Proof (of Theorem 3.1). Applying Lemma A.5 to (8) and substituting into (7) gives us

| MICe(D, B, c) − MICe(D′, B, c) | ≤ max_{k,ℓ: kℓ ≤ B} kℓ ( (2 log₂ n)/n + 4.8/n ) ,

which holds for any D ~ D′ of size n. Observing that kℓ ≤ B completes the proof.

It remains to prove Lemma A.5, which we proceed to do in the following subsection.

A.2.3. PROOF OF LEMMA A.5

Fix any n ≥ 6 and any 2 ≤ k ≤ ℓ, and consider any fixed pair of neighboring datasets D ~ D′, both of size n. Fix any rᵢ ∈ R(n, k), and let

P := P(D, rᵢ, c_{n,[ℓ]}) with entries p(i,j), row sums p(i,•), and column sums p(•,j) ,
P′ := P(D′, rᵢ, c_{n,[ℓ]}) with entries p′(i,j), row sums p′(i,•), and column sums p′(•,j) .

Let A and A′ denote the corresponding count matrices for P and P′, respectively. Because P and P′ are non-negative matrices representing two-dimensional, discrete joint distributions, we define information-theoretic properties of P and P′ in the natural way. Specifically, let

I(P) = Hy(P) − Hy|x(P) , where
Hy(P) = − Σ_{i ∈ [k]} p(i,•) log₂ p(i,•) and Hy|x(P) = − Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} p(i,j) log₂ ( p(i,j) / p(•,j) ) .

The analogous definitions for P′ are stated in terms of p′(i,j), p′(i,•), and p′(•,j).

Our goal is to give an upper bound on |I(P) − I(P′)|, which we can write as

|I(P) − I(P′)| = | Hy(P) − Hy|x(P) − Hy(P′) + Hy|x(P′) | ≤ | Hy(P) − Hy(P′) | + | Hy|x(P′) − Hy|x(P) | .

Because P and P′ are both generated using the same row vector rᵢ, we have p(i,•) = p′(i,•) for all i ∈ [k], and thus Hy(P) = Hy(P′). Moreover, because P and P′ are both generated using c_{n,[ℓ]}, it follows by definition that p(•,j) = p′(•,j) for all j ∈ [ℓ]. Then

|I(P) − I(P′)| = | Hy|x(P′) − Hy|x(P) |
= | Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} p′(i,j) log₂ ( p′(•,j)/p′(i,j) ) − Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} p(i,j) log₂ ( p(•,j)/p(i,j) ) |
≤ Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} | p′(i,j) log₂ p′(•,j) − p′(i,j) log₂ p′(i,j) − p(i,j) log₂ p(•,j) + p(i,j) log₂ p(i,j) |
= Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} | (p′(i,j) − p(i,j)) log₂ p(•,j) + p(i,j) log₂ p(i,j) − p′(i,j) log₂ p′(i,j) | .  (10)

Here, the third line is due to the triangle inequality, and the final line is due to p(•,j) = p′(•,j) for all j. For any (i,j) ∈ [k] × [ℓ], let

α(i,j) = | (p′(i,j) − p(i,j)) log₂ p(•,j) + p(i,j) log₂ p(i,j) − p′(i,j) log₂ p′(i,j) | .

The following lemma gives a uniform upper bound on α(i,j) for any (i,j):

Lemma A.6. For any n ≥ 6, any 2 ≤ k ≤ ℓ, and any P and P′ as described above,

α(i,j) ≤ (2 log₂ n)/n + 4.8/n for any (i,j) ∈ [k] × [ℓ].

Granting this lemma true for now and applying it to expression (10) then implies

| I(P) − I(P′) | ≤ Σ_{i,j} α(i,j) ≤ kℓ ( (2 log₂ n)/n + 4.8/n ) .

Because we have considered an arbitrary vector rᵢ ∈ R(n, k), the statement of Lemma A.5 follows. Thus it only remains to prove Lemma A.6.
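Before giving the proof, the bound of Lemma A.6 is easy to sanity-check numerically. The sketch below brute-forces α over all feasible cell masses (counts differing by at most 2, as established at the start of the proof below) and column-marginal masses m ≥ max{p, p′}; it is a verification aid under our reconstruction of the constants, not part of the proof:

```python
import numpy as np

def alpha(p, pp, m):
    """alpha(i,j) from (10), with the convention 0 * log2(0) = 0."""
    t = lambda q: q * np.log2(q) if q > 0 else 0.0
    return abs((pp - p) * np.log2(m) + t(p) - t(pp))

def check_lemma_A6(n):
    """Max of alpha over feasible (p, p', m) vs the Lemma A.6 bound."""
    bound = (2 * np.log2(n) + 4.8) / n
    worst = 0.0
    for a in range(n + 1):                        # a = n * p(i,j)
        for b in (a - 2, a - 1, a, a + 1, a + 2):  # counts differ by <= 2
            if not 0 <= b <= n:
                continue
            for mc in range(max(a, b, 1), n + 1):  # column marginal count
                worst = max(worst, alpha(a / n, b / n, mc / n))
    return worst, bound

print(check_lemma_A6(100))   # worst observed alpha, then the proven bound
```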
Proof (of Lemma A.6). For a fixed grid G ∈ E(D, c, k, [ℓ]), let φₓ and φ′ₓ denote the x-coordinate point-mapping functions wrt G for D and D′ respectively (i.e., for d ∈ D, the function φₓ(d) returns the column partition index of d wrt D|G). Then observe that for any (i,j) ∈ [k] × [ℓ], we have

| p(i,j) − p′(i,j) | ≤ 2/n .

This is because for every column j ∈ [ℓ], there is at most one point d ∈ D where φₓ(d) = j but φ′ₓ(d) ∈ {j−1, j+1} (equivalently, at most one point d′ ∈ D′ where φ′ₓ(d′) ∈ {j−1, j+1} but φₓ(d′) = j), and the same argument holds for every row i ∈ [k]. Thus A and A′ differ by at most 2 at any entry (i,j), and so |p(i,j) − p′(i,j)| ≤ 2/n for all (i,j).

Using this fact, we will derive an upper bound on α(i,j) for any fixed (i,j) via three cases: (1) when p(i,j) = p′(i,j), (2) when p(i,j) > p′(i,j), and (3) when p(i,j) < p′(i,j). For ease of readability, when (i,j) is fixed and clear from context, we will write p := p(i,j) and p′ := p′(i,j). Thus for fixed (i,j) we seek to bound

α = | (p′ − p) log₂ p(•,j) + p log₂ p − p′ log₂ p′ | .

We have the following cases:

1. Case when p = p′: We trivially have that α = 0.

2. Case when p > p′:

(a) Case when p − p′ = 1/n: Note that p − p′ = 1/n implies p(•,j) ≥ p ≥ 1/n. If p = 1/n (and p′ = 0), then

α = | −(1/n) log₂ p(•,j) + (1/n) log₂(1/n) | = (1/n) | log₂ ( (1/n) / p(•,j) ) | .

Since 1/n ≤ p(•,j) ≤ 1, it follows that α ≤ (log₂ n)/n. On the other hand, say p ≥ 2/n. Then

α = | −(1/n) log₂ p(•,j) + p log₂ p − (p − 1/n) log₂(p − 1/n) |
= | p log₂ ( p/(p − 1/n) ) + (1/n) log₂ ( (p − 1/n)/p(•,j) ) |
≤ p log₂ ( p/(p − 1/n) ) + (1/n) | log₂ ( (p − 1/n)/p(•,j) ) | ,  (11)

where the last line is due to the triangle inequality. Observe that the left hand term in (11) is decreasing in p, and since p ≥ 2/n by assumption:

p log₂ ( p/(p − 1/n) ) ≤ (2/n) log₂ ( (2/n)/(1/n) ) = 2/n .  (12)

For the right hand term in (11), note that p − 1/n = p′ ≤ p′(•,j) = p(•,j) (since the joint mass in any cell (i,j) is at most the marginal mass of that cell's column), so this term is also decreasing in p. Again since p ≥ 2/n by assumption, we have

(1/n) | log₂ ( (p − 1/n)/p(•,j) ) | ≤ (1/n) log₂ ( p(•,j)/(1/n) ) = (1/n) log₂ ( n · p(•,j) ) ≤ (log₂ n)/n ,  (13)

where the final inequality holds because p(•,j) ≤ 1. Substituting (12) and (13) back into (11) gives α ≤ 2/n + (log₂ n)/n when p ≥ 2/n. Thus this bound holds for Case (2a) in general.

(b) Case when p − p′ = 2/n: The condition p − p′ = 2/n implies p(•,j) ≥ p ≥ 2/n. First consider when p = 2/n and p′ = 0. Then

α = | −(2/n) log₂ p(•,j) + (2/n) log₂(2/n) | = (2/n) | log₂ ( (2/n)/p(•,j) ) | ≤ (2/n) log₂ (n/2) ≤ (2 log₂ n)/n ,

where the penultimate inequality follows from the assumption that 2/n ≤ p(•,j) ≤ 1. On the other hand, consider when p ≥ 3/n. Then

α = | −(2/n) log₂ p(•,j) + p log₂ p − (p − 2/n) log₂(p − 2/n) |
= | p log₂ ( p/(p − 2/n) ) + (2/n) log₂ ( (p − 2/n)/p(•,j) ) |
≤ p log₂ ( p/(p − 2/n) ) + (2/n) | log₂ ( (p − 2/n)/p(•,j) ) | .  (14)

As in Case (2a), the left hand term in (14) is decreasing in p ≥ 3/n, meaning

p log₂ ( p/(p − 2/n) ) ≤ (3/n) log₂ ( (3/n)/(1/n) ) = (3/n) log₂ 3 ≤ 4.8/n .  (15)

Since p − 2/n = p′ ≤ p(•,j), the right hand term in (14) is also decreasing in p ≥ 3/n. So

(2/n) | log₂ ( (p − 2/n)/p(•,j) ) | ≤ (2/n) log₂ ( p(•,j)/(1/n) ) = (2/n) log₂ ( n · p(•,j) ) ≤ (2 log₂ n)/n ,  (16)

where the final inequality holds since p(•,j) ≤ 1. Substituting (15) and (16) back into (14) gives α ≤ 4.8/n + (2 log₂ n)/n, which holds in general for Case (2b).

3. Case when p′ > p:

(a) Case when p′ − p = 1/n: This condition implies 0 ≤ p ≤ 1 − 1/n. First consider when p = 0 and p′ = 1/n. Then

α = | (1/n) log₂ p(•,j) − (1/n) log₂(1/n) | = (1/n) log₂ ( n · p(•,j) ) ≤ (log₂ n)/n ,

where the final inequality is due to 1/n = p′ ≤ p(•,j) ≤ 1. Now consider when p ≥ 1/n. Then

α = | (1/n) log₂ p(•,j) + p log₂ p − (p + 1/n) log₂(p + 1/n) |
= | p log₂ ( p/(p + 1/n) ) + (1/n) log₂ ( p(•,j)/(p + 1/n) ) |
≤ p | log₂ ( p/(p + 1/n) ) | + (1/n) | log₂ ( p(•,j)/(p + 1/n) ) | .  (17)

The left hand term of (17) is increasing with p, and by the assumption that p ≤ 1 − 1/n,

p | log₂ ( p/(p + 1/n) ) | ≤ (1 − 1/n) log₂ ( 1/(1 − 1/n) ) ≤ log₂ ( n/(n−1) ) .

Because (n+2)/n ≥ n/(n−1) for all n ≥ 2, we have

p | log₂ ( p/(p + 1/n) ) | ≤ | log₂ (1 + 2/n) | .

Note that because e ≤ 2^{3/2}, we have 1 + x ≤ eˣ ≤ 2^{(3/2)x} for all x ≥ 0. This implies log₂(1 + x) ≤ (3/2)x for all x ≥ 0, and so

p | log₂ ( p/(p + 1/n) ) | ≤ | log₂ (1 + 2/n) | ≤ 3/n .  (18)

The right hand term of (17) is decreasing in p since p + 1/n = p′ ≤ p′(•,j) = p(•,j). Since p ≥ 1/n by assumption,

(1/n) | log₂ ( p(•,j)/(p + 1/n) ) | ≤ (1/n) log₂ ( p(•,j)/(2/n) ) = (1/n) log₂ ( (n/2) · p(•,j) ) ≤ (1/n) log₂ (n/2) ≤ (log₂ n)/n ,  (19)

where the penultimate inequality holds because p(•,j) ≤ 1. Substituting (18) and (19) into (17) gives α ≤ 3/n + (log₂ n)/n, which holds in general for Case (3a).

(b) Case when p′ − p = 2/n: This condition implies 0 ≤ p ≤ 1 − 2/n. When p = 0 and p′ = 2/n, we have

α = | (2/n) log₂ p(•,j) − (2/n) log₂(2/n) | = (2/n) log₂ ( (n/2) · p(•,j) ) ≤ (2/n) log₂ (n/2) ≤ (2 log₂ n)/n ,

where the penultimate inequality is due to 2/n = p′ ≤ p(•,j) ≤ 1. Now consider when p ≥ 1/n.
We have

α = | (2/n) log₂ p(•,j) + p log₂ p − (p + 2/n) log₂(p + 2/n) |
= | p log₂ ( p/(p + 2/n) ) + (2/n) log₂ ( p(•,j)/(p + 2/n) ) |
≤ p | log₂ ( p/(p + 2/n) ) | + (2/n) | log₂ ( p(•,j)/(p + 2/n) ) | .  (20)

Similar to Case (3a), the left hand term of (20) is increasing with p, and by the assumption that p ≤ 1 − 2/n,

p | log₂ ( p/(p + 2/n) ) | ≤ (1 − 2/n) log₂ ( 1/(1 − 2/n) ) ≤ log₂ ( n/(n−2) ) .

Because (n+3)/n ≥ n/(n−2) for all n ≥ 6, we have

p | log₂ ( p/(p + 2/n) ) | ≤ | log₂ (1 + 3/n) | .

Then applying the identity log₂(1 + x) ≤ (3/2)x yields

p | log₂ ( p/(p + 2/n) ) | ≤ | log₂ (1 + 3/n) | ≤ 4.5/n .  (21)

Now the right hand term of (20) is decreasing in p since p + 2/n = p′ ≤ p′(•,j) = p(•,j). So by the assumption p ≥ 1/n,

(2/n) | log₂ ( p(•,j)/(p + 2/n) ) | ≤ (2/n) log₂ ( p(•,j)/(3/n) ) = (2/n) log₂ ( (n/3) · p(•,j) ) ≤ (2/n) log₂ (n/3) ≤ (2 log₂ n)/n ,  (22)

where the penultimate inequality is due to p(•,j) ≤ 1. Now substituting expressions (21) and (22) into (20) gives the bound α ≤ 4.5/n + (2 log₂ n)/n, which holds in general for Case (3b).

To summarize the bounds on α for each of the cases:

- Case when p = p′: α = 0.
- Case (2a) (p > p′, p − p′ = 1/n): α ≤ (log₂ n)/n + 2/n.
- Case (2b) (p > p′, p − p′ = 2/n): α ≤ (2 log₂ n)/n + 4.8/n.
- Case (3a) (p′ > p, p′ − p = 1/n): α ≤ (log₂ n)/n + 3/n.
- Case (3b) (p′ > p, p′ − p = 2/n): α ≤ (2 log₂ n)/n + 4.5/n.

It follows that for any (i,j), α(i,j) ≤ (2 log₂ n)/n + 4.8/n, which completes the proof.

A.3. MICe-Lap Privacy

Here we prove the privacy guarantee of the MICe-Lap mechanism from Theorem 3.2. This requires first stating the post-processing principle, which says that the result of performing any additional computation on the output of an ϵ-DP mechanism is still private.

Theorem A.7 (DP post-processing, Dwork & Roth (2014)). Let A : X → Y be an ϵ-DP mechanism. Then for any function f : Y → Z, the composition (f ∘ A) : X → Z is also ϵ-DP.

The privacy of the MICe-Lap mechanism then follows directly:

Theorem 3.2. MICe-Lap(·, B, c, ϵ) is ϵ-DP.

Proof. This follows by the ϵ-DP privacy of the Laplace mechanism, and by applying the post-processing property from Theorem A.7 with truncation to the [0, 1] interval.
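For concreteness, a minimal sketch of the mechanism shape used by MICe-Lap (and by MICr-Lap below): add Laplace noise scaled to a sensitivity bound, then truncate to [0, 1] as post-processing. The function names are ours, and the sensitivity shown is our reconstruction of the Theorem 4.3 bound, so this is illustrative rather than the paper's exact implementation:

```python
import numpy as np

def laplace_dp_release(value, sensitivity, epsilon, rng=None):
    """Release `value` with the classic Laplace mechanism, then truncate
    to [0, 1]; truncation is post-processing, so epsilon-DP is preserved
    (Theorem A.7). A generic sketch, not the paper's exact code."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(min(max(noisy, 0.0), 1.0))

# e.g., a MICr-Lap-style release using the Theorem 4.3 sensitivity bound:
n, eps = 1000, 1.0
sens = 4 * np.log2(n) / n + 6 / n
# private_mic = laplace_dp_release(micr_value, sens, eps)
# (micr_value = the non-private MICr statistic; hypothetical variable)
```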
B. MICr Details

In this section, we give more details on the MICr statistic and develop the proofs of Theorem 4.2 (MICr consistency) and Theorem 4.3 (MICr sensitivity). Some of the notation and definitions were introduced previously in Sections 2 and 4, but we restate them here for completeness and readability.

B.1. Computing MICr

B.1.1. THE SET OF GRIDS R(L, c, k, ℓ)

Recall for an interval I that R_{I,ℓ} is a size-ℓ range equipartition of I, and P(I, k, [cℓ]) is the set of all size-k subpartitions of a size-(cℓ) range equipartition of I. Then for fixed range bounds L = Lₓ × Lᵧ and any finite c > 0:

- for any 2 ≤ k ≤ ℓ, the set R(L, c, k, [ℓ]) contains all grids G = (P, Q) where Q = R_{Lₓ,ℓ} and P ∈ P(Lᵧ, k, [cℓ]);
- for any 2 ≤ ℓ < k, the set R(L, c, [k], ℓ) contains all grids G = (P, Q) where Q ∈ P(Lₓ, ℓ, [ck]) and P = R_{Lᵧ,k}.

For any L and c, and any k, ℓ ≥ 2, we overload the definition of R and write

R(L, c, k, ℓ) = R(L, c, k, [ℓ]) if k ≤ ℓ, and R(L, c, [k], ℓ) if k > ℓ.  (23)

B.1.2. COMPUTING MICr ON FINITE DATASETS

Recall now that for a fixed L and c, and for any dataset D of size n with range restricted to L, M^{R(L,c)}_D denotes the sample range-equicharacteristic matrix for that dataset. For any k, ℓ ≥ 2, its entries are given by

(M^{R(L,c)}_D)_{k,ℓ} = max_{G ∈ R(L,c,k,ℓ)} I(D|G) / log₂ min{k, ℓ} .  (24)

Then for a maximum grid parameter B := B(n), we define

MICr(D, B, L, c) = max_{k,ℓ: kℓ ≤ B} (M^{R(L,c)}_D)_{k,ℓ} .  (25)

Note that because MICr and MICe differ only in the definition of the respective master grids, but not in their size, MICr can be computed in time asymptotically equivalent to MICe, again using the OPTIMIZEAXIS routine described in Appendix A.1. Stated formally (and analogous to Theorem A.1):

Theorem B.1. For any dataset D of size n restricted to range L, B := B(n), and c > 0, MICr(D, L, B, c) can be computed using OPTIMIZEAXIS in O(c²B⁴) time.

B.1.3. COMPUTING MICr ON JOINT DISTRIBUTIONS

Because the set R(L, c, k, ℓ) depends only on L and c, and not on the dataset D, we can define a range-equicharacteristic matrix for joint distributions analogously to (24). Again for a fixed L and c, and for any joint distribution (X, Y) with range restricted to L, let M^{R(L,c)}_{(X,Y)} denote the range-equicharacteristic matrix for that distribution. For any k, ℓ ≥ 2, its entries are given by

(M^{R(L,c)}_{(X,Y)})_{k,ℓ} = max_{G ∈ R(L,c,k,ℓ)} I((X,Y)|G) / log₂ min{k, ℓ} .  (26)

B.1.4. THE UNCONSTRAINED MICr STATISTIC

For theoretical purposes, we also define the following unconstrained variant of the MICr statistic with equicharacteristic matrix M^{R(L,∞)}_D. Compared to R(L, c, k, ℓ), the set R(L, ∞, k, ℓ) (wlog when k ≤ ℓ) contains grids G with any row partition of Lᵧ = [y₀, y₁] of size k (i.e., not just those restricted to subpartitions of some finite-size master range equipartition). Thus R(L, ∞, k, ℓ) is exactly the set of grids obtained from R(L, c, k, ℓ) when taking c → ∞. More formally, for an interval I, let P(I, k) denote the set of all partitions of I of size k, and recall that R_{I,ℓ} is a range equipartition of I of size ℓ. Then for a fixed L = Lₓ × Lᵧ:

- for 2 ≤ k ≤ ℓ, the set R(L, ∞, k, [ℓ]) contains all grids G = (P, Q) where Q = R_{Lₓ,ℓ} and P ∈ P(Lᵧ, k);
- for any 2 ≤ ℓ < k, the set R(L, ∞, [k], ℓ) contains all grids G = (P, Q) where Q ∈ P(Lₓ, ℓ) and P = R_{Lᵧ,k}.

The set R(L, ∞, k, ℓ) is then defined analogously to R(L, c, k, ℓ) from (23). For any dataset D of size n restricted to the range L = Lₓ × Lᵧ, let M^{R(L,∞)}_D denote its unconstrained range-equicharacteristic matrix. For k, ℓ ≥ 2, its entries are given by

(M^{R(L,∞)}_D)_{k,ℓ} = max_{G ∈ R(L,∞,k,ℓ)} I(D|G) / log₂ min{k, ℓ} .  (27)

Then for a maximum grid size parameter B = B(n), define

MICr(D, L, B, ∞) = max_{k,ℓ: kℓ ≤ B} (M^{R(L,∞)}_D)_{k,ℓ} .

For a jointly-distributed pair of random variables Π = (X, Y) with range restricted to L, we similarly define the matrix M^{R(L,∞)}_Π with entries

(M^{R(L,∞)}_Π)_{k,ℓ} = max_{G ∈ R(L,∞,k,ℓ)} I(Π|G) / log₂ min{k, ℓ} .  (28)
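The following sketch shows the two basic ingredients behind (24) and (25): a count matrix on a data-independent range-equipartition master grid, and the plug-in mutual information of a count matrix. The subpartition optimization performed by OPTIMIZEAXIS is omitted, and all names are ours:

```python
import numpy as np

def range_equipartition_counts(xs, ys, k, l, L):
    """Count matrix of D on the k x l grid whose rows/columns divide the
    fixed range L = ((x0, x1), (y0, y1)) into equal-length intervals.
    Unlike a mass equipartition, the cell boundaries do not depend on D."""
    (x0, x1), (y0, y1) = L
    j = np.clip(((np.asarray(xs) - x0) / (x1 - x0) * l).astype(int), 0, l - 1)
    i = np.clip(((np.asarray(ys) - y0) / (y1 - y0) * k).astype(int), 0, k - 1)
    A = np.zeros((k, l), dtype=int)
    np.add.at(A, (i, j), 1)
    return A

def mutual_information(A):
    """Plug-in mutual information I(P) in bits for P = A / A.sum()."""
    P = A / A.sum()
    px, py = P.sum(axis=0, keepdims=True), P.sum(axis=1, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / (py @ px)[mask])).sum())
```

Because the cell boundaries are fixed by L rather than by the data, changing one data point can only move a single count from one cell to another, which is the structural fact behind the sensitivity bound of Lemma B.9 below.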
B.2. MICr Consistency

Using the notation introduced in the previous section, we now develop the proof of Theorem 4.2, which shows that the MICr(·, L, B, c) statistic is a consistent estimator of MIC*. Our first step is to show that the unconstrained MICr(·, L, B, ∞) statistic is a consistent estimator of MIC*. Once this step is established, we leverage the relationship between entries of M^{R(L,c)} and M^{R(L,∞)} to prove the consistency of MICr(·, L, B, c).

B.2.1. CONSISTENCY OF THE MICr(·, L, B, ∞) ESTIMATOR

We will prove the following theorem:

Theorem B.2. Let Π = (X, Y) be a jointly-distributed pair of random variables with range bounded by L, and let Dₙ be a dataset of n points sampled i.i.d. from Π. For every 0 < α < 0.5, and for every ω(1) ≤ B(n) = O(n^α), MICr(Dₙ, L, B(n), ∞) → MIC*(Π) in probability as n → ∞.

The proof of this theorem uses the following two lemmas, which use arguments adapted from Reshef et al. (2016) and Lazarsfeld & Johnson (2021), respectively. The first lemma shows that, in the distributional setting, the supremum of the range-equicharacteristic matrix M^{R(L,∞)}_Π and the standard characteristic matrix M^G_Π are equal, the latter of which is equal to MIC*(Π) by definition.

Lemma B.3. Let Π = (X, Y) be a pair of jointly-distributed random variables with range bounded by L = Lₓ × Lᵧ. Then

sup M^{R(L,∞)}_Π = sup M^G_Π = MIC*(Π) .

Proof. The statement follows by an argument identical to the proof of Theorem 21 from Reshef et al. (2016). In particular, the mutual information I(X, Y) is the supremum of I(X|Q, Y) over all finite partitions Q. By definition, the (k, ℓ)-th entry of M^{R(L,∞)}_Π uses the column partition R_{Lₓ,ℓ}, which is a range equipartition of Lₓ of size ℓ, to partition X. Thus taking ℓ → ∞, I(X|R_{Lₓ,ℓ}, Y) → I(X, Y). The remainder of the argument in the original proof carries over unchanged.

The next lemma shows that, when n is sufficiently large, for a dataset Dₙ of n points sampled i.i.d. from Π, corresponding entries (k, ℓ) of M^{R(L,∞)}_{Dₙ} and M^{R(L,∞)}_Π have similar values with high probability.

Lemma B.4. Let Π = (X, Y) be a pair of jointly-distributed random variables with range bounded by L, and let Dₙ be a dataset of n points sampled i.i.d. from Π. For all n, and for every 0 < α < 0.5, there exists a constant u > 0 such that

| (M^{R(L,∞)}_{Dₙ})_{k,ℓ} − (M^{R(L,∞)}_Π)_{k,ℓ} | ≤ O(n^{−u})

simultaneously for every kℓ ≤ B(n) = O(n^α) with probability at least 1 − O(n^{−1.5}).

Proof. Because the (k, ℓ) entries of M^{R(L,∞)}_Π and M^{R(L,∞)}_{Dₙ} are both maximizations over the same set of grids R(L, ∞, k, ℓ), the proof follows identically to that of Theorem 5 of Lazarsfeld & Johnson (2021).

The proof of Theorem B.2 then follows.

Proof (of Theorem B.2). The proof is identical to that of Theorem 6 from Reshef et al. (2016). The exact argument goes through so long as (1) sup M^{R(L,∞)}_Π = sup M^G_Π, and (2) the (k, ℓ)-th entry of M^{R(L,∞)}_{Dₙ} converges in probability to the (k, ℓ)-th entry of M^{R(L,∞)}_Π as n → ∞ for all kℓ ≤ B(n) = O(n^α). These conditions follow from Lemmas B.3 and B.4, respectively.

B.2.2. CONSISTENCY OF THE MICr(·, L, B, c) ESTIMATOR

We now use the consistency of MICr(·, L, B, ∞) to prove the consistency of MICr(·, L, B, c) for finite c. This was stated informally in Theorem 4.2 and is presented in more detail here.

Theorem B.5. Let Π = (X, Y) be a joint distribution with range bounded by L, and let Dₙ be a dataset of n points sampled i.i.d. from Π. For every finite c > 0, α ∈ (0, 0.5), and ω(1) ≤ B(n) = O(n^α), the statistic MICr(Dₙ, L, B, c) → MIC*(Π) in probability as n → ∞.

To prove this theorem, we use (in addition to Theorem B.2) the following two analogs of Lemmas B.3 and B.4.

Lemma B.6. Let Π = (X, Y) be a pair of jointly-distributed random variables with range bounded by L = Lₓ × Lᵧ. Then for finite c > 0:

sup M^{R(L,c)}_Π = sup M^G_Π = MIC*(Π) .

Proof. The proof follows identically to that of Proposition 8 from Reshef et al. (2016), which gives an analogous argument for the MICe(Π, c) case. The key observation is that, wlog for k ≤ ℓ, the (k, ℓ)-th entry of M^{R(L,c)}_Π is a maximization over row partitions P ∈ P(Lᵧ, k, [cℓ]), which recall is the set of all size-k subpartitions of a size-(cℓ) master range equipartition of Lᵧ. So as ℓ → ∞, the set P(Lᵧ, k, [cℓ]) approaches P(Lᵧ, k). The remainder of the original argument then follows identically.
Lemma B.7. Let Π = (X, Y) be a pair of jointly-distributed random variables with range bounded by L, fix a finite c > 0, and let Dₙ be a dataset of n points sampled i.i.d. from Π. For all n, and for every 0 < α < 0.5, there exists a constant u > 0 such that

| (M^{R(L,c)}_{Dₙ})_{k,ℓ} − (M^{R(L,c)}_Π)_{k,ℓ} | ≤ O(n^{−u})

simultaneously for every kℓ ≤ B(n) = O(n^α) with probability at least 1 − O(n^{−1.5}).

Proof. Similar to Lemma B.4, because the (k, ℓ) entries of M^{R(L,c)}_Π and M^{R(L,c)}_{Dₙ} are both maximizations over the same set of grids R(L, c, k, ℓ), the proof follows identically to that of Theorem 5 of Lazarsfeld & Johnson (2021).

In addition to these two lemmas, we state the following inequality relating corresponding entries of M^{R(L,c)}_D and M^{R(L,∞)}_D for a dataset D and finite c.

Lemma B.8. Fix any dataset D with range restricted to L and any finite c > 0. Then for all k, ℓ ≥ 2:

(M^{R(L,c)}_D)_{k,ℓ} ≤ (M^{R(L,∞)}_D)_{k,ℓ} .

Proof. By definition, for any finite c > 0 and any k, ℓ ≥ 2, the set R(L, c, k, ℓ) is a subset of R(L, ∞, k, ℓ). For a fixed dataset D, the (k, ℓ) entries of M^{R(L,c)}_D and M^{R(L,∞)}_D are maximizations of a common function over the sets R(L, c, k, ℓ) and R(L, ∞, k, ℓ), respectively. Increasing the size of the constraint set can never decrease the maximum function value subject to the constraints, so the statement of the lemma follows.

Using these lemmas, we are now ready to prove Theorem B.5, which we do via a more direct adaptation of the proof of Lemma H.4 from Reshef et al. (2016).

Proof (of Theorem B.5). For any 0 < α < 0.5, fix B(n) = O(n^α). Our goal is to show that for a dataset Dₙ of n points sampled i.i.d. from Π, MICr(Dₙ, L, B(n), c) → MIC*(Π) in probability as n → ∞. Recall that by definition MIC*(Π) = sup M^G_Π, so this means that for every ϵ > 0 and every 0 < p ≤ 1, we must show that there exists some N such that

Pr [ | max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} − sup M^G_Π | < ϵ ] > 1 − p  (30)

for all n ≥ N. Fix ϵ and p. By Lemma B.6, we know sup M^{R(L,c)}_Π = sup M^G_Π. So by definition, there must exist an entry (k, ℓ) such that

| (M^{R(L,c)}_Π)_{k,ℓ} − sup M^G_Π | < ϵ/2 .  (31)

Denote such an entry by (k_ϵ, ℓ_ϵ). Now by Lemma B.7, there exists some N_ϵ such that, for all n ≥ N_ϵ:

Pr [ | (M^{R(L,c)}_{Dₙ})_{k_ϵ,ℓ_ϵ} − (M^{R(L,c)}_Π)_{k_ϵ,ℓ_ϵ} | < ϵ/2 ] > 1 − p .  (32)

So with probability greater than 1 − p, for all n ≥ N_ϵ,

| (M^{R(L,c)}_{Dₙ})_{k_ϵ,ℓ_ϵ} − sup M^G_Π | ≤ | (M^{R(L,c)}_{Dₙ})_{k_ϵ,ℓ_ϵ} − (M^{R(L,c)}_Π)_{k_ϵ,ℓ_ϵ} | + | (M^{R(L,c)}_Π)_{k_ϵ,ℓ_ϵ} − sup M^G_Π | < ϵ/2 + ϵ/2 = ϵ .  (33)

Here, the first line is due to the triangle inequality, and the second is due to the bounds from (31) and (32). Now let N_α be some integer such that B(N_α) ≥ k_ϵ ℓ_ϵ. Since B(n) = O(n^α) is increasing in n, this implies that k_ϵ ℓ_ϵ ≤ B(N_α) ≤ B(n) for all n ≥ N_α, which means

(M^{R(L,c)}_{Dₙ})_{k_ϵ,ℓ_ϵ} ≤ max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} .  (34)

Then for all n ≥ max{N_ϵ, N_α}, it follows that

max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} ≥ (M^{R(L,c)}_{Dₙ})_{k_ϵ,ℓ_ϵ} > sup M^G_Π − ϵ  (35)

with probability greater than 1 − p, where the first inequality holds for n ≥ N_α by (34), and the second inequality holds for n ≥ N_ϵ by (33). This implies one direction of our goal from (30).

For the other direction, consider any k, ℓ ≥ 2. By Lemma B.8, we have for any n and fixed Dₙ that (M^{R(L,c)}_{Dₙ})_{k,ℓ} ≤ (M^{R(L,∞)}_{Dₙ})_{k,ℓ}. This implies that

max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} ≤ max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,∞)}_{Dₙ})_{k,ℓ} .  (36)

Now by Theorem B.2, we know that max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,∞)}_{Dₙ})_{k,ℓ} → sup M^G_Π in probability as n → ∞. So there must exist some N′ such that for all n ≥ N′:

max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,∞)}_{Dₙ})_{k,ℓ} < sup M^G_Π + ϵ  (37)

with probability greater than 1 − p.
Then combining (36) and (37) says that for all n ≥ N′, with probability greater than 1 − p:

max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} ≤ max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,∞)}_{Dₙ})_{k,ℓ} < sup M^G_Π + ϵ .  (38)

This implies the second direction of our goal from (30). Setting N ≥ max{N_ϵ, N_α, N′} ensures that expression (30) holds for all n ≥ N, which completes the proof.

B.3. MICr Sensitivity

In this section, we develop the proof of Theorem 4.3, which gives an upper bound on the sensitivity of the MICr statistic. The high-level strategy is to reduce the sensitivity of the statistic to the sensitivity of mutual information with respect to a fixed grid, similar to the arguments developed in Section A.2.2 for the bound on the MICe sensitivity.

For a fixed dataset D of n points, a set of range limits L, a constant c > 0, and a maximum grid parameter B := B(n), recall that

MICr(D, B, L, c) = max_{k,ℓ: kℓ ≤ B} (M^{R(L,c)}_D)_{k,ℓ} ,  (39)

where M^{R(L,c)}_D is the range-equicharacteristic matrix for D. Now for any pair of neighboring datasets D ~ D′ of size n and fixed B := B(n), L, and c:

| MICr(D, B, L, c) − MICr(D′, B, L, c) | = | max_{k,ℓ: kℓ ≤ B} (M^{R(L,c)}_D)_{k,ℓ} − max_{k,ℓ: kℓ ≤ B} (M^{R(L,c)}_{D′})_{k,ℓ} | ≤ max_{k,ℓ: kℓ ≤ B} | (M^{R(L,c)}_D)_{k,ℓ} − (M^{R(L,c)}_{D′})_{k,ℓ} | .  (40)

As before, we will assume wlog that k ≤ ℓ, as the corresponding arguments for the k > ℓ case follow by symmetry. So by the definition from (23), for any fixed 2 ≤ k ≤ ℓ, observe that

| (M^{R(L,c)}_D)_{k,ℓ} − (M^{R(L,c)}_{D′})_{k,ℓ} | = | max_{G ∈ R(L,c,k,[ℓ])} I(D|G) / log₂ k − max_{G ∈ R(L,c,k,[ℓ])} I(D′|G) / log₂ k |
≤ | max_{G ∈ R(L,c,k,[ℓ])} I(D|G) − max_{G ∈ R(L,c,k,[ℓ])} I(D′|G) |
≤ max_{G ∈ R(L,c,k,[ℓ])} | I(D|G) − I(D′|G) | .  (41)

Here, the first inequality is due to factoring out the log₂ k term and noting that log₂ k ≥ 1 when k ≥ 2, and the final inequality follows from the fact that the set R(L, c, k, [ℓ]) is the same in both maximization terms. We then claim the following upper bound on (41) that holds for any D ~ D′ of size n:

Lemma B.9. For any n ≥ 4 and any set of range bounds L, consider any pair of neighboring datasets D ~ D′, both of size n, and fix a pair 2 ≤ k ≤ ℓ. Then for any grid G ∈ R(L, c, k, [ℓ]):

| I(D|G) − I(D′|G) | ≤ (4 log₂ n)/n + 6/n .

Observe that this lemma holds for any G ∈ R(L, c, k, [ℓ]), and thus it holds for the maximizing grid in (41). Granting the lemma true for now then implies the upper bound on the MICr statistic (i.e., the proof of Theorem 4.3):

Proof (of Theorem 4.3). Apply Lemma B.9 to (41). Substituting back into (40) implies that for any D ~ D′ of size n ≥ 4 and any B:

| MICr(D, B, L, c) − MICr(D′, B, L, c) | ≤ max_{k,ℓ: kℓ ≤ B} ( (4 log₂ n)/n + 6/n ) = (4 log₂ n)/n + 6/n .

The final equality follows by observing that the term from Lemma B.9 does not depend on k or ℓ.

It now remains to prove Lemma B.9.
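The gap between this bound and the MICe bound of Theorem 3.1, both asymptotically and concretely, is easy to see numerically. The sketch below evaluates both bounds (with the constants as reconstructed in this appendix) for a few settings of n and B:

```python
import numpy as np

def mice_bound(n, B):
    """Theorem 3.1 (MICe): grows linearly in the maximum grid size B."""
    return B * (2 * np.log2(n) + 4.8) / n

def micr_bound(n):
    """Theorem 4.3 (MICr): independent of B."""
    return (4 * np.log2(n) + 6) / n

for n in (250, 1000, 10000):
    B = round(n ** 0.4)   # e.g., B(n) = n^0.4 with alpha = 0.4 < 0.5
    print(f"n={n:>6}  B={B:>3}  MICe: {mice_bound(n, B):.4f}  "
          f"MICr: {micr_bound(n):.4f}")
```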
B.3.1. PROOF OF LEMMA B.9

The proof follows similarly to that of Lemma A.5. Fix any n ≥ 4, any 2 ≤ k ≤ ℓ, and any set of range boundaries L, and consider any pair of neighboring datasets D ~ D′, both of size n. Now consider any grid G ∈ R(L, c, k, [ℓ]). Let

P := P_{D,G} with entries p(i,j), row sums p(i,•), and column sums p(•,j) ,
P′ := P_{D′,G} with entries p′(i,j), row sums p′(i,•), and column sums p′(•,j) ,

and let A and A′ denote the corresponding count matrices for P and P′ respectively. Because P and P′ are non-negative matrices representing two-dimensional, discrete joint distributions, we define the mutual information of P and P′ in the natural way. Specifically, let

I(P) = Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} p(i,j) log₂ ( p(i,j) / (p(i,•) p(•,j)) ) .

Then to upper bound |I(P) − I(P′)|, we have

| I(P) − I(P′) | ≤ Σ_{i ∈ [k]} Σ_{j ∈ [ℓ]} | p(i,j) log₂ ( p(i,j)/(p(i,•)p(•,j)) ) − p′(i,j) log₂ ( p′(i,j)/(p′(i,•)p′(•,j)) ) | ,  (42)

which follows from the triangle inequality. For any (i,j) ∈ [k] × [ℓ], let

γ(i,j) = | p(i,j) log₂ ( p(i,j)/(p(i,•)p(•,j)) ) − p′(i,j) log₂ ( p′(i,j)/(p′(i,•)p′(•,j)) ) | .

Now because G is fixed with respect to both D and D′, observe that there are at most two entries (i,j) such that p(i,j) ≠ p′(i,j), and thus at most two pairs (i,j) such that γ(i,j) > 0. Specifically, for any D ~ D′, there are only two scenarios regarding how the entries of P and P′ differ:

1. Scenario A: there is at most one pair of points d ∈ D and d′ ∈ D′ with nonequal coordinates, but φ(d, G) = φ(d′, G). Then for all (i,j) ∈ [k] × [ℓ], p(i,j) = p′(i,j) and thus γ(i,j) = 0. By (42), this means |I(P) − I(P′)| = 0.

2. Scenario B: there is at most one pair of points d ∈ D and d′ ∈ D′ with nonequal coordinates, and φ(d, G) ≠ φ(d′, G). Then there is exactly one (i,j) ∈ [k] × [ℓ] such that p′(i,j) = p(i,j) − 1/n and thus γ(i,j) ≥ 0. For this unique (i,j), we denote γ(i,j) by γ⁻. There is also exactly one (i,j) ∈ [k] × [ℓ] such that p′(i,j) = p(i,j) + 1/n and thus γ(i,j) ≥ 0. For this unique (i,j), we similarly denote γ(i,j) by γ⁺. By (42), this means that

|I(P) − I(P′)| ≤ γ⁺ + γ⁻ .  (43)

Intuitively, γ⁻ corresponds to the entry (i,j) that loses a point in A′ compared to the corresponding entry in A. Similarly, γ⁺ corresponds to the entry (i,j) that gains a point in A′ compared to the corresponding entry in A. Because |D| = |D′| = n, both such entries must exist in this scenario.

Let us assume a D ~ D′ and G from Scenario B (otherwise the lemma follows trivially). We claim the following bound on γ⁻:

Claim B.10. For any n ≥ 4 and for P and P′ as described above, let (i,j) ∈ [k] × [ℓ] be the cell such that p′(i,j) = p(i,j) − 1/n. Then

γ⁻ := γ(i,j) ≤ (2 log₂ n)/n + 4/n .

For γ⁺, we have a similar claim:

Claim B.11. For any n ≥ 4 and for P and P′ as described above, let (i,j) ∈ [k] × [ℓ] be the cell such that p′(i,j) = p(i,j) + 1/n. Then

γ⁺ := γ(i,j) ≤ (2 log₂ n)/n + 2/n .

If we grant these claims true for now, then by expression (43) it follows that

| I(P) − I(P′) | ≤ ( (2 log₂ n)/n + 4/n ) + ( (2 log₂ n)/n + 2/n ) = (4 log₂ n)/n + 6/n ,

which is the statement of the lemma. So it only remains to prove the two claims.

Proof (of Claim B.10). For readability, let p := p(i,j) and p′ := p′(i,j). Since we assume p′ = p − 1/n, we have that

p(i,•) − 1/n ≤ p′(i,•) ≤ p(i,•) and p(•,j) − 1/n ≤ p′(•,j) ≤ p(•,j) .

Moreover, we have

γ⁻ = | p log₂ ( p / (p(i,•)p(•,j)) ) − (p − 1/n) log₂ ( (p − 1/n) / (p′(i,•)p′(•,j)) ) | .

Observe that when p = 1/n (and p′ = 0), then

γ⁻ = (1/n) | log₂ ( (1/n) / (p(i,•)p(•,j)) ) | .

When p = 1/n, then 1/n ≤ p(i,•), p(•,j) ≤ 1 (and thus 1/n² ≤ p(i,•)p(•,j) ≤ 1), so the ratio (1/n)/(p(i,•)p(•,j)) lies in [1/n, n], and it follows that γ⁻ ≤ (log₂ n)/n. On the other hand, consider when p ≥ 2/n (and thus 1/n ≤ p′ ≤ 1 − 1/n). We then have

γ⁻ = | p log₂ ( (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) + (1/n) log₂ ( (p − 1/n)/(p′(i,•)p′(•,j)) ) |
≤ p | log₂ ( (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) | + (1/n) | log₂ ( (p − 1/n)/(p′(i,•)p′(•,j)) ) | ,  (44)

where we use the triangle inequality. For the left hand term in (44), first say that (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) > 1. Then because p′(i,•)p′(•,j) ≤ p(i,•)p(•,j),

p log₂ ( (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) ≤ p log₂ ( p/(p − 1/n) ) ≤ (2/n) log₂ ( (2/n)/(1/n) ) = 2/n .

Here, the final inequality is because the function is decreasing in p, and p ≥ 2/n by assumption. On the other hand, consider when (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) < 1.
Then because p ≤ p(i,•), p(•,j) ≤ 1, and because p′(i,•)p′(•,j) ≥ (p(i,•) − 1/n)(p(•,j) − 1/n):

p | log₂ ( (p/(p − 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) | ≤ p log₂ ( (p(i,•)p(•,j)) / (p′(i,•)p′(•,j)) )
≤ p log₂ ( (p(i,•)p(•,j)) / ((p(i,•) − 1/n)(p(•,j) − 1/n)) )
≤ p(i,•) log₂ ( p(i,•)/(p(i,•) − 1/n) ) + p(•,j) log₂ ( p(•,j)/(p(•,j) − 1/n) ) .

Here, the two terms are decreasing in p(i,•) and p(•,j), respectively. Since p(i,•), p(•,j) ≥ p ≥ 2/n by assumption, we have

p(i,•) log₂ ( p(i,•)/(p(i,•) − 1/n) ) + p(•,j) log₂ ( p(•,j)/(p(•,j) − 1/n) ) ≤ (2/n) log₂ ( (2/n)/(1/n) ) + (2/n) log₂ ( (2/n)/(1/n) ) = 4/n ,  (46)

which bounds the left hand term of (44) in general. For the right hand term of (44), if (p − 1/n)/(p′(i,•)p′(•,j)) < 1, then

(1/n) | log₂ ( (p − 1/n)/(p′(i,•)p′(•,j)) ) | = (1/n) log₂ ( (p′(i,•)p′(•,j))/(p − 1/n) ) ≤ (1/n) log₂ ( 1/(2/n − 1/n) ) = (log₂ n)/n .

Here, the penultimate inequality is due to p′(i,•)p′(•,j) ≤ 1 and p ≥ 2/n by assumption. On the other hand, consider when (p − 1/n)/(p′(i,•)p′(•,j)) > 1. Then for the right hand term of (44), we have

(1/n) log₂ ( (p − 1/n)/(p′(i,•)p′(•,j)) ) ≤ (1/n) log₂ ( (p − 1/n) / ((p(i,•) − 1/n)(p(•,j) − 1/n)) ) .

Now because p ≤ min{p(i,•), p(•,j)}, then

(1/n) log₂ ( (p − 1/n)/(p′(i,•)p′(•,j)) ) ≤ max { (1/n) log₂ ( 1/(p(i,•) − 1/n) ) , (1/n) log₂ ( 1/(p(•,j) − 1/n) ) }
≤ (1/n) log₂ ( 1/(p(i,•) − 1/n) ) + (1/n) log₂ ( 1/(p(•,j) − 1/n) ) ,

where the inequality is due to max{x, y} ≤ x + y for all x, y ≥ 0. Because the terms in the sum are decreasing in p(i,•) and p(•,j) respectively, and because p(i,•), p(•,j) ≥ 2/n by assumption, we have

(1/n) log₂ ( 1/(p(i,•) − 1/n) ) + (1/n) log₂ ( 1/(p(•,j) − 1/n) ) ≤ (2 log₂ n)/n .  (47)

This bounds the right hand term of (44) in general. Now substituting the bounds (46) and (47) back into (44), we have

γ⁻ ≤ 4/n + (2 log₂ n)/n ,

which completes the proof.

The proof of Claim B.11 follows similarly:

Proof (of Claim B.11). Again for readability let p := p(i,j) and p′ := p′(i,j). Since we assume p′ = p + 1/n, we have that

p(i,•) ≤ p′(i,•) ≤ p(i,•) + 1/n ≤ 1 and p(•,j) ≤ p′(•,j) ≤ p(•,j) + 1/n ≤ 1 ,

and

γ⁺ = | p log₂ ( p / (p(i,•)p(•,j)) ) − (p + 1/n) log₂ ( (p + 1/n) / (p′(i,•)p′(•,j)) ) | .

Now observe that when p = 0 and p′ = 1/n, then

γ⁺ = (1/n) | log₂ ( (1/n) / (p′(i,•)p′(•,j)) ) | ≤ (log₂ n)/n ,

where the inequality is due to 1/n² ≤ p′(i,•)p′(•,j) ≤ 1, since p′(i,•), p′(•,j) ≥ p′ = 1/n. Consider now when p ≥ 1/n. Then

γ⁺ = | p log₂ ( (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) − (1/n) log₂ ( (p + 1/n)/(p′(i,•)p′(•,j)) ) |
≤ p | log₂ ( (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) | + (1/n) | log₂ ( (p + 1/n)/(p′(i,•)p′(•,j)) ) | ,  (48)

which follows by the triangle inequality. For the left hand term in (48), first say that (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) > 1. Then because p′(i,•) ≤ p(i,•) + 1/n and p′(•,j) ≤ p(•,j) + 1/n, and because the ratios (p(i,•) + 1/n)/p(i,•) and (p(•,j) + 1/n)/p(•,j) are decreasing in p(i,•) ≥ p and p(•,j) ≥ p respectively,

p log₂ ( (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) ≤ p log₂ ( (p/(p + 1/n)) · ((p + 1/n)/p)² ) = p log₂ ( (p + 1/n)/p ) .

The expression p log₂((p + 1/n)/p) is increasing with p, and p ≤ 1 − 1/n = (n−1)/n by the assumption that p′ ≤ 1. So

p log₂ ( (p + 1/n)/p ) ≤ ((n−1)/n) log₂ ( n/(n−1) ) ≤ log₂ ( n/(n−1) ) .

Now because 2^{29/20} ≈ 2.73 ≥ e, it follows that for all x ≥ 0, 1 + x ≤ eˣ ≤ 2^{(29/20)x}, which implies log₂(1 + x) ≤ (29/20)x. Because (n + 40/29)/n ≥ n/(n−1) for all n ≥ 4, applying this identity gives

log₂ ( n/(n−1) ) ≤ log₂ ( 1 + (40/29)/n ) ≤ (29/20) · (40/29)/n = 2/n .  (49)

So in this case, the left hand term of (48) is at most 2/n. On the other hand, suppose (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) < 1. Then

p | log₂ ( (p/(p + 1/n)) · (p′(i,•)p′(•,j))/(p(i,•)p(•,j)) ) | ≤ p log₂ ( ((p + 1/n)/p) · (p(i,•)p(•,j))/(p′(i,•)p′(•,j)) ) ≤ p log₂ ( (p + 1/n)/p ) .

Here, the first inequality holds given that p(i,•)p(•,j) ≤ p′(i,•)p′(•,j). The second inequality follows because the expression is increasing in p, which can be at most (n−1)/n by assumption.
Applying the bound from (49) then yields

p log₂ ( (p + 1/n)/p ) ≤ ((n−1)/n) log₂ ( n/(n−1) ) ≤ 2/n ,

which bounds the left hand term of (48) in general. For the right hand term of (48), consider when (p + 1/n)/(p′(i,•)p′(•,j)) < 1. Then because p′(i,•), p′(•,j) ≤ 1 and p ≥ 1/n by assumption, the right hand term of (48) becomes

(1/n) log₂ ( (p′(i,•)p′(•,j))/(p + 1/n) ) ≤ (1/n) log₂ ( 1/(1/n + 1/n) ) = (1/n) log₂ (n/2) ≤ (log₂ n)/n .  (50)

On the other hand, consider when (p + 1/n)/(p′(i,•)p′(•,j)) > 1. Then because p + 1/n ≤ 1 and because p′(i,•)p′(•,j) ≥ 1/n² by the assumption that p′(i,•), p′(•,j) ≥ p′ ≥ 1/n,

(1/n) log₂ ( (p + 1/n)/(p′(i,•)p′(•,j)) ) ≤ (1/n) log₂ ( 1/(1/n²) ) = (2 log₂ n)/n ,  (51)

which bounds the right hand term of (48) in general. Combining the bounds (50) and (51) and substituting back into (48) then shows

γ⁺ ≤ 2/n + (2 log₂ n)/n ,

which completes the proof.

C. MICr-Lap Details

In this section, we provide the proofs of Theorem 4.4 (MICr-Lap privacy) and Theorem C.1 (MICr-Lap consistency).

C.1. MICr-Lap Privacy

We begin with the privacy guarantee of MICr-Lap (restated for convenience), whose proof follows identically to that of Theorem 3.2, which gave the privacy guarantee for MICe-Lap:

Theorem 4.4. MICr-Lap(·, L, B, c, ϵ) is ϵ-DP.

Proof. This follows by the privacy of the Laplace mechanism, and because truncating to the [0, 1] interval is a post-processing operation that, by Theorem A.7, preserves privacy.

C.2. MICr-Lap Consistency

We now prove that MICr-Lap is a consistent estimator of MIC*.

Theorem C.1. For every finite c > 0, ϵ > 0, α ∈ (0, 0.5), and ω(1) ≤ B(n) = O(n^α), MICr-Lap(·, L, B, c, ϵ) is a consistent estimator of MIC*(·).

Proof. Recall from Section 4 that the standard deviation of the noise added by the MICr-Lap mechanism is at most (√2/ϵ)((4 log₂ n)/n + 6/n), which is decreasing in n for a fixed ϵ. Coupled with the fact that (non-private) MICr is a consistent estimator of MIC* when ω(1) ≤ B(n) = O(n^α) for α ∈ (0, 0.5) (Theorem B.5), it follows that MICr-Lap is also a consistent estimator under the same parameter settings of B for any c and ϵ.

D. MICr-Geom Details

In this section, we provide more details on the MICr-Geom mechanism and give the proofs of its privacy (Theorem 4.6) and consistency (Theorems 4.7 and D.4).

D.1. MICr-Geom Privacy

We start with the privacy guarantee of MICr-Geom from Theorem 4.6, which is restated as follows:

Theorem 4.6. MICr-Geom(D, L, B, c, ϵ) is ϵ-DP.

The proof uses the privacy of the Trunc-Geom mechanism from Ghosh et al. (2012), as well as the composition property of differentially private mechanisms. We state these two tools here:

Theorem D.1 (Privacy of Trunc-Geom, Ghosh et al. (2012)). Assume that f is the value of some function of a dataset D ∈ R^{2×n} with ℓ1 sensitivity 1 and range {0, ..., n}. Then for any ϵ > 0 and n, Trunc-Geom(ϵ, n, f) is ϵ-DP.

Theorem D.2 (General Composition of DP Mechanisms, Dwork & Roth (2014)). For any mechanisms A₁ : X₁ → Y₁ and A₂ : X₂ → Y₂, if A₁ and A₂ are ϵ₁- and ϵ₂-DP, respectively, then the mechanism A : (X₁ × X₂) → (Y₁ × Y₂) defined by A(x₁, x₂) = (A₁(x₁), A₂(x₂)) for x₁ ∈ X₁, x₂ ∈ X₂ is (ϵ₁ + ϵ₂)-DP.

The differential privacy of MICr-Geom follows from Theorems D.1 and D.2, and from the post-processing principle of Theorem A.7:

Proof (of Theorem 4.6). For each ℓ ≥ 2, the counts of all entries of the noisy count matrix Â are all independently (ϵ/2)-DP via the privacy of the Trunc-Geom mechanism from Theorem D.1. Since at most two corresponding entries between A_{D,G} and A_{D′,G} can have non-zero difference for any neighboring D ~ D′ and any fixed G, it follows from Theorem D.2 that the entire noisy count matrix Â is ϵ-DP. The same argument holds independently for all k ≥ 2. Then by the post-processing principle, every (k, ℓ) entry of M̂^{R(L,c)}_{D,ϵ} is independently ϵ-DP. Because the MICr-Geom(D, L, B, c, ϵ) statistic only returns one such entry, the final output of the mechanism is ϵ-DP.
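The sketch below shows one way to implement a draw from a truncated two-sided geometric mechanism in the style of Ghosh et al. (2012), building the exact PMF of the clamped output and sampling from it. It reflects our reading of Definition 4.5 (out-of-range mass moved to the boundary outputs); the paper's exact truncation and sampling code may differ:

```python
import numpy as np

def trunc_geom_release(f, n, eps, rng=None):
    """One draw of a truncated two-sided geometric mechanism: add Z with
    Pr[Z = z] proportional to exp(-eps)**|z|, then clamp to {0, ..., n}.
    A sketch under our reading of Definition 4.5."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.exp(-eps)
    z = np.arange(-f, n - f + 1)                  # feasible noise values
    pmf = (1 - alpha) / (1 + alpha) * alpha ** np.abs(z)
    pmf[0] += alpha ** (f + 1) / (1 + alpha)      # lower tail mass -> 0
    pmf[-1] += alpha ** (n - f + 1) / (1 + alpha)  # upper tail mass -> n
    pmf /= pmf.sum()                              # guard against rounding
    return int(rng.choice(np.arange(0, n + 1), p=pmf))
```

As noted in Appendix D.2 below, in practice the PMF can be precomputed once per (n, ϵ) and reused across all cells, and a constant-time sampler (e.g., the alias method) can replace the generic choice call.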
D.2. Computing MICr-Geom

Recall from the definition of MICr-Geom in Section 4 that for a fixed ℓ, we construct a cℓ × ℓ matrix Â_{D,Γ} where each entry is generated using the Trunc-Geom mechanism from Definition 4.5. Because the Trunc-Geom outputs range from 0 to n, for any n and ϵ, an array representing the exact PMF of the Trunc-Geom distribution can be constructed in linear time, with subsequent samples drawn in constant time. Because the subsequent steps of computing MICr-Geom do not differ from computing MICr, the running time of the former is asymptotically equivalent to that of the latter (i.e., from Theorem B.1). Stated formally:

Theorem D.3. For any dataset D of size n restricted to L, B := B(n), c > 0, and ϵ > 0, MICr-Geom(D, L, B, c, ϵ) can be computed using OPTIMIZEAXIS and a pre-computed Trunc-Geom distribution in O(c²B⁴) time.

D.3. MICr-Geom Consistency

In this section, we develop the proof of Theorem D.4, which shows that the MICr-Geom statistic is a consistent estimator of MIC*. Formally, we prove the following theorem:

Theorem D.4. For every finite c > 0, ϵ > 0, α ∈ (0, 0.5), and ω(1) ≤ B(n) = O(n^α), MICr-Geom(·, L, B, c, ϵ) is a consistent estimator of MIC*(·).

The proof of the theorem leverages Theorem 4.7, which shows that the error introduced by the MICr-Geom mechanism in the (k, ℓ)-th entry of the range-equicharacteristic matrix vanishes as n grows. Again restated:

Theorem 4.7 (Added error of MICr-Geom). Fix any α ∈ (0, 0.5), finite c > 0, ϵ > 0, and dataset D of size n. For sufficiently large n, there exists some a > 0 such that

| (M̂^{R(L,c)}_{D,ϵ})_{k,ℓ} − (M^{R(L,c)}_D)_{k,ℓ} | ≤ O( (c/ϵ) n^{−a} )

for all kℓ ≤ B(n) = O(n^α) simultaneously with probability at least 1 − O(n^{−2}).

Granting Theorem 4.7 true for now, the proof of Theorem D.4 is straightforward:

Proof (of Theorem D.4). Fix any jointly-distributed pair of random variables Π = (X, Y) with range bounded by L. For any 0 < α < 0.5, fix B(n) = O(n^α), and fix finite c > 0 and ϵ > 0. Our goal is to show that MICr-Geom(Dₙ, L, B(n), c, ϵ) → MIC*(Π) = sup M^G_Π in probability as n → ∞. This means that for every τ > 0 and every 0 < p ≤ 1, we must show that there exists some N such that

Pr [ | max_{k,ℓ: kℓ ≤ B(n)} (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − sup M^G_Π | < τ ] > 1 − p  (52)

for all n ≥ N. By the triangle inequality, observe that

| max_{k,ℓ: kℓ ≤ B(n)} (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − sup M^G_Π |
≤ | max_{k,ℓ: kℓ ≤ B(n)} (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} |  (53)
+ | max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} − sup M^G_Π | .  (54)

By Theorem B.5 (consistency of the non-private MICr statistic), the term in (54) converges in probability to 0 as n → ∞. Similarly, by Theorem 4.7, the term in (53) converges in probability to 0 as n → ∞. Thus for any τ and p, there exist integers N₁ and N₂ such that both terms are each bounded from above by τ/2 with probability greater than 1 − p for all n ≥ max{N₁, N₂}, which completes the proof.

It remains to prove Theorem 4.7, which we develop in the next subsection.
D.4. Proof of Theorem 4.7

The proof of this theorem relies on deriving a bound on the change in mutual information between the noisy distribution P̂^ϵ_{D,Γ} and the non-noisy P_{D,Γ}, where Γ is the master range-equipartition grid for a fixed (wlog) k ≤ ℓ. First, observe that because

| max_{k,ℓ: kℓ ≤ B(n)} (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − max_{k,ℓ: kℓ ≤ B(n)} (M^{R(L,c)}_{Dₙ})_{k,ℓ} | ≤ max_{k,ℓ: kℓ ≤ B(n)} | (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − (M^{R(L,c)}_{Dₙ})_{k,ℓ} | ,  (55)

our goal will be to derive a sufficiently small uniform upper bound on

| (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − (M^{R(L,c)}_{Dₙ})_{k,ℓ} |  (56)

that holds for any kℓ ≤ B(n) with probability at least 1 − p. Taking a union bound over all possible (k, ℓ) pairs (of which there are at most B(n)²) means that the uniform bound on the absolute difference holds simultaneously for all kℓ ≤ B(n) with probability at least 1 − B(n)²p.

To derive an upper bound on (56), recall that for a fixed (wlog) k ≤ ℓ, the grid Γ := Γ_{c,ℓ} is the master range-equipartition grid for R(L, c, k, ℓ). Then

| (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − (M^{R(L,c)}_{Dₙ})_{k,ℓ} | = | max_{G ∈ R(L,c,k,ℓ)} I(P̂^ϵ_{D,Γ,G}) / log₂ k − max_{G ∈ R(L,c,k,ℓ)} I(P_{D,Γ,G}) / log₂ k |
≤ max_{G ∈ R(L,c,k,ℓ)} | I(P̂^ϵ_{D,Γ,G}) − I(P_{D,Γ,G}) | ,  (57)

where the last line is due to the triangle inequality and since log₂ k ≥ 1. Now for suitable B(n), and for a fixed k ≤ ℓ such that kℓ ≤ B(n), we claim the following probabilistic bound on (57).

Lemma D.5. Fix any finite c > 0, ϵ > 0, α ∈ (0, 0.5), and B(n) = O(n^α). For sufficiently large n and any kℓ ≤ B(n), let Γ denote the master range equipartition for R(L, c, k, ℓ). Then there exists some u > 0 such that, for any dataset D of size n and G ∈ R(L, c, k, ℓ),

| I(P̂^ϵ_{D,Γ,G}) − I(P_{D,Γ,G}) | ≤ O( (c/ϵ) n^{−u} )

with probability at least 1 − O(n^{−3}).

If we again grant this lemma true for now, the proof of Theorem 4.7 follows:

Proof (of Theorem 4.7). By Lemma D.5, for any kℓ ≤ B(n) = O(n^α), expression (57) is bounded from above by O(c/(ϵ nᵘ)) for some constant u > 0 (that depends on k, ℓ) with probability at least 1 − O(n^{−3}). Let a > 0 be the minimum of all such constants over all kℓ ≤ B(n). Then for all kℓ ≤ B(n),

| (M̂^{R(L,c)}_{Dₙ,ϵ})_{k,ℓ} − (M^{R(L,c)}_{Dₙ})_{k,ℓ} | ≤ O( (c/ϵ) n^{−a} )

with probability at least 1 − O(B(n)²/n³) ≥ 1 − O(n^{−2}). The error probability holds simultaneously over all kℓ ≤ B(n) and follows by taking a union bound over all such entries, of which there are at most B(n)² ≤ O(n^{2α}) ≤ O(n) since α < 0.5. In particular, this means the bound holds for the maximizing entry of expression (55), which completes the proof.

It remains to prove Lemma D.5, which requires the most technical machinery.

D.4.1. PROOF OF LEMMA D.5

We will use a tool from Reshef et al. (2016), which relates the statistical distance between two discrete distributions to their difference in mutual information. For two discrete distributions Π and Ψ over [k] × [ℓ], let D_TV(Π, Ψ) denote their statistical (total variation) distance, and recall that D_TV(Π, Ψ) = (1/2) ‖Π − Ψ‖₁. We have the following lemma of Reshef et al.:

Lemma D.6 (Reshef et al. (2016), Proposition 40, Appendix B). Let Π and Ψ be discrete distributions over [k] × [ℓ] for k, ℓ ≥ 2. For any 0 < δ ≤ 1/4, if D_TV(Π, Ψ) ≤ δ, then

| I(Π) − I(Ψ) | ≤ O( δ log₂( min{k, ℓ} / δ ) ) .

For a fixed dataset D of size n, finite c > 0, a fixed k ≤ ℓ, and a fixed G ∈ R(L, c, k, ℓ) with master range equipartition Γ, we will bound the difference in mutual information between P̂^ϵ_{D,Γ,G} and P_{D,Γ,G} by providing an upper bound on D_TV(P̂^ϵ_{D,Γ,G}, P_{D,Γ,G}). But given that every G ∈ R(L, c, k, ℓ) is a subpartition of Γ, it follows by the triangle inequality that

D_TV(P̂^ϵ_{D,Γ,G}, P_{D,Γ,G}) ≤ D_TV(P̂^ϵ_{D,Γ}, P_{D,Γ}) ,  (58)

and thus our goal will be to provide a probabilistic upper bound on D_TV(P̂^ϵ_{D,Γ}, P_{D,Γ}).
Assume wlog that k ≤ ℓ, which means that the master range-equipartition grid Γ is of size cℓ × ℓ. Fix ϵ > 0, and let p̂(i,j) and p(i,j) denote the (i,j) entries of P̂^ϵ_{D,Γ} and P_{D,Γ}. By the definition of the MICr-Geom mechanism from Section 4.3, recall that for all (i,j) ∈ [cℓ] × [ℓ], we define

â(i,j) ~ Trunc-Geom(ϵ/2, n, p(i,j)·n) , n̂ = Σ_{i,j} â(i,j) , and p̂(i,j) = â(i,j) / n̂ .

For all (i,j), let Δ_{i,j} be a random variable defined by

Δ_{i,j} = â(i,j) − p(i,j)·n ,  (59)

which can be thought of as the change in the number of datapoints in cell (i,j) after applying the Trunc-Geom mechanism. For all (i,j), the value Δ_{i,j} is distributed like a doubly-geometric random variable centered at 0 with parameter e^{−ϵ/2}, but with truncation at n − (p(i,j)·n) and −(p(i,j)·n). So we can rewrite each p̂(i,j) as

p̂(i,j) = ( p(i,j)·n + Δ_{i,j} ) / n̂ ,  (60)

and moreover, we can write

n̂ = n + Σ_{i,j} Δ_{i,j} .  (61)

It follows that

D_TV(P̂^ϵ_{D,Γ}, P_{D,Γ}) = (1/2) Σ_{i,j} | p̂(i,j) − p(i,j) | = (1/2) Σ_{i,j} | p(i,j) ( n/n̂ − 1 ) + Δ_{i,j}/n̂ | .

By the triangle inequality, and increasing the 1/2 term to 1, we have

D_TV(P̂^ϵ_{D,Γ}, P_{D,Γ}) ≤ Σ_{i,j} p(i,j) | n/n̂ − 1 | + Σ_{i,j} |Δ_{i,j}| / n̂ = | n/n̂ − 1 | + Σ_{i,j} |Δ_{i,j}| / n̂ ,  (62)

where the final equality holds because Σ_{i,j} p(i,j) = 1. The following lemma gives a uniform upper bound on |Δ_{i,j}| that will translate into bounds on both terms of (62). Note that the lemma is stated and proven with respect to the fixed parameters (namely ϵ) already stated in this section.

Lemma D.7. For any (i,j) ∈ [cℓ] × [ℓ] and for any number x > 0,

Pr ( |Δ_{i,j}| > x ) ≤ exp( −(ϵ/2)·x ) .

Proof. Recall Δ_{i,j} is a doubly-geometric random variable with parameter e^{−ϵ/2} centered at 0 and truncated at n − (p(i,j)·n) and −(p(i,j)·n). Let Δ denote a non-truncated doubly-geometric random variable with parameter e^{−ϵ/2} centered at 0. It is easy to see that for any number x > 0,

Pr ( |Δ_{i,j}| > x ) ≤ Pr ( |Δ| > x ) .

Then using the CDF of the doubly-geometric distribution, we have

Pr ( |Δ| > x ) = 1 − ( (1 − e^{−ϵ/2}) / (1 + e^{−ϵ/2}) ) Σ_{z: |z| ≤ x} e^{−(ϵ/2)|z|} ≤ exp( −(ϵ/2)·x ) .

The doubly-geometric distribution with parameter e^{−ϵ/2} is a discrete approximation of a continuous Laplace distribution with parameter 2/ϵ, and it is known that for any x > 0,

Pr ( |Lap(2/ϵ)| > x ) = exp( −(ϵ/2)·x ) .

Thus the tail of every Δ_{i,j} r.v. with parameter e^{−ϵ/2} is dominated by the tail of the non-truncated r.v. with the same parameter, which is dominated by the tail of a continuous Laplace r.v. with parameter 2/ϵ, completing the proof.

Using this tail bound (which applies uniformly to all (i,j)), we have the following corollary.

Corollary D.8. For every (i,j) and for any d > 0,

Pr ( |Δ_{i,j}| > d ln n ) ≤ exp( −(ϵ/2) d ln n ) = n^{−(ϵ/2)d} .

By a union bound over all cℓ² entries (i,j), it follows that for all d > 0,

Σ_{i,j} |Δ_{i,j}| ≤ cℓ² (d ln n)  (63)

with probability all but (cℓ²) n^{−(ϵ/2)d}. Because n̂ = n + Σ_{i,j} Δ_{i,j} (from expression (61)), with this same probability:

n̂ ≤ n + Σ_{i,j} |Δ_{i,j}| ≤ n + cℓ² d ln n  (64)
n̂ ≥ n − Σ_{i,j} |Δ_{i,j}| ≥ n − cℓ² d ln n .  (65)

Using these bounds on n̂ and Σ_{i,j} Δ_{i,j}, we state and prove the following two bounds on the terms from expression (62).

Lemma D.9. For sufficiently large n, and assuming B(n) = O(n^α) for α ∈ (0, 0.5), there exists a constant u₁ > 0 such that

| n/n̂ − 1 | ≤ O( c/(ϵ n^{u₁}) )

with probability at least 1 − O(n^{−3}).

Proof. First, suppose that (n/n̂) ≥ 1. Then |(n/n̂) − 1| is maximized when n̂ is small. Using the lower bound on n̂ from (65), it follows that

| n/n̂ − 1 | = (n − n̂)/n̂ ≤ (cℓ² d ln n) / (n − cℓ² d ln n)

with probability all but (cℓ²) n^{−(ϵ/2)d}.
By the assumption that α < 0.5, it follows that ℓ² ≤ (B(n))² ≤ O(n^{2α}) = o(n). Setting d := 8/ϵ, it follows that cℓ² d ln n ≤ O((c/ϵ) n^{1−t}) for some constant t > 0, and that n − cℓ² d ln n = Ω(n), which means that

| n/n̂ − 1 | ≤ (cℓ² d ln n) / (n − cℓ² d ln n) ≤ O( c/(ϵ nᵗ) )

with probability all but (cℓ²) n^{−(ϵ/2)d} = O(c n^{2α}/n⁴) = O(n^{−3}). On the other hand, suppose (n/n̂) < 1. Then |1 − (n/n̂)| is maximized when n̂ is large, and using the upper bound on n̂ from (64), it follows that

| 1 − n/n̂ | = (n̂ − n)/n̂ ≤ (cℓ² d ln n) / (n + cℓ² d ln n)

with probability all but (cℓ²) n^{−(ϵ/2)d}. By similar arguments as in the first case, and again setting d := 8/ϵ, it then follows that for some constant v > 0,

| 1 − n/n̂ | ≤ (cℓ² d ln n) / (n + cℓ² d ln n) ≤ O( c/(ϵ nᵛ) )

with probability all but O(n^{−3}). Setting u₁ := min{t, v} then concludes the proof.

Lemma D.9 gives a bound on the first term from expression (62). We now prove a bound on the second term.

Lemma D.10. For sufficiently large n, and assuming B(n) = O(n^α) for α ∈ (0, 0.5), there exists a constant u₂ > 0 such that

Σ_{i,j} |Δ_{i,j}| / n̂ ≤ O( c/(ϵ n^{u₂}) )

with probability all but O(n^{−3}).

Proof. As in the proof of Lemma D.9, set d := 8/ϵ. Again using the fact that ℓ² ≤ B(n)² ≤ O(n^{2α}) = o(n) by the assumption that α < 0.5, observe from (65) that n̂ > 0 with probability all but O(n^{−3}) for sufficiently large n. So with this same probability we can write

Σ_{i,j} |Δ_{i,j}| / n̂ ≤ ( cℓ² (8/ϵ) ln n ) / ( n − cℓ² (8/ϵ) ln n ) .

Here, since ℓ² ln n ≤ n^{2α} ln n, it follows that the denominator of the final expression is Ω(n), and the numerator is O((c/ϵ) n^{1−u₂}) for some u₂ > 0. Thus with probability all but O(n^{−3}), it follows that

Σ_{i,j} |Δ_{i,j}| / n̂ ≤ ( cℓ² (8/ϵ) ln n ) / ( n − cℓ² (8/ϵ) ln n ) ≤ O( c/(ϵ n^{u₂}) ) .

Together, Lemmas D.9 and D.10 give the following probabilistic bound on the total variation distance between P̂^ϵ_{D,Γ} and P_{D,Γ} from expression (62), and thus a proof of Lemma D.5.

Proof (of Lemma D.5). By expressions (58) and (62), and using Lemmas D.9 and D.10, it follows that for sufficiently large n, there exist constants u₁, u₂ > 0 such that for any G ∈ R(L, c, k, ℓ),

D_TV(P̂^ϵ_{D,Γ,G}, P_{D,Γ,G}) ≤ D_TV(P̂^ϵ_{D,Γ}, P_{D,Γ}) ≤ O( c/(ϵ n^{u₁}) ) + O( c/(ϵ n^{u₂}) ) ≤ O( c/(ϵ n^{u₃}) )

with probability at least 1 − O(n^{−3}), where u₃ := min{u₁, u₂} > 0. Then by Lemma D.6, it follows that

| I(P̂^ϵ_{D,Γ,G}) − I(P_{D,Γ,G}) | ≤ O( (c/(ϵ n^{u₃})) · log₂( n^{α + u₃} ) ) ≤ O( c/(ϵ nᵘ) )

with probability at least 1 − O(n^{−3}) for some u₃ > u > 0 when n is sufficiently large.
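A toy illustration of the argument just completed: add two-sided geometric noise to a small count matrix, renormalize, and watch both the total variation distance and the mutual information gap shrink as n grows. The 4 × 4 uniform cell distribution and all names are our own choices:

```python
import numpy as np

def mi_bits(A):
    """Plug-in mutual information (bits) of a nonnegative count matrix."""
    P = A / A.sum()
    px, py = P.sum(0, keepdims=True), P.sum(1, keepdims=True)
    m = P > 0
    return float((P[m] * np.log2(P[m] / (py @ px)[m])).sum())

rng = np.random.default_rng(0)
eps = 1.0
alpha = np.exp(-eps / 2)
for n in (100, 1000, 10000, 100000):
    A = rng.multinomial(n, np.full(16, 1 / 16)).reshape(4, 4)
    # two-sided geometric noise as a difference of two geometrics, clamped
    g = rng.geometric(1 - alpha, size=(2, 4, 4))
    A_hat = np.clip(A + g[0] - g[1], 0, n)
    P, P_hat = A / A.sum(), A_hat / A_hat.sum()
    dtv = 0.5 * np.abs(P_hat - P).sum()
    print(n, round(dtv, 5), round(abs(mi_bits(A_hat) - mi_bits(A)), 5))
```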
E. Experimental Results Appendix

In this section, we provide more details on the methodologies used for our experimental results from Section 5.

E.1. Implementation Details

As mentioned in Section 2, the MICe statistic in Definition 2.3 (and the analogous definitions of MICr and its private variants) varies slightly from the definition of MICe from Reshef et al. (2016), specifically when defining the maximization space of individual entries in the equicharacteristic matrix.³ In particular, for k ≤ ℓ, when ℓ > √B, Reshef et al. (2016) set the master row partition used in the optimization as a mass equipartition of size c(B/ℓ). In contrast, Definition 2.3 sets the master row partition as a mass equipartition of size cℓ, even when ℓ > √B (note that the definitions are identical in the case when ℓ ≤ √B). The advantage of using these smaller master partitions is computational: the MICe variant of Reshef et al. (2016) can be computed in O(c²B^{2.5}) time (Appendix H.1, Reshef et al. (2016)), in contrast with the O(c²B⁴) running time from Theorem A.1 using Definition 2.3. On the other hand, for B := B(n) = O(n^α) where α ∈ (0, 0.5), the consistency of this variant can only be shown to hold when the output of the statistic is the maximum (k, ℓ) entry of the equicharacteristic matrix where kℓ ≤ B and k and ℓ are individually at most √B. Note, however, that although consistency for this variant is only shown to hold when imposing this additional constraint on k and ℓ, the statistic is still both defined and computable (i.e., as in the implementation of Albanese et al. (2018)) as a maximization over all equicharacteristic matrix entries where kℓ ≤ B.

The MICr statistic from Definition 4.1 (and its private variants) can be adjusted in an analogous way to reduce the size of the master row range equipartitions when √B < ℓ ≤ B/2. Again, the benefit is faster computation at the expense of a slightly different consistency guarantee (analogous to that of the MICe variant described above).

³ Reshef et al. (2016) refer to this as the "clumped" variant of MICe, but to avoid confusion we refer to this as just MICe.

Primarily for computational considerations, our experiments in this work use implementations of MICe and MICr (and their private variants) that follow the original definition of Reshef et al. (2016) (i.e., using the smaller master partition sizes for large ℓ). While these variants have slightly different consistency properties than their counterparts defined in Sections 2, 3, and 4, in practice the outputs of corresponding pairs of variants are similar. Note that an end-user with a fixed computational budget could use either set of definitions and tune the B and c parameters accordingly to meet their budget. Moreover, we remark that the ϵ-DP guarantees of the private MICe and MICr mechanisms introduced in this work apply to either regime of master grid size settings.

In our experiments, to compute MICe we used the MINEPY library implementation from Albanese et al. (2018), and to compute MICr and its private variants, we developed an implementation available at https://github.com/jlazarsfeld/dp-mic.

E.2. Synthetic Data Experiments

E.2.1. FUNCTIONAL RELATIONSHIPS USED TO GENERATE DISTRIBUTIONS

As described in Section 5, we evaluated our private mechanisms on distributions based on a set of 21 functional relationships originally defined by Reshef et al. (2011). We list these functions here:

F1: f(x) = 0.2 sin(12x − 6) + 1.1(x − 1) + 1
F2: f(x) = 0.15 sin(11x − π) + (x + 0.05)
F3: f(x) = 0.1 sin(48x) + 2(x − 0.05)
F4: f(x) = 0.2 sin(48x) + 2(x − 0.05)
F5: f(x) = 0.4 cos(7x − π) + 0.5
F6: f(x) = 0.4 cos(14x − π) + 0.5
F7: f(x) = 10(x − 0.6)³ + 2x² + (1.5 − 3x)
F8: f(x) = 4(10(x − 0.6)³ + 2x² + (1.5 − 3x)) − 1.4
F9: if x ≤ 0.99 then f(x) = x/99, else f(x) = 99x − 98
F10: f(x) = 2x − 1
F11: f(x) = 8(x − 0.3) − 1
F12: f(x) = x
F13: f(x) = 4(x − 0.5)² + 0.1
F14: f(x) = 0.4 sin(9x − π) + 0.5
F15: f(x) = 0.4 sin(8x − π) + 0.5
F16: f(x) = 0.4 sin(16x − π) + 0.5
F17: if x < 0.491 then f(x) = 0.05, else if x > 0.509 then f(x) = 0.95, else f(x) = 50(x − 0.5) + 0.5
F18: f(x) = 0.4 cos(5x − π(1 + x)) + 0.5
F19: f(x) = 0.4 sin(6x − π(1 + x)) + 0.5
F20: if x ≤ 0.0528 then f(x) = 18x, else if x ≤ 0.1 then f(x) = −18x + 1.9, else f(x) = (1 − x)/9
F21: if x ≤ 0.0051 then f(x) = 190x, else if x ≤ 0.01 then f(x) = −198x + 1.99, else f(x) = (1 − x)/99

For each function, we generated 9 joint distributions by placing k = 100 generating points on the function graph at evenly spaced intervals. We considered a bivariate Gaussian distribution centered at each point with ρ = 0 and identical variances. The magnitude of the variance was set according to the desired R² value of the resulting noisy functional relationship, using the method described in Reshef et al. (2018).
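A minimal sketch of this generation procedure, assuming the n samples are drawn uniformly over the k mixture components; the mapping from a target R² to the variance (Reshef et al., 2018) is omitted:

```python
import numpy as np

def sample_noisy_relationship(f, n, sigma, k=100, rng=None):
    """Draw n points from the mixture of k isotropic Gaussians (rho = 0,
    equal variances sigma**2) centered at evenly spaced points on the
    graph of f over [0, 1]. Uniform component choice is our assumption."""
    rng = np.random.default_rng() if rng is None else rng
    cx = np.linspace(0, 1, k)                 # generating points on [0, 1]
    centers = np.column_stack((cx, f(cx)))
    idx = rng.integers(0, k, size=n)          # mixture component per sample
    return centers[idx] + rng.normal(scale=sigma, size=(n, 2))

# e.g., F12 (identity) with a moderate noise level:
pts = sample_noisy_relationship(lambda x: x, n=500, sigma=0.05)
```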
E.2.2. APPROXIMATELY COMPUTING MIC*

For a joint distribution Π = (X, Y), we used the method described in Section 3.5 of Reshef et al. (2016) to approximately compute MIC*(Π). This involves maximizing the mutual information I(X|Q, Y|P) over increasingly large discrete partitions P of size k chosen from a dense master mass equipartition Γ, where Q is a dense, fixed-size master mass equipartition. For this, we set the size of Γ to 260 and the size of Q to 360, and we found the computation insensitive to increases in master grid sizes beyond this point.

E.2.3. TUNING PARAMETERS FOR MICr-LAP AND MICr-GEOM

To determine optimal B and c parameter settings for MICr-Lap and MICr-Geom at a particular value of n and ϵ, we defined the following objective function WSUM using the set of 189 distributions in Q. First, we sorted Q in increasing order of MIC* value (using the method described in the previous subsection). For the i-th distribution Πᵢ in sorted order, we defined a weight wᵢ by taking the length of the interval between the midpoint of MIC*(Π_{i−1}) and MIC*(Πᵢ) and the midpoint of MIC*(Πᵢ) and MIC*(Π_{i+1}). For the distributions with the smallest and largest MIC* values, we used 0 and 1 as the left and right interval endpoints, respectively. Then, for a fixed mechanism, B, c, n, and ϵ, we took the weighted sum of the mechanism's average unsigned error on each distribution wrt MIC*, using the wᵢ's as weights. By minimizing this function wrt the B and c parameters for each mechanism and (n, ϵ) combination, our goal was to determine parameter settings that ensured equal levels of error across the entire spectrum of low-correlation (low MIC*) and high-correlation (high MIC*) distributions.

             MICr-Geom              MICr-Lap
             ϵ = 1.0    ϵ = 0.1     ϵ = 1.0    ϵ = 0.1
n = 25       (2, 12)    (2, 6)      (5, 8)     (5, 6)
n = 250      (1, 40)    (2, 10)     (5, 40)    (5, 40)
n = 500      (1, 40)    (2, 20)     (5, 60)    (5, 80)
n = 1000     (1, 60)    (2, 40)     (5, 80)    (5, 100)
n = 5000     (1, 150)   (1, 40)     (5, 150)   (5, 125)
n = 10000    (1, 150)   (1, 80)     (5, 150)   (5, 150)

Table 2. Each entry is the optimal (c, B) setting for the corresponding mechanism and (n, ϵ), found by minimizing a weighted sum of the mechanism's absolute error (wrt MIC*) over all Π ∈ Q.

Table 2 shows these optimized parameters for the MICr-Lap and MICr-Geom mechanisms. For both mechanisms and both ϵ values, the optimal B values are generally increasing with n, which aligns with the intuition that both mechanisms' outputs converge toward MIC* with larger n. Notice also that for ϵ = 0.1 and small n, the optimal c value of MICr-Geom is 2. Although the error of MICr-Geom wrt MICr decreases with smaller c, we expect the error of MICr wrt MIC* to decrease with larger c and B. For small n and ϵ, this leads to an optimization tradeoff that is more pronounced.
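The weight construction is compact enough to state in code. The following sketch computes the WSUM objective from the MIC* values and the per-distribution average absolute errors (names are ours):

```python
import numpy as np

def wsum(mic_star, avg_abs_err):
    """WSUM objective from Appendix E.2.3: a weighted sum of each
    distribution's average |error| wrt MIC*, with weights given by the
    midpoint intervals of the sorted MIC* values (0 and 1 at the ends)."""
    order = np.argsort(mic_star)
    m = np.asarray(mic_star)[order]
    mids = np.concatenate(([0.0], (m[:-1] + m[1:]) / 2, [1.0]))
    w = np.diff(mids)                       # one weight per distribution
    return float(w @ np.asarray(avg_abs_err)[order])
```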
E.2.4. BIAS/VARIANCE ANALYSIS FOR ϵ = 0.1

Figure 4 shows boxplots of the bias and variance of each private mechanism over the set of all distributions in Q as n varies and ϵ = 0.1. These are the analogs of the plots in Figure 2 of Section 5 for ϵ = 1. Again notice that as n grows, the IQR of bias tends to decrease for the MICr-Lap and MICr-Geom mechanisms, though this is less pronounced for MICr-Geom compared to the ϵ = 1 case. Here, the bias/variance tradeoff between MICr-Lap (lower bias, higher variance) and MICr-Geom (higher bias, lower variance) is much more apparent, especially at smaller values of n.

Figure 4. Boxplots of bias (top) and variance (bottom) of each private mechanism (over 50 iterations) over all distributions in Q for ϵ = 0.1 and varying n.

E.3. Real Data Experiments

Table 3 shows the median bias and variance for each mechanism over all datasets in each collection for ϵ = 0.1 (analogous to Table 1 from Section 5).

                 MICe-Lap       MICr-Lap       MICr-Geom
Spellman23       0.20 (0.24)    0.19 (0.24)    0.31 (0.05)
Baseball         0.36 (0.24)    0.25 (0.18)    0.33 (0.01)
Spellman4381     0.41 (0.24)    0.03 (0.01)    0.14 (8e-4)

Table 3. The median bias (average signed error wrt MICe over 100 runs) and median variance (over 100 iterations) of each private mechanism across all datasets of each collection for ϵ = 0.1.

Figure 5 shows the bias and variance of each mechanism for the Spellman4381 and Baseball collections at ϵ = 0.1 in the binned setting (analogous to Figure 3 in Section 5). Again notice the bias/variance tradeoff between the MICr-Lap and MICr-Geom mechanisms, which is especially apparent at this level of ϵ in the smaller (n = 337) Baseball datasets.

Figure 5. Bias and variance boxplots for each mechanism over datasets (pairs) in the Spellman4381 collection (left) and Baseball collection (right), binned by non-private MICe score, for ϵ = 0.1.