# Measuring Dependence with Matrix-based Entropy Functional

Shujian Yu^1, Francesco Alesiani^1, Xi Yu^2, Robert Jenssen^3, Jose Principe^2
^1 NEC Laboratories Europe, ^2 University of Florida, ^3 UiT - The Arctic University of Norway
{Shujian.Yu,Francesco.Alesiani}@neclab.eu, {yuxi,principe}@ufl.edu, robert.jenssen@uit.no

Measuring the dependence of data plays a central role in statistics and machine learning. In this work, we summarize and generalize the main idea of existing information-theoretic dependence measures into a higher-level perspective by Shearer's inequality. Based on our generalization, we then propose two measures, namely the matrix-based normalized total correlation and the matrix-based normalized dual total correlation, to quantify the dependence of multiple variables in arbitrary dimensional space, without explicit estimation of the underlying data distributions. We show that our measures are differentiable and statistically more powerful than prevalent ones. We also show the impact of our measures in four different machine learning problems, namely gene regulatory network inference, robust machine learning under covariate shift and non-Gaussian noises, subspace outlier detection, and the understanding of the learning dynamics of convolutional neural networks, to demonstrate their utilities, advantages, and implications for those problems.

Introduction

Measuring the strength of dependence between random variables plays a central role in statistics and machine learning. For the linear dependence case, measures such as Pearson's ρ, Spearman's rank and Kendall's τ are computationally efficient and have been widely used. For the more general case where the two variables share a nonlinear relationship, one of the most well-known dependence measures is the mutual information and its modifications, such as the maximal information coefficient (Reshef et al. 2011).

However, real-world data often contain three or more variables which can exhibit higher-order dependencies. If bivariate measures are used to identify multivariate dependence, wrong conclusions may be drawn. For example, in the XOR gate we have $y = x_1 \oplus x_2$, with $x_1, x_2$ being binary random processes with equal probability. Although $x_1$ and $x_2$ individually are independent of $y$, the full dependence is synergistically contained in the union of $\{x_1, x_2\}$ and $y$. On the other hand, in various practical applications, the observational data or variables of interest lie in a high-dimensional space. Thus, it is desirable to extend the theory of scalar variable dependence to an arbitrary dimension.

Despite tremendous efforts based on the seven postulates (on measures of dependence on pairs of variables) proposed by Alfréd Rényi (Rényi 1959), the problem of measuring dependence (especially in a nonparametric manner) still remains challenging and unsatisfactorily solved (Fernandes and Gloor 2010). This is not hard to understand: most of the existing measures are defined as functionals of a density. Thus, a prerequisite for them is to estimate the underlying data distributions, a notoriously difficult problem in high-dimensional space.
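To make the XOR example above concrete, the following short sketch (our illustration, not an experiment from the paper) estimates discrete mutual information from samples: either input alone carries essentially no information about the output, while the pair carries one full bit.

```python
# A minimal illustration (not from the paper's experiments): in the XOR gate
# y = x1 XOR x2, each input is individually independent of y, yet the pair
# (x1, x2) determines y completely. Any purely bivariate measure misses this.
import numpy as np
from collections import Counter

def discrete_mi(a, b):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in pab.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / (pa[va] / n * pb[vb] / n))
    return mi

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 100_000)
x2 = rng.integers(0, 2, 100_000)
y = x1 ^ x2

print(discrete_mi(x1, y))                 # ~0 bits: x1 alone tells nothing about y
print(discrete_mi(x2, y))                 # ~0 bits
print(discrete_mi(list(zip(x1, x2)), y))  # ~1 bit: the pair determines y
```

Any purely bivariate measure therefore reports independence here, which is exactly the failure mode that motivates measures of higher-order dependence.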
Moreover, current measures primarily focus on two special scenarios: 1) the dependence associated with each dimension of a random vector (e.g., the multivariate maximal correlation (MAC) (Nguyen et al. 2014)); and 2) the dependence between two random vectors (e.g., the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al. 2005)). The former is called multivariate correlation analysis in machine learning, and the latter is commonly referred to as random vector association in statistics.

Our main contributions are multi-fold:

- We provide a unified view of existing information-theoretic dependence measures and illustrate their inner connections. We also generalize the main idea of these measures into a higher-level perspective by Shearer's inequality (Chung et al. 1986).
- Motivated by our generalization, we suggest two measures, namely the matrix-based normalized total correlation ($T_\alpha^*$) and the matrix-based normalized dual total correlation ($D_\alpha^*$), to quantify the dependence of data by making use of the recently proposed matrix-based Rényi's α-entropy functional estimator (Sanchez Giraldo, Rao, and Principe 2014; Yu et al. 2019).[1]
- We show that $T_\alpha^*$ and $D_\alpha^*$ enjoy several appealing properties. First, they are not constrained by the number of variables or the variable dimension. Second, they are differentiable, which makes them suitable to be used as loss functions to train neural networks.
- We show that our measures offer a remarkable performance gain over benchmark methods in applications like gene regulatory network (GRN) inference and subspace outlier detection. They also provide insights into challenging topics like the understanding of the dynamics of learning of Convolutional Neural Networks (CNNs).
- Motivated by (Greenfeld and Shalit 2020), which trains a neural network by encouraging the distribution of the prediction residuals $e$ to be statistically independent of the distribution of the input $x$, we show that our measure (as a loss) improves robust machine learning against shifts of the input distribution (a.k.a. covariate shift (Sugiyama et al. 2008)) and non-Gaussian noises. We also establish the connection between our loss and the minimum error entropy (MEE) criterion (Erdogmus and Principe 2002), a learning principle that has been extensively investigated in signal processing and process control.

[1] Code of our measures and supplementary material of this work are available at: https://bit.ly/AAAI-dependence.

Background Knowledge

Problem Formulation

We consider the problem of estimating the total amount of dependence among the $d_m$-dimensional components of the random variable $\mathbf{y} = [y^1; y^2; \ldots; y^L] \in \mathbb{R}^d$, in which the $m$-th component $y^m \in \mathbb{R}^{d_m}$ and $d = \sum_{m=1}^{L} d_m$. The estimation is based purely on $N$ i.i.d. samples from $\mathbf{y}$, i.e., $\{\mathbf{y}_i\}_{i=1}^{N}$. Usually, we expect the derived statistic to lie between 0 and 1 for improved interpretability (Wang et al. 2017).

Obviously, when $L = 2$, we are dealing with random vector association between $y^1 \in \mathbb{R}^{d_1}$ and $y^2 \in \mathbb{R}^{d_2}$. Notable measures in this category include the HSIC, the Randomized Dependence Coefficient (RDC) (Lopez-Paz, Hennig, and Schölkopf 2013), the Cauchy-Schwarz quadratic mutual information (QMI_CS) (Principe et al. 2000) and the recently developed mutual information neural estimator (MINE) (Belghazi et al. 2018). On the other hand, in the case of $d_i = 1$ ($i = 1, 2, \ldots, L$), the problem reduces to multivariate correlation analysis on each dimension of $\mathbf{y}$. Examples in this category are the multivariate Spearman's ρ (Schmid and Schmidt 2007) and the MAC.
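For concreteness, here is a small sketch (ours) of the data layout assumed in the problem formulation: $N$ i.i.d. samples, each split into $L$ components of possibly different dimensions, which covers both special cases above as limiting configurations.

```python
# A small sketch (ours, for illustration) of the assumed data layout:
# N i.i.d. samples, each split into L components y^m of dimension d_m,
# stored as a list of (N, d_m) arrays.
import numpy as np

N, dims = 500, [3, 2, 1]            # L = 3 components with d_1 = 3, d_2 = 2, d_3 = 1
rng = np.random.default_rng(0)
y = [rng.standard_normal((N, d)) for d in dims]

# Special case 1: L = 2 -> random vector association between y[0] and y[1].
# Special case 2: all d_m = 1 -> multivariate correlation analysis over the
# scalar dimensions of a d-dimensional vector.
```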
Different from the above-mentioned measures, we seek a general measure that is applicable to multiple variables in an arbitrary dimensional space (i.e., without constraints on $L$ and $d_i$). At the same time, we also hope that our measure is interpretable and statistically more powerful than existing counterparts in quantifying either random vector association or multivariate correlation.

A Unified View of Information-Theoretic Measures

From an information-theoretic perspective, a dependence measure $M$ that quantifies how much a random vector $\mathbf{y} = \{y^1; y^2; \ldots; y^L\} \in \mathbb{R}^d$ deviates from statistical independence in each component can take the form of:

\[ M(\mathbf{y}) = \mathrm{diff}\!\left( \Pr\!\left(y^1, y^2, \ldots, y^L\right),\ \prod_{i=1}^{L} \Pr\!\left(y^i\right) \right), \quad (1) \]

where diff(·) refers to a measure of difference such as a divergence or a distance.

If one instantiates diff(·) with the Kullback-Leibler (KL) divergence, Eq. (1) reduces to the renowned Total Correlation (Watanabe 1960):

\[ T(\mathbf{y}) = D_{\mathrm{KL}}\!\left( \Pr\!\left(y^1, y^2, \ldots, y^L\right) \,\middle\|\, \prod_{i=1}^{L} \Pr\!\left(y^i\right) \right) = \sum_{i=1}^{L} H(y^i) - H(y^1, y^2, \ldots, y^L), \quad (2) \]

where $H$ denotes entropy or joint entropy. Most of the existing measures approach multivariate dependence through TC by a decomposition into multiple small variable sets[2] (proof in the supplementary material):

\[ T(\mathbf{y}) = \sum_{i=1}^{L} \left[ H(y^i) - H\!\left(y^i \mid \mathbf{y}^{[i-1]}\right) \right]. \quad (3) \]

In fact, these measures only vary in the way they estimate $H(y^i)$ and $H(y^i \mid \mathbf{y}^{[i-1]})$. For example, multivariate correlation (Joe 1989) and MAC (Nguyen et al. 2014) use Shannon discrete entropy, whereas CMI (Nguyen et al. 2013) resorts to the cumulative entropy (Rao et al. 2004), which can be directly applied to continuous variables. Although such a progressive aggregation strategy helps a measure scale well to high dimensionality, it is sensitive to the ordering of the variables, i.e., Eq. (3) is not permutation invariant. One should note that there are a total of $L!$ possible permutations, which makes the decomposition scheme prone to sub-optimal performance.

There are only a few exceptions that avoid the use of TC. A notable one is the Copula-based Kernel Dependence Measure (C-KDM) (Póczos, Ghahramani, and Schneider 2012), which instantiates diff(·) in Eq. (1) with the Maximum Mean Discrepancy (MMD) (Gretton et al. 2012) and measures the discrepancy between $\Pr(y^1, y^2, \ldots, y^L)$ and $\prod_{i=1}^{L} \Pr(y^i)$ by first taking an empirical copula transform of both distributions. Although C-KDM is theoretically sound and permutation invariant, its value is not upper bounded, which makes it suffer from poor interpretability. Last but not least, the above-mentioned measures can only deal with scalar variables. Thus, the case where each variable is of arbitrary dimension still remains challenging.

Generalization of TC with Shearer's Inequality

One should note that TC is not the only non-negative measure of multivariate dependence. In fact, it can be seen as the simplest member of a large family, all obtained as special cases of an inequality due to Shearer (Chung et al. 1986). Given a set of $L$ random variables $y^1, y^2, \ldots, y^L$, and denoting by $\varphi$ a family of subsets of $[L]$ with the property that every member of $[L]$ lies in at least $k$ members of $\varphi$, Shearer's inequality states that:

\[ H(y^1, y^2, \ldots, y^L) \le \frac{1}{k} \sum_{S \in \varphi} H\!\left(y^i, i \in S\right). \quad (4) \]

[2] Throughout this work, we use $[n] := \{1, 2, \ldots, n\}$ and abbreviate $[n] \setminus \{i\}$ as $[n]\setminus i$. For brevity, we frequently abbreviate the variable set $\{y^1, y^2, \ldots, y^n\}$ with $\mathbf{y}^{[n]}$, and $\{y^1, \ldots, y^{i-1}, y^{i+1}, \ldots, y^n\}$ with $\mathbf{y}^{[n]\setminus i}$.

Obviously, TC (i.e., Eq. (2)) is obtained when $\varphi$ consists of all singletons of $[L]$ (so that $k = 1$): the gap between the two sides of Eq. (4) is then exactly Eq. (2).
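As a worked instance of Eq. (4) (our own illustration), take $L = 3$. Choosing $\varphi$ as the three singletons ($k = 1$) recovers the bound whose gap is the total correlation of Eq. (2); choosing $\varphi$ as the three two-element subsets ($k = 2$) yields a second bound, whose gap is the measure introduced in the next paragraph.

```latex
% Shearer's inequality (Eq. 4) instantiated for L = 3.
% k = 1, \varphi = \{\{1\},\{2\},\{3\}\}: the gap is the total correlation of Eq. (2).
H(y^1, y^2, y^3) \le H(y^1) + H(y^2) + H(y^3)
% k = 2, \varphi = \{\{1,2\},\{1,3\},\{2,3\}\}: rearranged as
% 2 H(y^1,y^2,y^3) \le H(y^1,y^2) + H(y^1,y^3) + H(y^2,y^3),
% its gap is the dual total correlation discussed next.
H(y^1, y^2, y^3) \le \tfrac{1}{2}\left[ H(y^1, y^2) + H(y^1, y^3) + H(y^2, y^3) \right]
```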
Another important inequality arises when we take $\varphi$ to be the family of all $(L-1)$-element subsets of $[L]$ (so that $k = L-1$), in which case Shearer's inequality suggests an alternative non-negative multivariate dependence measure:

\[ D(\mathbf{y}) = \sum_{i=1}^{L} H\!\left(\mathbf{y}^{[L]\setminus i}\right) - (L-1)\, H(y^1, y^2, \ldots, y^L). \quad (5) \]

Eq. (5) is also called the dual total correlation (DTC) (Sun 1975) and has an equivalent form (Austin 2018; Abdallah and Plumbley 2012) (see proof in the supplementary material):

\[ D(\mathbf{y}) = H(y^1, y^2, \ldots, y^L) - \sum_{i=1}^{L} H\!\left(y^i \mid \mathbf{y}^{[L]\setminus i}\right). \quad (6) \]

Shearer's inequality suggests the existence of at least $L-1$ potential mathematical formulas to quantify the dependence of data, obtained by simply taking the gap between its two sides. Although all belong to the same family, these formulas emphasize different parts of the joint distribution and thus cannot simply be replaced by each other (see an illustrative figure in the supplementary material). Finally, one should note that Shearer's inequality is just a loose bound on the sum of partial entropy terms; it has recently been refined further in (Madiman and Tetali 2010). We leave a rigorous treatment of the tighter bound as future work.

Matrix-based Dependence Measure

Our Measures and their Estimation

We exemplify the use of Shearer's inequality in quantifying data dependence with TC and DTC in this work. First, to make TC and DTC more interpretable, i.e., taking values in the interval $[0, 1]$, we normalize both measures as follows:

\[ T^*(\mathbf{y}) = \frac{\left[\sum_{i=1}^{L} H(y^i)\right] - H(y^1, y^2, \ldots, y^L)}{\left[\sum_{i=1}^{L} H(y^i)\right] - \max_i H(y^i)}, \quad (7) \]

\[ D^*(\mathbf{y}) = \frac{\left[\sum_{i=1}^{L} H\!\left(\mathbf{y}^{[L]\setminus i}\right)\right] - (L-1)\, H(y^1, y^2, \ldots, y^L)}{H(y^1, y^2, \ldots, y^L)}. \quad (8) \]

Eqs. (7) and (8) involve entropy estimation in high-dimensional space, which is a notorious problem in statistics and machine learning (Belghazi et al. 2018). Although data discretization or entropy term decomposition has been used before to circumvent the "curse of dimensionality", both have intrinsic limitations. For data discretization, selecting a proper discretization strategy is a tricky problem, and an improper discretization may lead to serious estimation error. For entropy term decomposition, the resulting measure is no longer permutation invariant.

Different from earlier efforts, we introduce the recently proposed matrix-based Rényi's α-entropy functional (Sanchez Giraldo, Rao, and Principe 2014; Yu et al. 2019), which evaluates entropy terms in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS), without explicit evaluation of the underlying data distributions. For brevity, we directly give the following definitions.

Definition 1. (Sanchez Giraldo, Rao, and Principe 2014) Let $\kappa: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$ be a real-valued positive definite kernel that is also infinitely divisible (Bhatia 2006). Given $Y = \{y_1, y_2, \ldots, y_N\}$, where the subscript $i$ denotes the exemplar index, and the Gram matrix $K$ obtained from evaluating the positive definite kernel $\kappa$ on all pairs of exemplars, that is $(K)_{ij} = \kappa(y_i, y_j)$, a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix $A$ of size $N \times N$, such that $\mathrm{tr}(A) = 1$, can be given by the following functional:

\[ S_\alpha(A) = \frac{1}{1-\alpha} \log_2\!\left( \mathrm{tr}(A^\alpha) \right) = \frac{1}{1-\alpha} \log_2\!\left( \sum_{i=1}^{N} \lambda_i(A)^\alpha \right), \quad (9) \]

where $A_{ij} = \frac{1}{N} \frac{K_{ij}}{\sqrt{K_{ii} K_{jj}}}$ and $\lambda_i(A)$ denotes the $i$-th eigenvalue of $A$.
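A minimal NumPy sketch (our own reading of Definition 1, not the authors' released code linked in the footnote) of the estimator in Eq. (9): build a Gram matrix with an infinitely divisible kernel such as the Gaussian, normalize it so that tr(A) = 1, and evaluate the α-entropy from its eigenspectrum. The kernel width `sigma` is a hyperparameter chosen arbitrarily here.

```python
# A minimal sketch of the matrix-based Renyi alpha-entropy of Eq. (9).
# This is our illustrative reading of Definition 1, not the authors' code.
import numpy as np

def gram_matrix(x, sigma=1.0):
    """Gaussian (infinitely divisible) kernel Gram matrix of samples x: (N, d)."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def normalize_gram(K):
    """A_ij = (1/N) * K_ij / sqrt(K_ii * K_jj), so that tr(A) = 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d) / K.shape[0]

def renyi_entropy(A, alpha=1.01):
    """S_alpha(A) = 1/(1-alpha) * log2( sum_i lambda_i(A)^alpha )."""
    lam = np.linalg.eigvalsh(A)
    lam = np.clip(lam, 0, None)      # guard against tiny negative eigenvalues
    return np.log2(np.sum(lam ** alpha)) / (1 - alpha)
```

For a single variable whose $N$ observations are stacked in an (N, d) array Y, the estimate is `renyi_entropy(normalize_gram(gram_matrix(Y)), alpha=1.01)`.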
Definition 2. (Yu et al. 2019) Given a collection of $N$ samples $\{s_i = (y^1_i, y^2_i, \ldots, y^L_i)\}_{i=1}^{N}$, where each sample contains $L$ ($L \ge 2$) measurements $y^1 \in \mathcal{Y}_1, y^2 \in \mathcal{Y}_2, \ldots, y^L \in \mathcal{Y}_L$ obtained from the same realization, and given positive definite kernels $\kappa_1: \mathcal{Y}_1 \times \mathcal{Y}_1 \mapsto \mathbb{R}$, $\kappa_2: \mathcal{Y}_2 \times \mathcal{Y}_2 \mapsto \mathbb{R}$, $\ldots$, $\kappa_L: \mathcal{Y}_L \times \mathcal{Y}_L \mapsto \mathbb{R}$, a matrix-based analogue to Rényi's α-order joint entropy among the $L$ variables can be defined as:

\[ S_\alpha\!\left(A^{[L]}\right) = S_\alpha\!\left( \frac{A_1 \circ A_2 \circ \cdots \circ A_L}{\mathrm{tr}\!\left(A_1 \circ A_2 \circ \cdots \circ A_L\right)} \right), \quad (10) \]

where $(A_1)_{ij} = \kappa_1(y^1_i, y^1_j)$, $(A_2)_{ij} = \kappa_2(y^2_i, y^2_j)$, $\ldots$, $(A_L)_{ij} = \kappa_L(y^L_i, y^L_j)$, and $\circ$ denotes the Hadamard product.

Based on the above definitions, we propose a pair of measures, namely the matrix-based normalized total correlation (denoted by $T_\alpha^*$) and the matrix-based normalized dual total correlation (denoted by $D_\alpha^*$):

\[ T_\alpha^*(\mathbf{y}) = \frac{\left[\sum_{i=1}^{L} S_\alpha(A_i)\right] - S_\alpha\!\left(A^{[L]}\right)}{\left[\sum_{i=1}^{L} S_\alpha(A_i)\right] - \max_i S_\alpha(A_i)}, \quad (11) \]

\[ D_\alpha^*(\mathbf{y}) = \frac{\left[\sum_{i=1}^{L} S_\alpha\!\left(A^{[L]\setminus i}\right)\right] - (L-1)\, S_\alpha\!\left(A^{[L]}\right)}{S_\alpha\!\left(A^{[L]}\right)}. \quad (12) \]

As can be seen, both $T_\alpha^*$ and $D_\alpha^*$ are independent of the specific dimensions of $y^1, y^2, \ldots, y^L$ and avoid estimation of the underlying data distributions, which makes them suitable for data with either discrete or continuous distributions. Moreover, it is simple to verify that both $T_\alpha^*$ and $D_\alpha^*$ are invariant to permutations of the ordering of the variables.

Properties and Observations of $T_\alpha^*$ and $D_\alpha^*$

We present more useful properties and observations of $T_\alpha^*$ and $D_\alpha^*$. In particular, we want to underscore that they are differentiable and can be used as loss functions to train neural networks. Note that, when $L = 2$, both $T_\alpha^*$ and $D_\alpha^*$ reduce to the matrix-based normalized mutual information, which we denote by $I_\alpha^*$. See the supplementary material for proofs and additional supporting results.

Property 1. $0 \le T_\alpha^* \le 1$ and $0 \le D_\alpha^* \le 1$.

Remark. A major difference between our $T_\alpha^*$ and $D_\alpha^*$ and other measures is that this boundedness is satisfied with a finite number of realizations. An interesting and rather unfortunate fact is that, although the statistics of many measures satisfy this desired property, their corresponding estimators hardly follow it (Seth and Príncipe 2012).

Property 2. $T_\alpha^*$ and $D_\alpha^*$ reduce to zero iff $y^1, y^2, \ldots, y^L$ are independent.

Property 3. $T_\alpha^*$ and $D_\alpha^*$ have analytical gradients and are automatically differentiable.

Remark. This property complements the theory of the matrix-based Rényi's α-entropy functional (Sanchez Giraldo, Rao, and Principe 2014; Yu et al. 2019), as it opens the door to challenging machine learning problems involving neural networks (as will be illustrated later in this work).

Property 4. The computational complexities of $T_\alpha^*$ and $D_\alpha^*$ are respectively $O(LN^2) + O(N^3)$ and $O(LN^3)$, and grow linearly with the number of variables $L$.

Remark. In the case of $L = 2$, both $T_\alpha^*$ and $D_\alpha^*$ cost $O(N^3)$ in time. As a reference, the computational complexity of HSIC is between $O(N^2)$ and $O(N^3)$ (Zhang et al. 2018). However, HSIC only applies to two variables and is not upper bounded. We leave reducing the complexity as future work, but initial exploration results, shown in the supplementary material, suggest that the complexity can be reduced by averaging the estimated quantity over multiple random subsamples of size $K \ll N$.
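Continuing the sketch above (again our illustration, reusing `gram_matrix`, `normalize_gram` and `renyi_entropy`), Eqs. (10)-(12) translate almost literally into code: the joint entropy uses the trace-normalized Hadamard product of the Gram matrices, and $T_\alpha^*$ and $D_\alpha^*$ are the normalized gaps.

```python
# Sketch (ours) of Eqs. (10)-(12) on top of the helpers defined above.
import numpy as np

def joint_entropy(grams, alpha=1.01):
    """S_alpha of the trace-normalized Hadamard product of Gram matrices (Eq. 10)."""
    H = np.ones_like(grams[0])
    for K in grams:
        H = H * K                        # Hadamard product
    return renyi_entropy(H / np.trace(H), alpha)

def T_star(variables, alpha=1.01, sigma=1.0):
    """Matrix-based normalized total correlation, Eq. (11)."""
    grams = [gram_matrix(v, sigma) for v in variables]
    marg = [renyi_entropy(normalize_gram(K), alpha) for K in grams]
    joint = joint_entropy(grams, alpha)
    return (sum(marg) - joint) / (sum(marg) - max(marg))

def D_star(variables, alpha=1.01, sigma=1.0):
    """Matrix-based normalized dual total correlation, Eq. (12)."""
    L = len(variables)
    grams = [gram_matrix(v, sigma) for v in variables]
    joint = joint_entropy(grams, alpha)
    leave_one_out = [joint_entropy([grams[j] for j in range(L) if j != i], alpha)
                     for i in range(L)]
    return (sum(leave_one_out) - (L - 1) * joint) / joint
```

Here `variables` is a list of (N, d_m) arrays holding the $N$ paired samples of the $L$ components; the $O(N^3)$ eigendecompositions dominate the cost, which is exactly what the subsampling remark above targets.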
Observation 1. $T_\alpha^*$ and $D_\alpha^*$ are statistically more powerful than prevalent random vector association measures, like HSIC, dCov, KCCA and QMI_CS, in identifying independence and discovering complex patterns between $y^1$ and $y^2$.

We made this observation with the same test data as used in (Josse and Holmes 2016; Gretton et al. 2008). The first test data set is generated as follows (Gretton et al. 2008). First, we generate $N$ i.i.d. samples from two randomly picked densities in the ICA benchmark densities (Bach and Jordan 2002). Second, we mix these random variables using a rotation matrix parameterized by an angle $\theta$, varying from 0 to $\pi/4$. Third, we add $d-1$ extra dimensions of Gaussian noise with zero mean and unit standard deviation to each of the mixtures. Finally, we multiply each resulting vector by an independent random $d$-dimensional orthogonal matrix. The resulting random vectors are dependent across all observed dimensions.

The second test data set is generated as follows (Székely et al. 2007). A matrix $Y^1 \in \mathbb{R}^{N \times 5}$ is generated from a multivariate Gaussian distribution with an identity covariance matrix. Then, another matrix $Y^2 \in \mathbb{R}^{N \times 5}$ is generated as $Y^2_{ml} = Y^1_{ml}\,\epsilon_{ml}$, $m = 1, 2, \ldots, N$, $l = 1, 2, \ldots, 5$, where the $\epsilon_{ml}$ are standard normal variables independent of $Y^1$.

Figure 1: Power test against prevalent random vector association measures ($I_\alpha^*$ with $\alpha = 1.01$, QMI_CS, dCor, KCCA, HSIC): (a) decaying dependence, acceptance rate of $H_0$ versus the rotation angle $\theta$; (b) non-monotonic dependence, detection rate of $H_1$ versus the number of samples. Our measure is the most powerful one for $\alpha = 1.01$ and other values (see supplementary material).

For each test data set, we compare all measures against a threshold computed by sampling a surrogate of the null hypothesis $H_0$, obtained by shuffling the samples in $y^2$ 100 times. That is, the correspondences between $y^1$ and $y^2$ are broken by random permutations. The threshold is the estimated quantile $1-\tau$, where $\tau$ is the significance level of the test (Type I error). If the estimated measure is larger than the computed threshold, we reject the null hypothesis and argue the existence of an association between $y^1$ and $y^2$, and vice versa. We repeated the above procedure for 500 independent trials. Fig. 1 shows the averaged acceptance rate of the null hypothesis $H_0$ (in test data I, with respect to different rotation angles $\theta$) and the averaged detection rate of the alternative hypothesis $H_1$ (in test data II, with respect to different numbers of samples). Intuitively, in the first test data set, a zero angle means the data are independent, while dependence becomes easier to detect as the angle increases to $\pi/4$. Therefore, a desirable measure is expected to have an acceptance rate of $H_0$ near 1 at $\theta = 0$, with the rate decaying rapidly as $\theta$ increases. In the second test data set, a desirable measure is expected to always have a large detection rate of $H_1$ regardless of the number of samples.

Observation 2. $T_\alpha^*$ and $D_\alpha^*$ are more interpretable than their multivariate correlation counterparts in quantifying the dependence across the dimensions of $\mathbf{y} = \{y^1, y^2, \ldots, y^d\} \in \mathbb{R}^d$.

This observation was made by comparing $T_\alpha^*$ and $D_\alpha^*$ against three popular multivariate correlation measures: the multivariate Spearman's ρ, C-KDM and IDD (Romano et al. 2016). Fig. 2 shows the average value of the analyzed measures on the following relationships induced on $d \in [1, 9]$ and $n = 1000$ points:

Data A: The first dimension $y^1$ is uniformly distributed in $[0, 1]$, and $y^i = (y^1)^i$ for $i = 2, 3, \ldots, d$. The total dependence should be 1, because $\{y^2, y^3, \ldots, y^d\}$ depend (nonlinearly) only on $y^1$.

Data B: There is a functional relationship between $y^1$ and the remaining dimensions: $y^1 = \left(\frac{1}{d-1}\sum_{i=2}^{d} y^i\right)^2$, where $\{y^2, y^3, \ldots, y^d\}$ are uniformly and independently distributed. In this case, the strength of the overall dependence should decrease as the dimension increases.
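A sketch (ours, reconstructed from the description above) of the two synthetic relationships behind Observation 2; each column is then treated as one scalar variable and scored with, e.g., the `T_star` helper sketched earlier.

```python
# Synthetic data for Observation 2 (our re-creation from the description).
import numpy as np

def data_A(n=1000, d=5, seed=0):
    """y1 ~ U[0,1]; y_i = (y1)^i for i = 2..d, so the total dependence should be 1."""
    rng = np.random.default_rng(seed)
    y1 = rng.uniform(0, 1, n)
    return np.column_stack([y1 ** i for i in range(1, d + 1)])

def data_B(n=1000, d=5, seed=0):
    """y1 = (mean of y2..yd)^2 with y2..yd independent U[0,1];
    the overall dependence should weaken as d grows."""
    rng = np.random.default_rng(seed)
    rest = rng.uniform(0, 1, (n, d - 1))
    return np.column_stack([rest.mean(axis=1) ** 2, rest])

# Each column is one scalar variable, e.g.
# T_star([col[:, None] for col in data_A(d=5).T], alpha=1.01)
```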
Figure 2: Raw measure scores ($T_\alpha^*$ and $D_\alpha^*$ with $\alpha = 1.01$, C-KDM, multivariate Spearman's ρ, and the expected value) on synthetic data with different relationships, as a function of the number of variables.

Table 1: The four dependence scenarios ($d_i = 1$ versus $d_i \in \mathbb{Z}^+$), popular measures in each scenario (e.g., HSIC, dCov, QMI_CS; C-KDM, IDD), and potential applications (GRN inference, outlier detection, understanding CNNs).

Machine Learning Applications

We present four solid machine learning applications to demonstrate the utility and superiority of our proposed matrix-based normalized total correlation ($T_\alpha^*$) and matrix-based normalized dual total correlation ($D_\alpha^*$). The applications include gene regulatory network (GRN) inference, robust machine learning under covariate shift and non-Gaussian noises, subspace outlier detection, and the understanding of the dynamics of learning of CNNs. The logic behind the organization of these applications is shown in Table 1. We want to emphasize that the use of normalization depends on the priority given to interpretability. For example, when the measure is employed as a loss function, the normalization does not contribute to performance. However, when we use it to quantify information flow or neural interactions in CNNs, a bounded value is preferred.

Gene Regulatory Network Inference

Gene expressions form a rich source of data from which to infer a GRN, a sparse graph in which the nodes are genes and their regulators, and the edges are regulatory relationships between the nodes. In the first application, we resort to the DREAM4 challenge (Marbach et al. 2012) data set for reconstructing GRNs. There are 5 networks (Net.) in the in-silico (simulated) version of this data set, each containing expressions for 10 genes with 136 data points. The goal is to reconstruct the true network based on pairwise dependence between genes. We compared five test statistics (Pearson's ρ, mutual information with respectively a bin estimator and the KSG estimator (Kraskov, Stögbauer, and Grassberger 2004), the maximal information coefficient (MIC) (Reshef et al. 2011), and $I_\alpha^*$), and quantitatively evaluated reconstruction quality by the Area Under the ROC Curve (AUC). Table 2 clearly indicates our superior performance.

Table 2: GRN inference results (AUC score) on the DREAM4 challenge. The first and second best performances are in bold and underlined, respectively.

Data     ρ      MI (bin)   MI (KSG)   MIC    I*α
Net. 1   0.62   0.59       0.74       0.75   0.78
Net. 2   0.52   0.58       0.76       0.74   0.87
Net. 3   0.44   0.61       0.83       0.76   0.84
Net. 4   0.45   0.60       0.75       0.75   0.75
Net. 5   0.38   0.61       0.88       0.89   0.97

Robust Machine Learning

Robust machine learning under domain shift (Quionero-Candela et al. 2009) has attracted increasing attention in recent years. This is justified because the success of deep learning models is highly dependent on the assumption that the training and testing data are i.i.d. and sampled from the same distribution. Unfortunately, data in reality are typically collected from different but related domains (Wilson and Cook 2020), and can be corrupted (Chen et al. 2016b).

Let $(x, y)$ be a pair of random variables with $x \in \mathbb{R}^p$ and $y \in \mathbb{R}$ (in regression) or $y \in \mathbb{R}^q$ (in classification), such that $x$ denotes the input instance and $y$ denotes the desired signal. We assume $x$ and $y$ follow a joint distribution $P_{\mathrm{source}}(x, y)$. Our goal is, given training samples drawn from $P_{\mathrm{source}}(x, y)$, to learn a model $f$ predicting $y$ from $x$ that works well on a different, a-priori unknown target distribution $P_{\mathrm{target}}(x, y)$.
We consider here only covariate shift, under which the conditional label distribution is invariant (i.e., $P_{\mathrm{target}}(y|x) = P_{\mathrm{source}}(y|x)$) but the marginal distributions of the input differ between source and target domains (i.e., $P_{\mathrm{target}}(x) \neq P_{\mathrm{source}}(x)$). On the other hand, we also assume that $y$ (in the source domain) may be contaminated with non-Gaussian noise (i.e., $\tilde{y} = y + \epsilon$). We focus on a fully unsupervised setting, in which we have no access to any samples of $x$ or $y$ from the target domain, i.e., source-to-target manifold alignment becomes burdensome.

Our work in this section is directly motivated by (Greenfeld and Shalit 2020), which introduces the criterion of minimizing the dependence between the input $x$ and the prediction error $e = y - f(x)$ to circumvent covariate shift, and uses HSIC as the measure to quantify independence. We provide two contributions over (Greenfeld and Shalit 2020). In terms of methodology, we show that by replacing HSIC with our new measure (i.e., $I_\alpha^*$), we improve the prediction accuracy in the target domain. Theoretically, we show that our new loss, namely $\min I_\alpha^*(x; e)$, is robust not only against covariate shift but also against non-Gaussian noise on $y$, based on Theorem 1.

Theorem 1. Minimizing the (normalized) mutual information $I(x; e)$ is equivalent to minimizing the error entropy $H(e)$.

Remark. The minimum error entropy (MEE) criterion (Erdogmus and Principe 2002) has been extensively studied in signal processing to address non-Gaussian noise, with both theoretical guarantees and empirical evidence (Chen et al. 2009, 2016a). We summarize in the supplementary material two insights that further clarify its advantage.

Learning under covariate shift. We first compare the performances of the cross entropy (CE) loss and the HSIC loss with our error entropy loss $H_\alpha(e)$ and our $I_\alpha^*(x; e)$ loss under covariate shift. Following (Greenfeld and Shalit 2020), the source data is the Fashion-MNIST dataset (Xiao, Rasul, and Vollgraf 2017), and images rotated by an angle $\theta$ sampled from a uniform distribution over $[-20^\circ, 20^\circ]$ constitute the target data. The neural network architecture is set as follows: 2 convolutional layers (with, respectively, 16 and 32 filters of size $5 \times 5$) and 1 fully connected layer. We add batch normalization and a max-pooling layer after each convolutional layer. We choose ReLU activations, a batch size of 128 and the Adam optimizer (Kingma and Ba 2014). For $I_\alpha^*(x; e)$ and $H_\alpha(e)$, we set $\alpha = 2$. For the HSIC loss, we take the same hyper-parameters as in (Greenfeld and Shalit 2020). The results are summarized in Table 3. Our $H_\alpha(e)$ performs comparably to HSIC, but our $I_\alpha^*(x; e)$ improves performance in both the source and target domains.

Table 3: Test accuracy (%) on Fashion-MNIST.

Method       Source           Target
CE           90.90 ± 0.002    73.73 ± 0.086
HSIC         91.03 ± 0.003    76.56 ± 0.034
Hα(e)        91.10 ± 0.013    75.48 ± 0.069
I*α(x; e)    91.17 ± 0.040    76.79 ± 0.040
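Property 3 is what makes this training scheme possible: the eigenvalues of the normalized Gram matrices are differentiable functions of the network output, so $I_\alpha^*(x; e)$ can be minimized by backpropagation. Below is a minimal PyTorch sketch of such a loss for $L = 2$ (our own hedged implementation following the $T_\alpha^*$ normalization of Eq. (11); the Gaussian kernel width is an arbitrary placeholder, not the paper's exact setting).

```python
# A minimal differentiable sketch (ours) of the I*_alpha(x; e) loss for L = 2.
import torch

def _normalized_gram(z, sigma=1.0):
    """Gaussian Gram matrix, scaled so that its trace is 1 (Definition 1)."""
    sq = torch.cdist(z, z) ** 2
    K = torch.exp(-sq / (2 * sigma ** 2))
    return K / K.shape[0]                 # diag of K is 1, so tr = 1 after dividing by N

def _renyi_entropy(A, alpha=2.0):
    lam = torch.linalg.eigvalsh(A).clamp(min=0)
    return torch.log2((lam ** alpha).sum()) / (1 - alpha)

def mutual_information_loss(x, e, alpha=2.0, sigma=1.0):
    """Normalized matrix-based mutual information, following Eq. (11) for L = 2."""
    Ax, Ae = _normalized_gram(x, sigma), _normalized_gram(e, sigma)
    Hx, He = _renyi_entropy(Ax, alpha), _renyi_entropy(Ae, alpha)
    J = Ax * Ae                            # Hadamard product
    Hxe = _renyi_entropy(J / torch.diagonal(J).sum(), alpha)
    return (Hx + He - Hxe) / torch.minimum(Hx, He)

# Typical use: flatten the inputs to (batch, features), compute the residual
# e = y - model(x), and minimize mutual_information_loss(x.flatten(1), e).
```

Gradients through eigendecompositions can be numerically delicate when eigenvalues nearly coincide, so adding a small jitter to the Gram diagonal is a common safeguard.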
Learning in noisy environments. We select the widely used bike sharing data set (Fanaee-T and Gama 2014) from the UCI repository, in which the task is to predict the number of hourly bike rentals based on the following features: holiday, weekday, workingday, weathersit, temperature, feeling temperature, wind speed and humidity. Consisting of 17,379 samples, the data were collected over two years and can be partitioned by year and season. Early studies suggest that this data set contains covariate shift due to the change of time (Subbaswamy, Schulam, and Saria 2019). We use the samples of the first three seasons as source data and the fourth-season samples as target data. The model of choice is a multi-layer perceptron (MLP) with three hidden layers of size 100, 100 and 10, respectively. We compare our $I_\alpha^*(x; e)$ and $H_\alpha(e)$ with the mean square error (MSE), mean absolute error (MAE) and HSIC losses, assuming $y$ is contaminated with additive noise as $\tilde{y} = y + \epsilon$. We consider two common non-Gaussian noises with the noise level controlled by a parameter $\rho$: the Laplace noise $\epsilon \sim \mathrm{Laplace}(0, \rho)$, and the shifted exponential noise $\epsilon = \rho(1 - \eta)$ with $\eta \sim \exp(1)$. We use a batch size of 32 and the Adam optimizer.

Fig. 3 shows the averaged performance gain (or loss) of the different loss functions over the MSE loss across 10 independent runs. In most cases, $I_\alpha^*(x; e)$ improves the most. HSIC is not robust to Laplacian noise, whereas MAE performs poorly under shifted exponential noise. On the other hand, $H_\alpha(e)$ also obtains a consistent performance gain over MSE, which further corroborates our theoretical arguments.

Figure 3: Comparisons of models trained with the MSE, MAE, HSIC, $I_\alpha^*(x; e)$ and $H_\alpha(e)$ losses under (a) Laplacian and (b) shifted exponential noise at different noise levels $\rho$. Each bar denotes the relative improvement (RI) in prediction accuracy over MSE.

Subspace Outlier Detection

Our third application is outlier detection, in which we aim to identify data objects that do not fit well with the general data distribution (in $\mathbb{R}^d$). Although diverse paradigms, such as density-based methods (Breunig et al. 2000) and distance-based methods (Bay and Schwabacher 2003), have been developed, they usually suffer from the notorious "curse of dimensionality" (Keller, Müller, and Böhm 2012). In fact, the principle of concentration of distances (Beyer et al. 1999) reveals that, for a query point $p$, its relative distance (or contrast) to the farthest and the nearest point converges to 0 as the dimensionality $d$ increases:

\[ \lim_{d \to \infty} \frac{D_{\max} - D_{\min}}{D_{\min}} \to 0. \quad (13) \]

This means that the discriminative power between the nearest and the farthest neighbor becomes rather poor in high-dimensional space. On the other hand, real data often contain irrelevant attributes or noise. This further degrades the performance of most existing outlier detection methods when the outliers are hidden in subspaces of the given attributes (Kriegel, Schubert, and Zimek 2008). Therefore, subspace methods that explore lower-dimensional subspaces in order to discover outliers provide a promising avenue. Empirical evidence suggests that the larger the deviation of a subspace from mutual independence across its dimensions, the easier it tends to be to distinguish outliers from normal observations (Müller et al. 2009). Therefore, measuring the total amount of dependence of a subspace becomes a pivotal step. To this end, we plug our dependence measure (either $T_\alpha^*$ or $D_\alpha^*$) into a commonly used Apriori subspace search scheme (Nguyen et al. 2013) to assess the quality of each subspace (the larger the better). Next, we detect outliers with the widely used Local Outlier Factor (LOF) method (Breunig et al. 2000) on the top 10 subspaces with the highest dependence score.
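A simplified sketch (ours) of the pipeline just described: candidate subspaces are grown bottom-up, scored with $T_\alpha^*$ (the larger the better), and LOF is run on the highest-scoring subspaces. The actual Apriori scheme of Nguyen et al. (2013) includes pruning rules omitted here, and the beam size plays the role of the "top 10 subspaces".

```python
# Simplified sketch (ours) of dependence-guided subspace search + LOF.
from itertools import combinations
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def search_subspaces(X, score_fn, max_dim=4, beam=10):
    """Bottom-up search: extend the best `beam` subspaces by one attribute at a time."""
    d = X.shape[1]
    best = sorted(((score_fn(X[:, list(s)]), s) for s in combinations(range(d), 2)),
                  reverse=True)[:beam]
    for _ in range(3, max_dim + 1):
        cands = {tuple(sorted(set(s) | {j}))
                 for _, s in best for j in range(d) if j not in s}
        scored = [(score_fn(X[:, list(c)]), c) for c in cands]
        best = sorted(best + scored, reverse=True)[:beam]
    return [s for _, s in best]

def subspace_outlier_scores(X, score_fn):
    """Aggregate LOF outlier scores over the top-scoring subspaces."""
    scores = np.zeros(X.shape[0])
    for s in search_subspaces(X, score_fn):
        lof = LocalOutlierFactor(n_neighbors=20).fit(X[:, list(s)])
        scores += -lof.negative_outlier_factor_     # larger = more outlying
    return scores

# score_fn can be, e.g.,
# lambda S: T_star([S[:, [j]] for j in range(S.shape[1])], alpha=1.01)
```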
We again use the AUC to quantitatively evaluate the outlier detection results of our method against three competitors: 1) LOF in the full space; 2) Feature Bagging (FB) (Lazarevic and Kumar 2005), which applies LOF on randomly selected subspaces; and 3) LOF on subspaces generated by IDD. We omit the results of LOF on the subspaces generated by C-KDM due to its relatively poor performance. We test on 5 publicly available data sets from the Outlier Detection Data Sets (ODDS) library (Rayana 2016). These data sets cover a wide range of numbers of samples ($N$) and data dimensionalities ($d$). The prevalence of anomalies ranges from 1.65% (in Speech) to 35% (in BreastW). The results are reported in Table 4.

Table 4: Outlier detection results (AUC score) on real data. The first and second best performances are in bold and underlined, respectively.

Data Set (N × d)       LOF    FB     IDD    T*α    D*α
Diabetes (568 × 8)     0.68   0.63   0.55   0.68   0.68
BreastW (683 × 9)      0.46   0.53   0.76   0.71   0.71
Cardio (1831 × 21)     0.68   0.65   0.62   0.75   0.70
Musk (3062 × 166)      0.42   0.40   0.67   0.73   0.83
Speech (3686 × 400)    0.36   0.38   0.50   0.58   0.54

As can be seen, our $T_\alpha^*$ and $D_\alpha^*$ achieve a remarkable performance gain over LOF on all attributes, especially when $d$ is large. Under a Wilcoxon signed rank test (Demšar 2006) with a 0.05 significance level, both $T_\alpha^*$ and $D_\alpha^*$ significantly outperform LOF. This observation corroborates our motivation of reliable subspace search. By contrast, the random subspace selection scheme in FB does not show an obvious advantage, and the quality of the subspaces generated by IDD is lower than ours.

Understanding the Dynamics of Learning of CNNs

Understanding the dynamics of learning of deep neural networks (especially CNNs) has received increasing attention in recent years (Shwartz-Ziv and Tishby 2017; Saxe et al. 2018). From an information-theoretic perspective, most studies aim to unveil fundamental properties associated with the dynamics of learning of CNNs by monitoring the mutual information between pairs of layers across training epochs (Yu et al. 2020). Different from this layer-level dependence, we provide here an alternative way to quantitatively analyze the dynamics of learning at the feature level. Specifically, suppose there are $N_t$ feature maps in the $t$-th convolutional layer, denoted by $C_1, C_2, \ldots, C_{N_t}$. We use two quantities to capture the dependence among feature maps: 1) the pairwise dependence between the $i$-th and the $j$-th feature map (i.e., $I_\alpha^*(C_i; C_j)$); and 2) the total dependence among all feature maps (i.e., $T_\alpha^*(C_1, C_2, \ldots, C_{N_t})$).

We train a standard VGG-16 (Simonyan and Zisserman 2015) on CIFAR-10 (Krizhevsky 2009) with the SGD optimizer from scratch. The $T_\alpha^*$ in different layers across training epochs is illustrated in Fig. 4. There is an obvious increasing trend for $T_\alpha^*$ in all layers during training, i.e., the total amount of dependence amongst all feature maps continuously increases as training proceeds, until approaching a value of nearly 1. A similar observation is also made with HSIC. Note that the co-adaptation phenomenon has also been observed in fully connected layers and eventually inspired Dropout (Hinton et al. 2012).

Figure 4: $T_\alpha^*$ across training epochs for different convolutional layers (VGG-16 layer ID on the x-axis).
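A sketch (ours, reusing the NumPy helpers from the earlier snippets) of how the two per-layer quantities could be computed from a batch of feature maps: each feature map $C_i$ is flattened into one (N, HW) variable, a Gram matrix is built per map, and $I_\alpha^*$ / $T_\alpha^*$ are evaluated across maps. The kernel width is again an arbitrary placeholder.

```python
# Sketch (ours) of the per-layer quantities: pairwise I*_alpha between feature
# maps and T*_alpha over all feature maps of one convolutional layer.
# Assumes gram_matrix, normalize_gram and renyi_entropy from the earlier sketch.
import numpy as np

def layer_dependence(feature_maps, alpha=1.01, sigma=1.0):
    """feature_maps: array of shape (N, C, H, W) for one layer and one batch."""
    N, C = feature_maps.shape[:2]
    variables = [feature_maps[:, c].reshape(N, -1) for c in range(C)]  # one (N, HW) variable per map
    grams = [normalize_gram(gram_matrix(v, sigma)) for v in variables]
    marg = [renyi_entropy(A, alpha) for A in grams]

    # Pairwise normalized mutual information I*_alpha(C_i; C_j)
    pairwise = np.zeros((C, C))
    for i in range(C):
        for j in range(i + 1, C):
            J = grams[i] * grams[j]
            joint = renyi_entropy(J / np.trace(J), alpha)
            pairwise[i, j] = (marg[i] + marg[j] - joint) / min(marg[i], marg[j])

    # Total dependence T*_alpha(C_1, ..., C_C), Eq. (11)
    J = grams[0].copy()
    for A in grams[1:]:
        J = J * A
    total = (sum(marg) - renyi_entropy(J / np.trace(J), alpha)) / (sum(marg) - max(marg))
    return pairwise, total
```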
Fig. 5 shows the histogram of $I_\alpha^*$ in each layer. Similar to the general trend of $T_\alpha^*$, we observe that the most frequent values of $I_\alpha^*$ change from nearly 0 to nearly 1. Moreover, this movement occurs much earlier in lower layers than in upper layers. This behavior is in line with (Raghu et al. 2017), which states that neural networks first train and stabilize lower layers and then move on to upper layers.

Figure 5: The histogram of $I_\alpha^*$ (in log-scale) averaged (a) from the 1st to the 7th convolutional layer; and (b) from the 8th to the 13th convolutional layer. Feature maps reach high dependence within less than 20 epochs of training in lower layers, but need more than 100 epochs in upper layers.

Conclusion

We suggest two measures to quantify, from data, the dependence of multiple variables with arbitrary dimensions. Distinct from previous efforts, our measures avoid the estimation of the data distributions and are applicable to all dependence scenarios (for i.i.d. data). The proposed measures more easily (e.g., with less data) identify independence and discover complex dependence patterns. Moreover, their differentiability enables us to design new loss functions for training neural networks. In terms of specific applications, we demonstrated that the new loss $\min I_\alpha^*(x; e)$ is robust against both covariate shift and non-Gaussian noises. We also provided an alternative way to analyze the dynamics of learning of CNNs based on the dependence amongst feature maps, and obtained meaningful observations. In the future, we will explore other properties of our measures. We are interested in applying them to other challenging problems, such as disentangled representation learning with variational autoencoders (VAEs) (Kingma and Welling 2014). We also performed a preliminary investigation of a new robust loss, termed the deep deterministic information bottleneck (DIB), in the supplementary material.

Acknowledgements

This work was funded in part by the Research Council of Norway grant no. 309439 SFI Visual Intelligence and grant no. 302022 DEEPehr, and in part by the U.S. ONR under grant N00014-18-1-2306 and DARPA under grant FA945318-1-0039.

References

Abdallah, S. A.; and Plumbley, M. D. 2012. A measure of statistical complexity based on predictive information with application to finite spin systems. Physics Letters A 376(4): 275-281.
Austin, T. 2018. Multi-variate correlation and mixtures of product measures. arXiv preprint arXiv:1809.10272.
Bach, F. R.; and Jordan, M. I. 2002. Kernel independent component analysis. Journal of Machine Learning Research 3: 1-48.
Bay, S. D.; and Schwabacher, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In ACM SIGKDD, 29-38.
Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; and Hjelm, D. 2018. Mutual information neural estimation. In ICML, 531-540.
Beyer, K.; Goldstein, J.; Ramakrishnan, R.; and Shaft, U. 1999. When is nearest neighbor meaningful? In International Conference on Database Theory, 217-235. Springer.
Bhatia, R. 2006. Infinitely divisible matrices. The American Mathematical Monthly 113(3): 221-235.
Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. LOF: identifying density-based local outliers. In ACM SIGMOD, 93-104.
Chen, B.; Hu, J.-C.; Zhu, Y.; and Sun, Z.-Q. 2009. Information theoretic interpretation of error criteria. Acta Automatica Sinica 35(10): 1302-1309.
Chen, B.; Xing, L.; Xu, B.; Zhao, H.; and Principe, J. C. 2016a. Insights into the robustness of minimum error entropy estimation. IEEE Transactions on Neural Networks and Learning Systems 29(3): 731-737.
Chen, L.; Qu, H.; Zhao, J.; Chen, B.; and Principe, J. C. 2016b. Efficient and robust deep learning with correntropy-induced loss function. Neural Computing and Applications 27(4): 1019-1031.
Chung, F. R.; Graham, R. L.; Frankl, P.; and Shearer, J. B. 1986. Some intersection theorems for ordered sets and graphs. Journal of Combinatorial Theory, Series A 43(1): 23-37.
Demšar, J. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7(Jan): 1-30.
Erdogmus, D.; and Principe, J. C. 2002. An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Transactions on Signal Processing 50(7): 1780-1786.
Fanaee-T, H.; and Gama, J. 2014. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence 2(2-3): 113-127.
Fernandes, A. D.; and Gloor, G. B. 2010. Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9): 1135-1139.
Greenfeld, D.; and Shalit, U. 2020. Robust learning with the Hilbert-Schmidt independence criterion. In ICML, 3759-3768.
Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research 13(Mar): 723-773.
Gretton, A.; Bousquet, O.; Smola, A.; and Schölkopf, B. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, 63-77. Springer.
Gretton, A.; Fukumizu, K.; Teo, C. H.; Song, L.; Schölkopf, B.; and Smola, A. J. 2008. A kernel statistical test of independence. In NeurIPS, 585-592.
Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Joe, H. 1989. Relative entropy measures of multivariate dependence. Journal of the American Statistical Association 84(405): 157-164.
Josse, J.; and Holmes, S. 2016. Measuring multivariate association and beyond. Statistics Surveys 10: 132.
Keller, F.; Müller, E.; and Böhm, K. 2012. HiCS: High contrast subspaces for density-based outlier ranking. In ICDE, 1037-1048. IEEE.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In ICLR.
Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69(6): 066138.
Kriegel, H.-P.; Schubert, M.; and Zimek, A. 2008. Angle-based outlier detection in high-dimensional data. In ACM SIGKDD, 444-452.
Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Lazarevic, A.; and Kumar, V. 2005. Feature bagging for outlier detection. In ACM SIGKDD, 157-166.
Lopez-Paz, D.; Hennig, P.; and Schölkopf, B. 2013. The randomized dependence coefficient. In NeurIPS, 1-9.
Madiman, M.; and Tetali, P. 2010. Information inequalities for joint distributions, with interpretations and applications. IEEE Transactions on Information Theory 56(6): 2699-2713.
Marbach, D.; Costello, J. C.; Küffner, R.; Vega, N. M.; Prill, R. J.; Camacho, D. M.; Allison, K. R.; Kellis, M.; Collins, J. J.; and Stolovitzky, G. 2012. Wisdom of crowds for robust gene network inference. Nature Methods 9(8): 796-804.
Müller, E.; Assent, I.; Günnemann, S.; Krieger, R.; and Seidl, T. 2009. Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data. In ICDM, 377-386. IEEE.
Nguyen, H. V.; Müller, E.; Vreeken, J.; Efros, P.; and Böhm, K. 2014. Multivariate maximal correlation analysis. In ICML, 775-783.
Nguyen, H. V.; Müller, E.; Vreeken, J.; Keller, F.; and Böhm, K. 2013. CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In SIAM International Conference on Data Mining, 198-206. SIAM.
Póczos, B.; Ghahramani, Z.; and Schneider, J. 2012. Copula-based kernel dependency measures. In ICML, 1635-1642.
Principe, J. C.; Xu, D.; Zhao, Q.; and Fisher, J. W. 2000. Learning from examples with information theoretic criteria. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 26(1-2): 61-77.
Quionero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N. D. 2009. Dataset shift in machine learning. The MIT Press.
Raghu, M.; Gilmer, J.; Yosinski, J.; and Sohl-Dickstein, J. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NeurIPS, 6076-6085.
Rao, M.; Chen, Y.; Vemuri, B. C.; and Wang, F. 2004. Cumulative residual entropy: a new measure of information. IEEE Transactions on Information Theory 50(6): 1220-1228.
Rayana, S. 2016. ODDS Library. http://odds.cs.stonybrook.edu. Stony Brook University, Department of Computer Sciences.
Rényi, A. 1959. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica 10(3-4): 441-451.
Reshef, D. N.; Reshef, Y. A.; Finucane, H. K.; Grossman, S. R.; McVean, G.; Turnbaugh, P. J.; Lander, E. S.; Mitzenmacher, M.; and Sabeti, P. C. 2011. Detecting novel associations in large data sets. Science 334(6062): 1518-1524.
Romano, S.; Chelly, O.; Nguyen, V.; Bailey, J.; and Houle, M. E. 2016. Measuring dependency via intrinsic dimensionality. In ICPR, 1207-1212. IEEE.
Sanchez Giraldo, L. G.; Rao, M.; and Principe, J. C. 2014. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory 61(1): 535-548.
Saxe, A. M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B. D.; and Cox, D. D. 2018. On the Information Bottleneck Theory of Deep Learning. In ICLR.
Schmid, F.; and Schmidt, R. 2007. Multivariate extensions of Spearman's rho and related statistics. Statistics & Probability Letters 77(4): 407-416.
Seth, S.; and Príncipe, J. C. 2012. Conditional association. Neural Computation 24(7): 1882-1905.
Shwartz-Ziv, R.; and Tishby, N. 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
Simonyan, K.; and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
Subbaswamy, A.; Schulam, P.; and Saria, S. 2019. Preventing failures due to dataset shift: Learning predictive models that transport. In AISTATS, 3118-3127. PMLR.
Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bünau, P.; and Kawanabe, M. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics 60(4): 699-746.
Sun, T. 1975. Linear dependence structure of the entropy space. Information and Control 29(4): 337-368.
Székely, G. J.; Rizzo, M. L.; Bakirov, N. K.; et al. 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics 35(6): 2769-2794.
Wang, Y.; Romano, S.; Nguyen, V.; Bailey, J.; Ma, X.; and Xia, S.-T. 2017. Unbiased multivariate correlation analysis. In AAAI, 2754-2760.
Watanabe, S. 1960. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development 4(1): 66-82.
Wilson, G.; and Cook, D. J. 2020. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST) 11(5): 1-46.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Yu, S.; Sanchez Giraldo, L. G.; Jenssen, R.; and Principe, J. C. 2019. Multivariate Extension of Matrix-based Rényi's α-order Entropy Functional. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yu, S.; Wickstrøm, K.; Jenssen, R.; and Principe, J. C. 2020. Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.
Zhang, Q.; Filippi, S.; Gretton, A.; and Sejdinovic, D. 2018. Large-scale kernel methods for independence testing. Statistics and Computing 28(1): 113-130.