# Statistical Insights into HSIC in High Dimensions

Tao Zhang
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China
dalizhangt@163.com

Yaowu Zhang
School of Information Management and Engineering, MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics, Shanghai 200433, China
zhang.yaowu@mail.shufe.edu.cn

Tingyou Zhou
School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou 310018, China
zhoutingyou@zufe.edu.cn

Measuring the nonlinear dependence between random vectors and testing for their statistical independence is a fundamental problem in statistics. One of the most popular dependence measures is the Hilbert-Schmidt independence criterion (HSIC), which has attracted increasing attention in recent years. However, most existing works have focused on either fixed or very high-dimensional covariates. In this work, we bridge the gap between these two scenarios and provide statistical insights into the performance of HSIC when the dimensions grow at different rates. We first show that, under the null hypothesis, the rescaled HSIC converges in distribution to a standard normal distribution. Then we provide a general condition for HSIC-based tests to have nontrivial power in high dimensions. By decomposing this condition, we illustrate how the ability of HSIC to measure nonlinear dependence changes with increasing dimensions. Moreover, we demonstrate that, depending on the sample size, the covariate dimensions, and the dependence structures within covariates, HSIC can capture different types of associations between random vectors. We also conduct extensive numerical studies to validate our theoretical results.

1 Introduction

Let $x = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ and $y = (Y_1, \ldots, Y_q)^T \in \mathbb{R}^q$ be two random vectors. The problems of measuring nonlinear dependence between $x$ and $y$ and testing for their independence are fundamental and have a wide range of applications in statistics and machine learning. For example, dependence measures can be applied in feature screening (Fan et al., 2020b), model checking (Clarke et al., 2018), graphical models (Maathuis et al., 2018), and causal inference (Imbens & Rubin, 2015).

Many statistical tools have been developed to measure the dependence between random variables. Classical measures such as the Pearson correlation (Pearson, 1920), Spearman's $\rho$ (Spearman, 1904) and Kendall's $\tau$ (Kendall, 1938) can only capture linear or monotonic dependence and may fail to detect complex associations. For univariate covariates, several rank-based measures have been proposed to overcome this limitation, such as those by Hoeffding (1948), Blum et al. (1961), Bergsma & Dassios (2014) and Chatterjee (2021). For multivariate covariates, various nonparametric tests have been introduced to quantify arbitrary associations, including the distance correlation (Székely et al., 2007), the projection correlation (Zhu et al., 2017), the mutual information (Berrett & Samworth, 2019), the kernel canonical correlation (Bach & Jordan, 2002), the constrained covariance (Gretton et al., 2005), and the Hilbert-Schmidt independence criterion (HSIC, Gretton et al., 2007).
Among these measures, the distance correlation and the HSIC are perhaps the most widely used and studied in both statistics and machine learning. For instance, Li et al. (2012) applied the distance correlation to perform feature screening. Fan et al. (2020a) generalized the distance correlation to a conditional dependence measure under a factor model setting. Albert et al. (2022) developed an adaptive test of independence by aggregating HSIC measures. Pfister et al. (2018) defined a $d$-variable HSIC to characterize the mutual dependence among $d$ random variables. It is worth noting that the distance correlation is equivalent to HSIC with a specific kernel choice (Sejdinovic et al., 2013).

High-dimensional settings pose new challenges and opportunities for dependence measures. Székely & Rizzo (2013) suggested estimating the distance correlation with U-statistic theory to avoid bias accumulation when the dimension is high. Ramdas et al. (2015) found that the power of kernel- and distance-based independence tests decreases polynomially with increasing dimension against some fair alternatives. Zhu et al. (2020) proved that, when both $p$ and $q$ grow much faster than the sample size $n$, the sample distance correlation and HSIC can only detect componentwise linear dependence. Gao et al. (2021) established a general condition for the distance correlation based test to have power approaching 1. They also showed that the distance correlation can capture certain purely nonlinear relationships when $p = q = o(n^{1/2})$ under a specific alternative hypothesis.

In this paper, we aim to provide statistical insights into the HSIC in high dimensions, motivated by the increasing popularity of HSIC and the prevalence of high-dimensional data. We focus on the asymptotic distributions of HSIC under both the null and alternative hypotheses when the dimensionality grows with the sample size. Under the null hypothesis, we prove that the rescaled HSIC converges in distribution to a standard normal distribution. We also derive a general condition for HSIC-based tests to have power asymptotically approaching one. These results extend those of Gao et al. (2021) on the distance correlation. Moreover, our analysis is more comprehensive than theirs. By decomposing the general condition, we reveal that HSIC can detect different types of dependence, depending on the dimensionality and the sample size orders. This sheds light on the performance of HSIC in high dimensions, which has been overlooked in the literature.

For illustrative purposes, let $x$ and $y$ be two random vectors with zero mean and identity covariance matrix. When HSIC is used to test for statistical independence between them, we obtain the following conditions for nontrivial power:

- If $p \to \infty$ and $q$ is fixed as $n \to \infty$, and the dependence consists only of the conditional mean of $x$'s $s$-th polynomial given $y$, then $p^{s-1}$ must grow more slowly than $n$.
- If both $p \to \infty$ and $q \to \infty$ as $n \to \infty$, and the dependence consists only of the covariance between $x$'s $s_1$-th polynomial and $y$'s $s_2$-th polynomial, then $p^{s_1-1} q^{s_2-1}$ must grow more slowly than $n$.

These two conditions reveal significant statistical insights and have important implications in practice. For example, when the data dimension $q$ is small and $p$ is larger than the sample size $n$, the first condition implies that, to ensure $p^{s-1}$ is smaller than $n$, $s$ can only be 1. In other words, the HSIC can only measure the conditional mean of $x$ given $y$. When both $p$ and $q$ are larger than $n$, the second condition implies that, to ensure $p^{s_1-1} q^{s_2-1}$ is smaller than $n$, both $s_1$ and $s_2$ must be 1.
That is, the HSIC can only measure the covariance between $x$ and $y$.

In summary, our main contributions are:

1. We generalize the results of Gao et al. (2021) for the distance correlation to HSIC, which is a more flexible and widely used measure of dependence.
2. We show that HSIC can capture different types of dependence, depending on the dimensionality and the sample size orders, which has rarely been recognized in the literature before.
3. In contrast to Zhu et al. (2020), who only focused on the case when both $p$ and $q$ grow much faster than $n$, our results characterize a full picture of the performance of HSIC-based tests in high dimensions.

The rest of the paper is organized as follows. We give some preliminaries about the HSIC in Section 2. Then we present the asymptotic null distribution as well as the power analysis for high-dimensional HSIC in Section 3. In Section 4, we provide statistical insights into the HSIC by connecting the association type with the sample size, the high dimensionality, and the dependence structures within covariates. We conduct comprehensive numerical studies in Section 5 and conclude the paper with a brief discussion in Section 6. All technical details are relegated to the Supplement.

2 Preliminaries

In this section, we give a brief review of the HSIC measure proposed by Gretton et al. (2007), which is derived from the notion of the cross-covariance operator (Baker, 1973). We assume that $x \in \mathcal{X} \subseteq \mathbb{R}^p$ and $y \in \mathcal{Y} \subseteq \mathbb{R}^q$ are random vectors taking values in Euclidean spaces. Let $\mathcal{F}$ and $\mathcal{G}$ be the reproducing kernel Hilbert spaces (RKHSs) on $\mathcal{X}$ and $\mathcal{Y}$, with associated kernels $K(\cdot, \cdot)$ and $L(\cdot, \cdot)$, respectively. Then the cross-covariance operator $C_{xy}$ associated with $\mathcal{F}$ and $\mathcal{G}$ is defined as the operator mapping from $\mathcal{G}$ to $\mathcal{F}$ such that, for all $f \in \mathcal{F}$ and $g \in \mathcal{G}$,
$$\langle f, C_{xy} g \rangle_{\mathcal{F}} = \mathrm{cov}\{f(x), g(y)\}.$$
Instead of using the largest singular value of this operator, Gretton et al. (2007) suggested adopting the squared Hilbert-Schmidt norm, which admits the closed-form expression
$$\mathrm{HSIC}(x, y) = E\{K(x_1, x_2) L(y_1, y_2)\} + E\{K(x_1, x_2)\} E\{L(y_1, y_2)\} - 2 E\big[E\{K(x_1, x_2) \mid x_1\} E\{L(y_1, y_2) \mid y_1\}\big],$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are independent copies of $(x, y)$. When the RKHSs $\mathcal{F}$ and $\mathcal{G}$ are well chosen (e.g., characteristic; Sriperumbudur et al., 2010), the HSIC can serve as an independence criterion in the sense that it is nonnegative and equals zero if and only if $x$ and $y$ are independent. To normalize the HSIC to range from zero to one, we follow Zhu et al. (2020) and define the squared Hilbert-Schmidt correlation as
$$\mathrm{hCorr}^2(x, y) \stackrel{\mathrm{def}}{=} \frac{\mathrm{HSIC}(x, y)}{\mathrm{HSIC}^{1/2}(x, x)\, \mathrm{HSIC}^{1/2}(y, y)}. \quad (1)$$
In addition, similar to Zhu et al. (2020) and Albert et al. (2022), we focus on kernels that can be written in the form $K(x_1, x_2) = k(\|x_1 - x_2\|/\gamma_x)$ and $L(y_1, y_2) = l(\|y_1 - y_2\|/\gamma_y)$, where $k(\cdot)$ and $l(\cdot)$ are real-valued functions defined on $[0, \infty)$, and $\gamma_x$ and $\gamma_y$ are bandwidth parameters. This kind of kernel is commonly used in the literature and includes many positive-definite kernels, such as the Gaussian kernel, the Laplacian kernel, the rational quadratic kernel, and the kernels generating Sobolev spaces. For example, $k(x) = \exp(-x^2)$ corresponds to the Gaussian kernel, and $k(x) = \exp(-x)$ corresponds to the Laplacian kernel. One may refer to Genton (2001) for more examples and discussions of isotropic kernels.
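To make these kernel choices concrete, the following minimal Python sketch (our own illustration, not code from the paper; the function names `median_bandwidth`, `gaussian_gram` and `laplacian_gram` are ours) builds Gaussian and Laplacian Gram matrices in the isotropic form above, with the bandwidth set by the median heuristic used later in Section 5.1.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def median_bandwidth(Z):
    """Median heuristic: median of the pairwise distances ||z_i - z_j||, i < j."""
    return np.median(pdist(Z))

def gaussian_gram(Z, gamma):
    """K(z1, z2) = exp{-||z1 - z2||^2 / (2 gamma^2)}, the form used in Section 5.1."""
    D2 = squareform(pdist(Z, "sqeuclidean"))
    return np.exp(-D2 / (2.0 * gamma**2))

def laplacian_gram(Z, gamma):
    """K(z1, z2) = exp(-||z1 - z2|| / gamma)."""
    return np.exp(-squareform(pdist(Z)) / gamma)
```

Both are isotropic kernels of the form $k(\|z_1 - z_2\|/\gamma_z)$; only the function $k(\cdot)$ changes between them.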
3 Asymptotic Properties in High Dimensions

At the sample level, given a random sample $\{(x_i, y_i), i = 1, \ldots, n\}$, we estimate the HSIC with U-statistics,
$$\mathrm{HSIC}_n(x, y) \stackrel{\mathrm{def}}{=} \frac{\sum_{(i_1, i_2)} K(x_{i_1}, x_{i_2}) L(y_{i_1}, y_{i_2})}{n(n-1)} - 2\, \frac{\sum_{(i_1, i_2, i_3)} K(x_{i_1}, x_{i_2}) L(y_{i_1}, y_{i_3})}{n(n-1)(n-2)} + \frac{\sum_{(i_1, i_2, i_3, i_4)} K(x_{i_1}, x_{i_2}) L(y_{i_3}, y_{i_4})}{n(n-1)(n-2)(n-3)}, \quad (2)$$
where the indices in each summand, i.e., $(i_1, i_2)$, $(i_1, i_2, i_3)$ and $(i_1, i_2, i_3, i_4)$, are all distinct from each other. The squared Hilbert-Schmidt correlation $\mathrm{hCorr}^2(x, y)$ can then be estimated as
$$\mathrm{hCorr}^2_n(x, y) \stackrel{\mathrm{def}}{=} \frac{\mathrm{HSIC}_n(x, y)}{\mathrm{HSIC}^{1/2}_n(x, x)\, \mathrm{HSIC}^{1/2}_n(y, y)}.$$

We now discuss how to use the sample squared Hilbert-Schmidt correlation to distinguish between the null hypothesis $H_0$: $x$ is independent of $y$, and the alternative hypothesis $H_1$: $x$ is not independent of $y$. To implement the HSIC-based test, we must study the asymptotic distribution of $\mathrm{hCorr}^2_n(x, y)$ under the null hypothesis. When the dimensions are fixed, Gretton et al. (2007) showed that $n\, \mathrm{hCorr}^2_n(x, y)$ converges in distribution to a weighted sum of independent chi-squared random variables when $x$ is independent of $y$. We now study its asymptotic null distribution in high dimensions. To this end, we define the two quantities $H_x(x_1, x_2)$ and $H_y(y_1, y_2)$ as
$$H_x(x_1, x_2) \stackrel{\mathrm{def}}{=} K(x_1, x_2) - E\{K(x_1, x_2) \mid x_1\} - E\{K(x_1, x_2) \mid x_2\} + E\{K(x_1, x_2)\},$$
$$H_y(y_1, y_2) \stackrel{\mathrm{def}}{=} L(y_1, y_2) - E\{L(y_1, y_2) \mid y_1\} - E\{L(y_1, y_2) \mid y_2\} + E\{L(y_1, y_2)\}. \quad (3)$$
We further define $G_x(x_1, x_2)$ and $G_y(y_1, y_2)$ as
$$G_x(x_1, x_2) \stackrel{\mathrm{def}}{=} E\{H_x(x_1, x_3) H_x(x_2, x_3) \mid x_1, x_2\}, \quad G_y(y_1, y_2) \stackrel{\mathrm{def}}{=} E\{H_y(y_1, y_3) H_y(y_2, y_3) \mid y_1, y_2\}.$$
We summarize the asymptotic null distribution of $\mathrm{hCorr}^2_n(x, y)$ in the following theorem.

Theorem 1. Assume the kernels are symmetric with finite fourth moments, i.e., $K(x_1, x_2) = K(x_2, x_1)$, $L(y_1, y_2) = L(y_2, y_1)$, $E\{K^4(x_1, x_2)\} < \infty$ and $E\{L^4(y_1, y_2)\} < \infty$. Further assume that $p + q \to \infty$,
$$\frac{E\{H^4_x(x_1, x_2)\}\, E\{H^4_y(y_1, y_2)\}}{n\, \{\mathrm{HSIC}(x, x)\, \mathrm{HSIC}(y, y)\}^2} \to 0, \quad \text{and} \quad \frac{E\{G^2_x(x_1, x_2)\}\, E\{G^2_y(y_1, y_2)\}}{\{\mathrm{HSIC}(x, x)\, \mathrm{HSIC}(y, y)\}^2} \to 0, \quad (4)$$
as $n \to \infty$. Then, under the null hypothesis, $2^{-1/2} n\, \mathrm{hCorr}^2_n(x, y) \stackrel{d}{\to} N(0, 1)$.

In contrast to the fixed-dimension case, Theorem 1 reveals that, under certain conditions, $n\, \mathrm{hCorr}^2_n(x, y)$ is asymptotically normal in high dimensions. The asymptotic normality makes the limiting null distribution tractable in practice, and it greatly expedites the implementation of HSIC-based tests because no additional permutations are required to determine critical values. We remark that the assumptions imposed in Theorem 1 to ensure asymptotic normality are generally very mild. Regarding the restrictions on the kernels, many commonly used kernels, including the Gaussian and Laplacian kernels, are symmetric with finite fourth moments. The assumption in (4) restricts the dependence within the covariates of $x$ and $y$ to be weak. It holds true when $x$ and $y$ follow multivariate normal distributions with bounded eigenvalues. It also holds true when $x$ and $y$ are $m$-dependent random sequences (Definition 9.1 of DasGupta, 2008). One may refer to Gao et al. (2021) for more discussion of this assumption.
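Although (2) involves sums over distinct triples and quadruples, inclusion-exclusion over the zero-diagonal kernel matrices reduces it to $O(n^2)$ matrix operations. Below is a minimal Python sketch of the estimator and of the Theorem 1 test; this is our own illustration (function names ours), not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def hsic_u(K, L):
    """U-statistic HSIC_n in (2) from symmetric kernel matrices K and L.
    Sums over distinct index tuples are obtained by inclusion-exclusion."""
    n = K.shape[0]
    K = K.copy(); L = L.copy()
    np.fill_diagonal(K, 0.0)            # drop i = j terms once and for all
    np.fill_diagonal(L, 0.0)
    s1 = np.sum(K * L)                               # distinct (i1, i2)
    s2 = K.sum(axis=1) @ L.sum(axis=1) - s1          # distinct (i1, i2, i3)
    s3 = K.sum() * L.sum() - 4.0 * s2 - 2.0 * s1     # distinct (i1, i2, i3, i4)
    return (s1 / (n * (n - 1))
            - 2.0 * s2 / (n * (n - 1) * (n - 2))
            + s3 / (n * (n - 1) * (n - 2) * (n - 3)))

def hsic_test(K, L, alpha=0.05):
    """Normal-approximation test of Theorem 1: reject H0 at level alpha
    when 2^{-1/2} n hCorr^2_n exceeds the upper N(0, 1) quantile."""
    n = K.shape[0]
    hcorr2 = hsic_u(K, L) / np.sqrt(hsic_u(K, K) * hsic_u(L, L))
    stat = n * hcorr2 / np.sqrt(2.0)
    return stat, stat > norm.ppf(1.0 - alpha)
```

The identities used are: for zero-diagonal $\tilde{K}$ and $\tilde{L}$, the distinct-triple sum equals $\sum_i (\sum_j \tilde{K}_{ij})(\sum_k \tilde{L}_{ik}) - \sum_{i \neq j} \tilde{K}_{ij} \tilde{L}_{ij}$, and the distinct-quadruple sum equals $(\sum \tilde{K})(\sum \tilde{L}) - 4 \times (\text{triples}) - 2 \times (\text{pairs})$.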
Next, we explore the power performance of the HSIC-based test in high dimensions. As discussed in Section 2, we restrict our attention to isotropic kernels. Specifically, in the remainder of this paper, we consider $K(x_1, x_2)$ and $L(y_1, y_2)$ of the form $k(\|x_1 - x_2\|/\gamma_x)$ and $l(\|y_1 - y_2\|/\gamma_y)$, respectively. Because of the kernel form, we assume without loss of generality that $Ex = 0$ and $Ey = 0$.

We write $\tilde{z} \stackrel{\mathrm{def}}{=} z/\gamma_z$, where $z \in \mathbb{R}^d$ can be either $x$ or $y$, and we say $f(n) \asymp g(n)$ if $f(n) = O\{g(n)\}$ and $g(n) = O\{f(n)\}$. Before summarizing the asymptotic power performance, we make the following assumptions:

(A1) There exists some $\kappa_z > 0$ such that $E\{\|\tilde{z}\|^2 - E(\|\tilde{z}\|^2)\}^{2k} \asymp E(\tilde{z}_1^T \tilde{z}_2)^{2k} \asymp d^{-k\kappa_z}$ for all $k \in \mathbb{N}_+$.

(A2) Let $k_0(x) = k(x^{1/2})$ and $l_0(y) = l(y^{1/2})$. The first and second derivatives of $k_0(\cdot)$ and $l_0(\cdot)$ are uniformly bounded away from zero and infinity around $E\|\tilde{x}_1 - \tilde{x}_2\|^2$ and $E\|\tilde{y}_1 - \tilde{y}_2\|^2$, respectively.

We remark that Assumption (A1) restricts the dependence structure within the coordinates of $z$. Because $\tilde{z} = z/\gamma_z$ and $Ez = 0$, we have $E\tilde{z} = 0$ as well. Hence, both $\|\tilde{z}\|^2 - E(\|\tilde{z}\|^2)$ and $\tilde{z}_1^T \tilde{z}_2$ are sums of $d$ random variables with mean zero. Their standardized versions may converge in distribution to standard normal distributions under mild conditions (e.g., mixing conditions) by the central limit theorem (CLT). For instance, Volný (1989) established a CLT for non-stationary mixing processes. In that case, there exists some $\kappa_z > 0$ such that $d^{\kappa_z/2}\{\|\tilde{z}\|^2 - E(\|\tilde{z}\|^2)\}$ and $d^{\kappa_z/2}\, \tilde{z}_1^T \tilde{z}_2$ both converge in distribution to normal random variables, and Condition (A1) is satisfied.

Assumption (A2) imposes conditions on the kernels, and it covers the Gaussian and Laplacian kernels. It also places requirements on the bandwidth parameters. For example, suppose the Gaussian kernel is used, i.e., $k_0(z) = \exp(-z)$. When $\gamma_z$ is so small that $E\|\tilde{z}_1 - \tilde{z}_2\|^2 \to \infty$, we have $k_0'(E\|\tilde{x}_1 - \tilde{x}_2\|^2) \to 0$, which violates Assumption (A2). Theoretically, the bandwidth parameter $\gamma_z$ can be chosen from a wide range of values, as long as it satisfies Condition (A2). In practice, we use the median of $\|z_1 - z_2\|$ as a default value for $\gamma_z$. When $\gamma_z$ has the same magnitude as $E\|z_1 - z_2\|$, Assumption (A2) is easily satisfied for many commonly used kernels. Moreover, as demonstrated in the simulated examples in the Supplementary Material, our method exhibits robustness across various choices of $\gamma_z$.

With these two assumptions, we are ready to state the asymptotic power performance of the HSIC-based test in high dimensions.

Theorem 2. Assume (A1) and (A2) hold. Then, under the alternative hypothesis, if $n^{1/2}\, \mathrm{hCorr}^2(x, y) \to \infty$ as $n \to \infty$, we have $n\, \mathrm{hCorr}^2_n(x, y) \to \infty$ in probability.

Theorem 2 reveals that, as long as the dependence measured by $\mathrm{hCorr}^2(x, y)$ is not too small, $n\, \mathrm{hCorr}^2_n(x, y)$ diverges to infinity in probability. Together with Theorem 1, this guarantees that the HSIC-based test has nontrivial power in high dimensions. However, as the dimension increases, the signal measured by $\mathrm{hCorr}^2(x, y)$ may decay to zero, which makes the test lose power in high dimensions. To gain more insight into how the power is influenced by the dimensions, we need to connect $\mathrm{hCorr}^2(x, y)$ with the dimensionality, as well as with the association types between $x$ and $y$.
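As an aside, the bandwidth requirement in Assumption (A2) can be checked numerically. The small sketch below (our own illustration; the sample size, dimension, and the three bandwidth multipliers are arbitrary choices) evaluates $|k_0'(t)| = \exp(-t)$ at $t = E\|\tilde{x}_1 - \tilde{x}_2\|^2$ for the Gaussian kernel: a median-scale bandwidth keeps the derivative bounded away from zero, while a much smaller bandwidth drives it to zero and violates (A2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))     # for standard normals, E||x1 - x2||^2 = 2p
sq = np.sum((X[: n // 2] - X[n // 2 :]) ** 2, axis=1)   # independent pairs

# Gaussian kernel: k0(t) = exp(-t), so |k0'(t)| = exp(-t) at t = E||x1 - x2||^2 / gamma^2.
for label, gamma in [("too small", 0.1 * np.sqrt(2 * p)),
                     ("median-like", np.sqrt(2 * p)),
                     ("large", 10 * np.sqrt(2 * p))]:
    t = np.mean(sq) / gamma**2
    print(f"{label:12s} gamma={gamma:8.2f}  t={t:9.4f}  |k0'(t)|={np.exp(-t):.3e}")
```

With the median-like bandwidth, $t$ stays of constant order and $|k_0'(t)| \approx e^{-1}$; with the too-small bandwidth, $t \approx 100$ and the derivative is numerically zero.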
4 Statistical Insights in High Dimensions

We have discussed the asymptotic properties of HSIC-based tests in high dimensions and demonstrated that they can detect the alternative hypothesis as long as the signal strength measured by $\mathrm{hCorr}^2(x, y)$ does not decay to zero too fast. However, it remains unclear how $\mathrm{hCorr}^2(x, y)$ is influenced by the dimensions. To gain more insight into the performance in high dimensions, in this section, we study what kinds of associations the HSIC can detect under different relationships between the sample size and the covariate dimensions.

To this end, we expand $\mathrm{hCorr}^2(x, y)$ defined in (1). Because $\mathrm{HSIC}(x, y)$ characterizes the dependence between $x$ and $y$, while $\mathrm{HSIC}(x, x)$ and $\mathrm{HSIC}(y, y)$ are normalizing factors, we expand $\mathrm{HSIC}(x, y)$ and calculate the orders of $\mathrm{HSIC}(x, x)$ and $\mathrm{HSIC}(y, y)$. We first calculate the orders of $\mathrm{HSIC}(x, x)$ and $\mathrm{HSIC}(y, y)$ at the population level as the dimensions diverge to infinity. The results are summarized in the following proposition.

Proposition 1. Under Assumptions (A1) and (A2), $\mathrm{HSIC}(z, z) \asymp d^{-\kappa_z}$ if $d \to \infty$.

According to Proposition 1, the condition $n^{1/2}\, \mathrm{hCorr}^2(x, y) \to \infty$, which guarantees that the HSIC-based test has power approaching 1 in high dimensions, boils down to $n^{1/2} p^{\kappa_x/2} q^{\kappa_y/2}\, \mathrm{HSIC}(x, y) \to \infty$.

Next, we expand $\mathrm{HSIC}(x, y)$ at the population level. Before that, we introduce some additional notation. Similar to Shao & Zhang (2014), we define $\mathrm{MD}^2(x \mid y)$ as
$$\mathrm{MD}^2(x \mid y) \stackrel{\mathrm{def}}{=} E\{(x_1 - Ex)^T (x_2 - Ex) L(y_1, y_2)\},$$
where $(x_1, y_1)$ and $(x_2, y_2)$ are independent copies of $(x, y)$. We show in Lemma 1 that $\mathrm{MD}^2(x \mid y) = 0$ if and only if $E(x \mid y) = 0$. That is, $\mathrm{MD}^2(x \mid y)$ measures the degree of conditional mean dependence of $x$ on $y$, quantifying the difference between $E(x \mid y)$ and $Ex$.

Lemma 1. Assume the kernel $L(y_1, y_2) = l(\|y_1 - y_2\|/\gamma_y)$ is a characteristic, bounded, continuous, real-valued positive definite function. Then $\mathrm{MD}^2(x \mid y) = 0$ if and only if $E(x \mid y) = 0$ almost surely.

We remark that the characteristic condition in Lemma 1 is used to guarantee that the HSIC is nonnegative and equals zero if and only if $x$ and $y$ are independent. One may refer to Sriperumbudur et al. (2010) for more details on this condition. Furthermore, we let $k_0^{(i)}$ be the $i$-th derivative of $k_0(\cdot)$ evaluated at $E\|\tilde{x}_1 - \tilde{x}_2\|^2$, and $l_0^{(j)}$ the $j$-th derivative of $l_0(\cdot)$ evaluated at $E\|\tilde{y}_1 - \tilde{y}_2\|^2$. Let $x^{\otimes n}$ be the $n$-th Kronecker power of $x$, defined by $x^{\otimes n} = x \otimes x^{\otimes (n-1)}$ and $x^{\otimes 1} = x$, where $\otimes$ denotes the Kronecker product. In addition, we use $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. We expand $\mathrm{HSIC}(x, y)$ in the following theorem.

Theorem 3. Assume (A1) and (A2) hold. Then, under the alternative hypothesis,

1. when $p \to \infty$ and $q$ remains fixed as $n \to \infty$, if $E(x^{\otimes t} \mid y) = E(x^{\otimes t})$ holds for all $t < s$ for some $s \in \mathbb{N}_+$, then $\mathrm{HSIC}(x, y) = O(p^{-s\kappa_x/2})$, and
$$\mathrm{HSIC}(x, y) = k_0^{(s)} \sum_{2a + c = s} \frac{(-2)^c}{a!\, a!\, c!}\, \mathrm{MD}^2\big(\tilde{x}^{\otimes c} \|\tilde{x}\|^{2a} \,\big|\, y\big) + o(p^{-s\kappa_x/2});$$

2. when $p \to \infty$ and $q \to \infty$ as $n \to \infty$, if $\mathrm{cov}(x^{\otimes t_1}, y^{\otimes t_2}) \neq 0$ only when $t_1 \geq s_1$ and $t_2 \geq s_2$ for some $s_1, s_2 \in \mathbb{N}_+$, then $\mathrm{HSIC}(x, y) = O(p^{-s_1\kappa_x/2} q^{-s_2\kappa_y/2})$, and
$$\mathrm{HSIC}(x, y) = \sum_{\substack{2a_1 + c_1 = s_1 \\ 2a_2 + c_2 = s_2}} \frac{k_0^{(s_1)} (-2)^{c_1}}{a_1!\, a_1!\, c_1!} \cdot \frac{l_0^{(s_2)} (-2)^{c_2}}{a_2!\, a_2!\, c_2!}\, \big\|\mathrm{cov}\big\{\|\tilde{x}\|^{2a_1} \tilde{x}^{\otimes c_1},\, \|\tilde{y}\|^{2a_2} \tilde{y}^{\otimes c_2}\big\}\big\|_F^2 + o(p^{-s_1\kappa_x/2} q^{-s_2\kappa_y/2}).$$

Theorem 3 enables us to gain insight into the HSIC in high dimensions. First, when $2a + c = s$, $\tilde{x}^{\otimes c} \|\tilde{x}\|^{2a}$ is an $s$-th order polynomial in $x$, and $\mathrm{MD}^2(\tilde{x}^{\otimes c} \|\tilde{x}\|^{2a} \mid y)$ measures the departure from conditional mean independence of $x$ given $y$. The first part of Theorem 3 thus implies that, when $p \to \infty$ and $q$ remains fixed as $n \to \infty$, if $E(x^{\otimes t} \mid y) = E(x^{\otimes t})$ holds for all $t < s$ for some $s \in \mathbb{N}_+$, the HSIC quantifies the departure from $E(x^{\otimes s} \mid y) = E(x^{\otimes s})$. In other words, when only one dimension of the covariates diverges to infinity, the HSIC first measures the conditional mean of $x$ given $y$. If there is no such mean dependence, it turns to measuring the conditional mean of $x^{\otimes 2}$ given $y$, and so on. Similarly, when $p \to \infty$ and $q \to \infty$ as $n \to \infty$, the second part of Theorem 3 reveals that the HSIC first searches for the covariance between $x$ and $y$. If this covariance equals zero, it then turns to the covariance between $x^{\otimes 2}$ and $y$, or the covariance between $x$ and $y^{\otimes 2}$, and so on.
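To make the role of $\mathrm{MD}^2(x \mid y)$ concrete, here is a small plug-in estimate (our own sketch, with a Gaussian choice for $L$; the function name and the simulation settings are ours). In line with Lemma 1, the estimate is near zero under conditional mean independence and clearly positive when $E(x \mid y) \neq 0$.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def md2_hat(X, Y, gamma_y):
    """Plug-in estimate of MD^2(x | y) = E{(x1 - Ex)^T (x2 - Ex) L(y1, y2)}
    with the Gaussian kernel L(y1, y2) = exp{-||y1 - y2||^2 / (2 gamma_y^2)}."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    A = Xc @ Xc.T                      # entries (x_i - xbar)^T (x_j - xbar)
    L = np.exp(-squareform(pdist(Y, "sqeuclidean")) / (2.0 * gamma_y**2))
    np.fill_diagonal(A, 0.0)           # average over i != j only
    return np.sum(A * L) / (n * (n - 1))

rng = np.random.default_rng(1)
n, p, q = 300, 20, 3
Y = rng.standard_normal((n, q))
X_indep = rng.standard_normal((n, p))                               # E(x | y) = 0
X_mean = 0.5 * np.tile(Y[:, :1], (1, p)) + rng.standard_normal((n, p))
gy = np.median(pdist(Y))
print(md2_hat(X_indep, Y, gy), md2_hat(X_mean, Y, gy))  # near 0 vs. clearly positive
```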
In summary, Theorem 3 characterizes how the ability of HSIC to measure nonlinear dependence changes in high dimensions. By comparison, Zhu et al. (2020) only showed that, when both dimensions diverge much faster than the sample size, the HSIC can only capture marginal linear dependence. From this point of view, our result is a generalization of theirs.

When the HSIC is used to test for statistical independence, Theorem 2 and Proposition 1 ensure that, when $n^{1/2} p^{\kappa_x/2} q^{\kappa_y/2}\, \mathrm{HSIC}(x, y) \to \infty$, the HSIC-based test has asymptotic power 1. To make this condition hold, the first part of Theorem 3 implies that, when $p \to \infty$ and $q$ remains fixed as $n \to \infty$, if the dependence consists only of the conditional mean of $x^{\otimes s}$ given $y$, it is required that $p^{(s-1)\kappa_x} = o(n)$, because $\mathrm{HSIC}(x, y)$ is of order $O(p^{-s\kappa_x/2})$ in this case. In other words, when $n = O\{p^{(s-1)\kappa_x}\}$, the HSIC-based test can have nontrivial power only when there exists an $i$-th order polynomial mean dependence of $x$ on $y$ for some $i < s$. Similarly, when both $p$ and $q$ diverge as $n \to \infty$, the second part of Theorem 3 implies that, when the dependence consists only of the covariance between $x^{\otimes s_1}$ and $y^{\otimes s_2}$, the HSIC-based test can have nontrivial power only when $p^{(s_1-1)\kappa_x} q^{(s_2-1)\kappa_y} = o(n)$. Because $\kappa_x$ represents the degree of dependence within the covariates, Theorem 3 successfully connects the association type between $x$ and $y$ with the sample size, the high dimensionality, and the dependence structures within the covariates. We remark that Gao et al. (2021) only showed that the distance correlation based test can have power approaching 1 when $p = q = o(n^{1/2})$ if there exists a purely nonlinear dependence between $x$ and $y$. Their result is therefore, in spirit, a special case of ours.

5 Numerical Studies

5.1 Simulations

In this subsection, we conduct simulation studies to validate our theoretical conclusions on the HSIC-based test in high dimensions. In particular, in Example 1, we show that, as long as the dimensions are not very small, the normal approximation fits the null distribution quite well, even though the theory requires the dimension to diverge to infinity. In Example 2, we fix the dimension of $y$, increase the dimension of $x$, and investigate the finite sample power of the HSIC-based tests under various types of mean dependence. In Example 3, we increase the dimensions of $x$ and $y$ simultaneously and inspect how the empirical power is influenced by the covariate dimensions under different relationships. Throughout the simulations, we use two commonly used kernels to implement the HSIC-based tests, i.e., Gaussian and Laplacian. For the Gaussian kernel, we set $K(x_1, x_2) = \exp\{-\|x_1 - x_2\|^2/(2\gamma^2_x)\}$ and $L(y_1, y_2) = \exp\{-\|y_1 - y_2\|^2/(2\gamma^2_y)\}$. For the Laplacian kernel, we set $K(x_1, x_2) = \exp(-\|x_1 - x_2\|/\gamma_x)$ and $L(y_1, y_2) = \exp(-\|y_1 - y_2\|/\gamma_y)$. For both choices of kernels, we set the bandwidth parameters to $\gamma_x = c_0 \gamma^m_x$ and $\gamma_y = c_0 \gamma^m_y$ and vary $c_0$ from 0.5 to 2. Here, $\gamma^m_z$ denotes the median of $\{\|z_i - z_j\|\}_{1 \leq i < j \leq n}$.
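A minimal runnable version of the size check in Example 1 might look as follows (our own sketch; $n$, $p$, $q$, the replication number, and $c_0 = 1$ are arbitrary choices rather than the paper's exact settings). Under $H_0$ the empirical rejection rate should be close to the nominal level, reflecting the normal approximation of Theorem 1.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import norm

def hsic_u(K, L):
    """Same U-statistic estimator as in the Section 3 sketch."""
    n = K.shape[0]
    K = K.copy(); L = L.copy()
    np.fill_diagonal(K, 0.0); np.fill_diagonal(L, 0.0)
    s1 = np.sum(K * L)
    s2 = K.sum(1) @ L.sum(1) - s1
    s3 = K.sum() * L.sum() - 4.0 * s2 - 2.0 * s1
    return (s1 / (n * (n - 1)) - 2.0 * s2 / (n * (n - 1) * (n - 2))
            + s3 / (n * (n - 1) * (n - 2) * (n - 3)))

def gaussian_gram(Z, c0=1.0):
    d = pdist(Z)
    gamma = c0 * np.median(d)           # gamma = c0 * median of ||z_i - z_j||
    return np.exp(-squareform(d) ** 2 / (2.0 * gamma**2))

rng = np.random.default_rng(2023)
n, p, q, reps, alpha = 100, 50, 50, 500, 0.05
cutoff, rejections = norm.ppf(1.0 - alpha), 0
for _ in range(reps):
    X = rng.standard_normal((n, p))     # H0: x independent of y
    Y = rng.standard_normal((n, q))
    K, L = gaussian_gram(X), gaussian_gram(Y)
    stat = n * hsic_u(K, L) / np.sqrt(2.0 * hsic_u(K, K) * hsic_u(L, L))
    rejections += stat > cutoff
print(f"empirical size at nominal {alpha:.2f}: {rejections / reps:.3f}")
```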