Sifting Common Information from Many Variables

Greg Ver Steeg, Shuyang Gao, Kyle Reing, Aram Galstyan
University of Southern California, Information Sciences Institute
gregv@isi.edu, gaos@usc.edu, reing@usc.edu, galstyan@isi.edu

Abstract

Measuring the relationship between any pair of variables is a rich and active area of research that is central to scientific practice. In contrast, characterizing the common information among any group of variables is typically a theoretical exercise with few practical methods for high-dimensional data. A promising solution would be a multivariate generalization of the famous Wyner common information, but this approach relies on solving an apparently intractable optimization problem. We leverage the recently introduced information sieve decomposition to formulate an incremental version of the common information problem that admits a simple fixed-point solution, fast convergence, and complexity that is linear in the number of variables. This scalable approach allows us to demonstrate the usefulness of common information in high-dimensional learning problems. The sieve outperforms standard methods on dimensionality reduction tasks, solves a blind source separation problem that cannot be solved with ICA, and accurately recovers structure in brain imaging data.

1 Introduction

One of the most fundamental measures of the relationship between two random variables, $X_1, X_2$, is the mutual information, $I(X_1; X_2)$. While mutual information measures the strength of a relationship, the common information provides a concrete representation, $Y$, of the information that is shared between the two variables. According to [Wyner, 1975], if $Y$ contains the common information between $X_1, X_2$, then we should have $I(X_1; X_2 | Y) = 0$, i.e., $Y$ makes the variables conditionally independent. We can extend this idea to many variables using the multivariate generalization of mutual information called total correlation [Watanabe, 1960], so that conditional independence is equivalent to the condition $TC(X_1, \dots, X_n | Y) = 0$ [Xu et al., 2013]. The most succinct $Y$ with this property represents the multivariate common information in $X$, but finding such a $Y$ is, in general, a challenging, unsolved problem.

The main contribution of this paper is to show that the concept of common information, long studied in information theory for applications like distributed source coding and cryptography [Kumar et al., 2014], is also a useful concept for machine learning. Machine learning applications have been overlooked because recovering common information is intractable for high-dimensional problems. We propose a concrete and tractable algorithmic approach to extracting common information by exploiting a connection with the recently introduced information sieve decomposition [Ver Steeg and Galstyan, 2016]. The sieve decomposition works by searching for a single latent factor that reduces the conditional dependence in the data as much as possible. The data is then transformed to remove this dependence, and the remainder information trickles down to the next layer. The process is repeated until all the dependence has been extracted and the remainder contains nothing but independent noise. Thm. 3.3 connects the latent factors extracted by the sieve to a measure of common information.

Our second contribution is to show that, under the assumptions of linearity and Gaussianity, this optimization has a simple fixed-point solution (Eq. 6) with fast convergence and computational complexity linear in the number of variables. Although our final algorithm is limited to the linear case, extracting common information is an unsolved problem and our approach represents a logical first step in exploring the value of common information for machine learning. We offer suggestions for generalizing the method.

Our final contribution is to validate the usefulness of our approach on some canonical machine learning problems. While PCA finds components that explain the most variation, the sieve discovers components that explain the most dependence, making it a useful complement for exploratory data analysis. Common information can be used to solve a natural class of blind source separation problems that are impossible to solve using independent component analysis (ICA) due to the presence of Gaussian sources. Finally, we show that common information outperforms standard approaches for dimensionality reduction and accurately recovers structure in fMRI data.
2 Preliminaries

Using standard notation [Cover and Thomas, 2006], a capital $X_i$ denotes a continuous random variable whose instances are denoted in lowercase, $x_i$. We abbreviate multivariate random variables as $X \equiv X_{1:n} \equiv (X_1, \dots, X_n)$, with an associated probability density function, $p_X(X_1 = x_1, \dots, X_n = x_n)$, typically abbreviated to $p(\mathbf{x})$, with vectors in bold. We index different groups of multivariate random variables with superscripts, $X^k$, as defined in Fig. 1. We let $X^0$ denote the original observed variables and omit the superscript when no confusion results. Entropy is defined as $H(X) \equiv \langle \log 1/p(x) \rangle$, where brackets denote expectation values. Conditional multivariate mutual information, or conditional total correlation, is defined as the Kullback-Leibler divergence between the joint distribution and the one that is conditionally independent,
$$TC(X|Y) \equiv D_{KL}\Big( p(x|y) \,\Big\|\, \prod_{i=1}^n p(x_i|y) \Big) = \Big\langle \log \frac{p(x|y)}{\prod_{i=1}^n p(x_i|y)} \Big\rangle. \quad (1)$$
This quantity is non-negative and zero if and only if all the $X_i$'s are independent conditioned on $Y$. $TC(X)$ is obtained by dropping the conditioning on $Y$ in the expression above; in other words, $TC(X) = 0$ if and only if the variables are (unconditionally) independent. If $Y$ were the hidden source of all dependence in $X$, then $TC(X|Y) = 0$. Therefore, we consider the problem of searching for a factor $Y$ that minimizes $TC(X|Y)$. In the statement of the theorems we use the shorthand notation $TC(X; Y) \equiv TC(X) - TC(X|Y)$, the reduction of TC after conditioning on $Y$. This notation mirrors the definition of mutual information between two groups of random variables, $X$ and $Y$, as the reduction of uncertainty in one, given information about the other, $I(X; Y) = H(X) - H(X|Y)$.
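For jointly Gaussian variables, these quantities have closed forms in terms of the covariance matrix, which is what makes the linear version of the problem tractable later on. The following numpy sketch is our own illustration (not part of the paper's released code) of estimating $TC(X)$ and $TC(X|Y)$ from sample covariances; the toy data and helper names are ours.

```python
import numpy as np

def gaussian_tc(cov):
    """TC(X) in nats for a Gaussian with covariance `cov`:
    sum_i H(X_i) - H(X) = 0.5 * (sum_i log cov_ii - log det cov)."""
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

def gaussian_conditional_tc(cov_x, cov_xy, var_y):
    """TC(X|Y) for jointly Gaussian (X, Y): apply gaussian_tc to the
    conditional covariance Sigma_X - sigma_xy sigma_xy^T / sigma_y^2."""
    cond_cov = cov_x - np.outer(cov_xy, cov_xy) / var_y
    return gaussian_tc(cond_cov)

# Toy example: X_i = Z + noise, so conditioning on Y = Z removes all dependence.
rng = np.random.default_rng(0)
z = rng.standard_normal(10000)
x = z[:, None] + 0.5 * rng.standard_normal((10000, 3))
full_cov = np.cov(np.column_stack([x, z]), rowvar=False)
cov_x, cov_xy, var_y = full_cov[:3, :3], full_cov[:3, 3], full_cov[3, 3]
print(gaussian_tc(cov_x))                             # clearly positive
print(gaussian_conditional_tc(cov_x, cov_xy, var_y))  # approximately zero
```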
3 Extracting Common Information

For $Y$ to contain the common information in $X$, we need $TC(X|Y) = 0$. Instead of enforcing the condition $TC(X|Y) = 0$ and looking for the most succinct $Y$ that satisfies it, as Wyner does [Wyner, 1975], we consider the dual formulation in which we minimize $TC(X|Y_1, \dots, Y_r)$ subject to constraints on $r$, the size of the state space [Op 't Veld and Gastpar, 2016a]. This optimization can be written equivalently as follows.
$$\min_{y = f(x)} \; TC(X_1, \dots, X_n \,|\, Y_1, \dots, Y_r) \quad (2)$$
We will show in Thm. 3.3 that an upper bound for this objective is obtained by solving a sequence of optimization problems of the following form, indexed by $k$.
$$\min_{y_k = f(x^{k-1})} \; TC(X_1^{k-1}, \dots, X_n^{k-1} \,|\, Y_k) \quad (3)$$
The definition of $X^k$ is discussed next, but the high-level idea is that we have reduced the difficult optimization over many latent factors in Eq. 2 to a sequence of optimizations with a single latent factor in Eq. 3. Each optimization gives us a tighter upper bound on our original objective, Eq. 2.

Incremental Decomposition

We begin with some input data, $X$, and construct $Y_1$ to minimize $TC(X|Y_1)$. After doing so, we would like to transform the original data into the remainder information, $X^1$, so that we can use the same optimization to learn a factor, $Y_2$, that extracts more common information not already captured by $Y_1$. We diagram this construction at layer $k$ in Fig. 1 and show in Thm. 3.1 the requirements for constructing the remainder information. The result of this procedure is encapsulated in Cor. 3.2, which says that we can iterate this procedure and $TC(X|Y_1, \dots, Y_k)$ will be reduced at each layer until it reaches zero and $Y$ captures all the common information.

Figure 1: (a) One layer of the sieve. $Y_k$ is a function of the $X_i^{k-1}$'s that is optimized to capture dependence. The remainder, $X_i^k$, contains information that is not explained by $Y_k$. (b) Naming convention for multiple layers: $X^0 = (X_1, \dots, X_n)$; $X^1 = (X_1^1, \dots, X_n^1, Y_1)$; $X^2 = (X_1^2, \dots, X_n^2, Y_1^2, Y_2)$; and in general $X^k = (X_1^k, \dots, X_n^k, Y_1^k, \dots, Y_{k-1}^k, Y_k)$.

Theorem 3.1 (Incremental decomposition of common information). For $Y_k$ a function of $X^{k-1}$, the following decomposition holds,
$$TC(X^{k-1}) = TC(X^k) + TC(X^{k-1}; Y_k), \quad (4)$$
if the remainder information $X^k$ satisfies two properties:
1. Invertibility: there exist functions $g, h$ so that $x_i^{k-1} = g(x_i^k, y_k)$ and $x_i^k = h(x_i^{k-1}, y_k)$.
2. The remainder contains no information about $Y_k$: $\forall i,\; I(X_i^k; Y_k) = 0$.

Proof. We refer to Fig. 1(a) for the structure of the graphical model. We set $\bar{X} \equiv (\bar{X}_1, \dots, \bar{X}_n, Y)$ and write $\bar{X}_{1:n}$ to pick out all terms except $Y$; here $X$ denotes the input $X^{k-1}$ and $\bar{X}$ the remainder $X^k$. Expanding the definition of $TC(X; Y)$, the equality in Eq. 4 becomes
$$TC(\bar{X}) - TC(X|Y) = \Big\langle \log \frac{p(\bar{x}, y) \prod_i p(x_i|y)}{\prod_i p(\bar{x}_i)\, p(y)\, p(x|y)} \Big\rangle.$$
We have to show that this quantity equals zero under the assumptions specified. First, we multiply the fraction by one by putting $\prod_i p(\bar{x}_i|y)$ terms in the numerator and denominator. After applying condition (2), that $I(\bar{X}_i; Y) = 0$, we can remove two terms, leaving the following.
$$\Big\langle \log \frac{p(\bar{x}|y) \prod_i p(x_i|y)}{p(x|y) \prod_i p(\bar{x}_i|y)} \Big\rangle$$
If condition (1) of the theorem is satisfied then, conditioned on $y$, $x_i$ and $\bar{x}_i$ are related by a deterministic, invertible formula. Applying the change of variables formula for probability densities, the terms in this expression cancel, leaving $\langle \log 1 \rangle = 0$, as we intended to prove.

The decomposition above was originally introduced for discrete variables as the information sieve [Ver Steeg and Galstyan, 2016]; the continuous formulation introduced here replaces the first condition of the original statement with an analogous one that is appropriate for continuous variables. Note that because we can always find non-negative solutions for $TC(X^{k-1}; Y_k)$, it must be that $TC(X^k) \le TC(X^{k-1})$. In other words, the remainder information is more independent than the input data. This is consistent with the intuition that the sieve is sifting out the common information at each layer.
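To make the decomposition concrete, the following sketch (our own check, not from the paper) verifies Eq. 4 numerically in the linear Gaussian setting, using the linear remainder construction given later in this section, $X_i^1 = X_i - \langle X_i Y\rangle/\langle Y^2\rangle\, Y$, and the closed-form Gaussian expression for TC. The variable names and the particular toy model are our own.

```python
import numpy as np

def gaussian_tc(cov):
    # TC(X) = 0.5 * (sum_i log var_i - log det cov), in nats
    return 0.5 * (np.sum(np.log(np.diag(cov))) - np.linalg.slogdet(cov)[1])

rng = np.random.default_rng(1)
n, N = 5, 200000
# Correlated Gaussian data: X_i = Z + noise (an instance of the model in Fig. 2)
z = rng.standard_normal(N)
x = z[:, None] + rng.standard_normal((N, n)) * rng.uniform(0.3, 1.0, size=n)

w = rng.standard_normal(n) / np.sqrt(n)         # any weights define a factor
y = x @ w + rng.standard_normal(N)              # Y | X ~ N(w.x, 1)
a = (x * y[:, None]).mean(0) / (y ** 2).mean()  # <X_i Y> / <Y^2>
x1 = x - np.outer(y, a)                         # remainder information X^1

tc_x  = gaussian_tc(np.cov(x, rowvar=False))
tc_x1 = gaussian_tc(np.cov(np.column_stack([x1, y]), rowvar=False))  # X^1 includes Y_1
# TC(X;Y) = TC(X) - TC(X|Y), with TC(X|Y) from the conditional covariance
cov = np.cov(np.column_stack([x, y]), rowvar=False)
cond = cov[:n, :n] - np.outer(cov[:n, n], cov[:n, n]) / cov[n, n]
tc_xy = tc_x - gaussian_tc(cond)
print(tc_x, tc_x1 + tc_xy)   # the two sides of Eq. 4 agree up to sampling error
```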
Corollary 3.2 (Iterative decomposition of TC). With a hierarchical representation where each $Y_k$ is a function of $X^{k-1}$ and $X^k$ is the remainder information as defined in Thm. 3.1,
$$TC(X) = TC(X^r) + \sum_{k=1}^r TC(X^{k-1}; Y_k).$$
This follows from repeated application of Eq. 4. $TC(X)$ is a constant that depends only on the data. For high-dimensional data it is impossible to measure $TC(X)$ directly, but by learning latent factors that extract progressively more dependence, we obtain a sequence of progressively better bounds.

Theorem 3.3 (Decomposition of common information). For the sieve decomposition, the following bound holds.
$$TC(X|Y_{1:r}) \le TC(X^r) = TC(X) - \sum_{k=1}^r TC(X^{k-1}; Y_k)$$

Proof. The equality comes from Cor. 3.2. For the inequality,
$$
\begin{aligned}
TC(X_{1:n}|Y_{1:r}) &= \Big\langle \log \frac{p(x_{1:n}|y_{1:r})}{\prod_{i=1}^n p(x_i|y_{1:r})} \Big\rangle
= \Big\langle \log \frac{p(x^r_{1:n}|y^r_{1:r})}{\prod_{i=1}^n p(x^r_i|y^r_{1:r})} \Big\rangle \\
&= \Big\langle \log \frac{p(x^r_{1:n}, y^r_{1:r})}{\prod_{i=1}^n p(x^r_i) \prod_{k=1}^r p(y^r_k)} \cdot \frac{\prod_{i=1}^n p(x^r_i) \prod_{k=1}^r p(y^r_k)}{p(y^r_{1:r}) \prod_{i=1}^n p(x^r_i|y^r_{1:r})} \Big\rangle \\
&= TC(X^r) + \Big\langle \log \frac{\prod_{i=1}^n p(x^r_i) \prod_{k=1}^r p(y^r_k)}{p(y^r_{1:r}) \prod_{i=1}^n p(x^r_i|y^r_{1:r})} \Big\rangle \\
&= TC(X^r) - TC(Y^r_{1:r}) - \sum_{i=1}^n I(X^r_i; Y^r_{1:r}) \;\le\; TC(X^r)
\end{aligned}
$$
The second equality on the first line follows from the change of variables formula for the transformation connecting layer $r$ to the input layer. On the second line we multiply by one and rearrange, collecting terms in the next two lines. The last inequality follows from the non-negativity of TC and mutual information.

Recalling that $TC(X; Y) = TC(X) - TC(X|Y)$, Thm. 3.3 shows how the sum of the terms optimized in Eq. 3 provides a successively tighter upper bound on the objective of Eq. 2. In other words, as we keep adding and optimizing latent factors, they reduce the conditional TC until all the common information has been extracted.

Optimization

It remains to solve the optimization in Eq. 3. For now we drop the index $k$ and focus on minimizing $TC(X|Y)$ for a single factor $Y$. To get a simple and tractable solution to this non-convex problem, we consider a further simplification in which $X$ is Gaussian with covariance matrix $\Sigma$ and inverse covariance $\Lambda = \Sigma^{-1}$. If $X$ is Gaussian and $Y$'s dependence on $X$ is linear and Gaussian, the joint distribution over $X, Y$ is also Gaussian. We write out the optimization of Eq. 3 under this condition.
$$\min_{Y|X \sim \mathcal{N}(w \cdot x,\, \sigma^2)} \;\; \sum_{i=1}^n H(X_i|Y) - H(X|Y) \quad (5)$$
Two immediate simplifications are apparent. First, this objective is invariant to scaling of $Y$: any solution with $\sigma, w$ is equivalent to a scaled solution $s\sigma, s w$. Therefore, without loss of generality we set $\sigma = 1$. Second, we invoke Bayes' rule to see that $H(X|Y) = H(Y|X) + H(X) - H(Y)$, where the first two terms on the right-hand side are constants with respect to the optimization. We rewrite the optimization accordingly.
$$\min_{Y|X \sim \mathcal{N}(w \cdot x,\, 1)} \;\; \sum_{i=1}^n H(X_i|Y) + H(Y)$$
The objective is invariant to translation of the marginals, so w.l.o.g. we also set $\langle X_i \rangle = \langle Y \rangle = 0$. Define a nonlinear change of variables in terms of the correlation coefficients, $\rho_i = \langle X_i Y \rangle / \sqrt{\langle X_i^2 \rangle \langle Y^2 \rangle}$. To translate between $w$ and $\rho$, we also note that $(\Sigma w)_i = \langle X_i Y \rangle$, $\rho_i = (\Sigma w)_i / \sqrt{\langle X_i^2 \rangle \langle Y^2 \rangle}$, and $\langle Y^2 \rangle = w^\top \Sigma w + 1 = 1/(1 - \rho^\top \bar{\Sigma}^{-1} \rho)$, where $\bar{\Sigma}$ is the correlation matrix of $X$. This leads to the following optimization, neglecting some constants.
$$\min_{Y|X \sim \mathcal{N}(w \cdot x,\, 1)} \;\; \sum_{i=1}^n \tfrac{1}{2} \log(1 - \rho_i^2) - \tfrac{1}{2} \log(1 - \rho^\top \bar{\Sigma}^{-1} \rho)$$
Next, we set the derivative with respect to each $\rho_i$ to zero, $\partial_{\rho_i} TC(X|Y) = -\rho_i/(1 - \rho_i^2) + (\bar{\Sigma}^{-1}\rho)_i/(1 - \rho^\top \bar{\Sigma}^{-1} \rho) = 0$. Using the identities above to translate back to a fixed-point equation in terms of $w$ and rearranging, we obtain
$$w_i = \frac{\langle X_i Y \rangle}{\langle X_i^2 \rangle \langle Y^2 \rangle - \langle X_i Y \rangle^2}. \quad (6)$$
Interestingly, we arrive at a novel nonlinear twist on the classic Hebbian learning rule [Baldi and Sadowski, 2015]. If $X_i$ and $Y$ fire together they wire together (i.e., correlations lead to stronger weights), but this objective strongly prefers correlations that are nearly maximal, in which case the denominator becomes small and the weight becomes large. This optimization of $TC(X|Y)$ for continuous random variables $X$ and $Y$ is, to the best of our knowledge, the first tractable approach apart from a special case discussed by [Op 't Veld and Gastpar, 2016a]. Also note that although we used $\Sigma, \Lambda$ in the derivation, the solution does not require us to calculate these computationally intensive quantities.
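A minimal sketch of the resulting fixed-point iteration follows; it is our own illustration of Eq. 6 (the authors' released implementation [Ver Steeg, 2016] is the reference code), with the toy data and function name chosen by us. Starting from random weights, we alternately compute $y = w \cdot x$ on the data and update each $w_i$ by Eq. 6.

```python
import numpy as np

def fit_factor(x, n_iter=200, seed=0):
    """One latent factor via the fixed point of Eq. 6.
    `x` is an (N, n) array, assumed centered (mean subtracted per column)."""
    rng = np.random.default_rng(seed)
    n = x.shape[1]
    w = rng.standard_normal(n) / (np.sqrt(n) * x.std(axis=0))  # scaled init, as in Sec. 4
    for _ in range(n_iter):
        y = x @ w                                   # current factor values
        xy = (x * y[:, None]).mean(axis=0)          # <X_i Y>
        yy, xx = (y ** 2).mean(), (x ** 2).mean(axis=0)
        w = xy / (xx * yy - xy ** 2)                # Eq. 6 update
    return w

# Example: three noisy copies of one hidden source are summarized by Y = w.x
rng = np.random.default_rng(3)
z = rng.standard_normal(50000)
x = z[:, None] + 0.7 * rng.standard_normal((50000, 3))
x -= x.mean(axis=0)
w = fit_factor(x)
print(abs(np.corrcoef(x @ w, z)[0, 1]))  # close to the best achievable correlation with Z
```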
A final consideration is the construction of the remainder information (i.e., how to obtain $X^k$ from $X^{k-1}$ and $Y_k$ in Fig. 1) consistent with the requirements of Thm. 3.1. In the discrete formulation of the sieve, constructing remainder information is a major problem that ultimately imposes a bottleneck on its usefulness, because the state space of the remainder information can grow quickly. In the linear case, however, the construction of remainder information is a simple linear transformation reminiscent of incremental PCA. We define the remainder information as
$$X_i^k = X_i^{k-1} - \frac{\langle X_i^{k-1} Y_k \rangle}{\langle Y_k^2 \rangle} Y_k.$$
This transformation is clearly invertible (condition (1)), and it can be checked that $\langle X_i^k Y_k \rangle = 0$, which implies $I(X_i^k; Y_k) = 0$ (condition (2)).

Generalizing to the Non-Gaussian, Nonlinear Case

The solution for the linear, Gaussian case is more flexible than it looks. We do not actually have to require that the data, $X$, be drawn from a jointly normal distribution to get meaningful results. It turns out that if each of the individual marginals is Gaussian, then the Gaussian expression for mutual information provides a lower bound on the true mutual information [Foster and Grassberger, 2011]. Moreover, the objective (Eq. 2) is invariant under invertible transformations of the marginals [Cover and Thomas, 2006]. Therefore, to ensure that the optimization we solve (Eq. 5) is a lower bound for the optimization of interest, we should transform the marginals to be individually Gaussian distributed. Several nonlinear, parametric methods to Gaussianize one-dimensional data exist, including a recent method that works well for long-tailed data [Goerg, 2014]. Alternatively, a nonparametric approach is to Gaussianize data based on rank statistics [Van der Waerden, 1952]. Finally, [Singh and Póczos, 2017] study information measures for a large family of distributions that can be nonparametrically transformed into normal distributions.
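As one concrete example of such preprocessing, here is a minimal sketch of rank-based (van der Waerden style) Gaussianization of each column. This is our own illustration of the nonparametric option mentioned above, not code from the paper.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussianize_ranks(x):
    """Map each column of x (N, n) to approximate normal scores via its ranks.
    Ties get average ranks; output marginals are roughly N(0, 1)."""
    n_samples = x.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, x)   # ranks in 1..N per column
    return norm.ppf(ranks / (n_samples + 1.0))    # inverse normal CDF of scaled ranks

# Example: a heavy-tailed marginal becomes approximately Gaussian
rng = np.random.default_rng(0)
x = rng.standard_t(df=2, size=(10000, 4))
g = gaussianize_ranks(x)
print(g.mean(axis=0).round(2), g.std(axis=0).round(2))   # near 0 and 1
```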
4 Implementation Details

A Single Layer

A concrete implementation of one layer of the sieve transformation is straightforward; the algorithm is summarized in Alg. 1 and our implementation is available online [Ver Steeg, 2016]. The minimal preprocessing of the data is to subtract the mean of each variable. Optionally, further Gaussianizing preprocessing can be applied. The fixed-point optimization starts from some weights, $w^0$, and iteratively updates $w^t$ using Eq. 6 until a fixed point is reached. This only guarantees a local optimum, so we typically run the optimization 10 times and take the solution with the highest value of the objective. We initialize $w_i^0$ by drawing from a normal distribution with zero mean and standard deviation $1/(\sqrt{n}\,\sigma_{x_i})$. Scaling each $w_i^0$ by the standard deviation of the corresponding marginal ensures that no single variable dominates the random initialization, $y = w^0 \cdot x$.

Algorithm 1: Learn one layer of the sieve.
Data: data matrix of $N$ i.i.d. samples of vectors $x \in \mathbb{R}^n$
Result: weights $w$ so that $y = w \cdot x$ optimizes $TC(X; Y)$, and remainder information $\bar{x}$
  Subtract the mean from each column of the data;
  Initialize $w_i \sim \mathcal{N}(0,\, 1/(\sqrt{n}\,\sigma_{x_i}))$;
  while not converged do
    Calculate $y = w \cdot x$ for each sample;
    Calculate moments from the data: $\langle X_i Y \rangle, \langle Y^2 \rangle, \langle X_i^2 \rangle$;
    $\forall i,\; w_i \leftarrow \langle X_i Y \rangle / (\langle Y^2 \rangle \langle X_i^2 \rangle - \langle X_i Y \rangle^2)$;
  end
  For each column $i$ of the data, return $\bar{x}_i = x_i - \langle X_i Y \rangle / \langle Y^2 \rangle \, y$

The iteration proceeds by estimating marginal moments and then applying Eq. 6. Estimating the covariances at each step is the main computational burden, but every step is linear. With $N$ samples and $n$ variables, we calculate labels for each data point, $y = w \cdot x$, which amounts to $N$ dot products of vectors of length $n$. Then we calculate the covariances $\langle X_i Y \rangle$, which amounts to $n$ dot products of vectors of length $N$. These are the most intensive steps and could easily be sped up using GPUs or mini-batches if $N$ is large. Convergence is declared when changes in the objective of Eq. 5 fall below a threshold, $10^{-8}$ in our experiments.

Multiple Layers

After training one layer of the sieve, it is trivial to take the remainder information and feed it again through Alg. 1. While the optimization in Eq. 5 formally involves a probabilistic function, we take the final learned function to be deterministic, $y = w \cdot x$, as required by Thm. 3.1. Each layer contributes $TC(X^{k-1}; Y_k)$ to the decomposition of $TC(X)$, so we can stop when these contributions become negligible. This occurs when the variables in $X^k$ become independent: in that case $TC(X|Y_{1:k}) = TC(X^k) = 0$ and, since $TC(X^k) \ge TC(X^k; Y_{k+1})$, we get no further positive contributions from optimizing $TC(X^k; Y_{k+1})$.
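The following sketch (ours, not the released LinearSieve implementation) stacks layers in exactly this way: fit one factor on the current remainder, subtract the explained part, carry the learned factor along as an extra column following Fig. 1(b), and repeat. It repeats the single-factor fit from the sketch above so that the block is self-contained; all names are our own.

```python
import numpy as np

def fit_layer(x, n_iter=200, seed=0):
    """Fit a single sieve layer on centered data x (N, n); returns weights and factor."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(x.shape[1]) / (np.sqrt(x.shape[1]) * x.std(axis=0))
    for _ in range(n_iter):
        y = x @ w
        xy, yy, xx = (x * y[:, None]).mean(0), (y ** 2).mean(), (x ** 2).mean(0)
        w = xy / np.maximum(xx * yy - xy ** 2, 1e-12)   # Eq. 6 update with numerical guard
    return w, x @ w

def sieve(x, n_layers):
    """Stack layers: each Y_k is learned from the remainder X^{k-1}."""
    x = x - x.mean(axis=0)
    factors = []
    for k in range(n_layers):
        w, y = fit_layer(x, seed=k)
        factors.append(y)
        # Remainder X^k: subtract the part explained by Y_k from every column,
        # then carry Y_k along as an extra column, as in Fig. 1(b).
        x = x - np.outer(y, (x * y[:, None]).mean(0) / (y ** 2).mean())
        x = np.column_stack([x, y])
    return np.column_stack(factors), x   # latent factors Y_1..Y_r and final remainder X^r
```

On the synthetic model of Fig. 2, comparing the columns returned by sieve() against the true sources with np.corrcoef reproduces the kind of comparison reported in Fig. 4.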
5 Experiments

We begin with some benchmark results on a synthetic model. We use this model to show that the sieve can uniquely recover the hidden sources while other methods fail to do so.

Data Generating Model

For the synthetic examples, we consider data generated according to the model defined in Fig. 2. We have $m$ sources, each with unit variance, $Z_j \sim \mathcal{N}(0, 1)$. Each source has $k$ children and the children do not overlap. Each channel is an additive white Gaussian noise (AWGN) channel defined as $X_i = Z_{pa(i)} + \varepsilon_i$. The noise variance may differ across observed variables, $\varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)$. Each channel can be characterized by its capacity, $C_i = \frac{1}{2}\log(1 + 1/\sigma_i^2)$ [Cover and Thomas, 2006], and we define the total capacity $C = \sum_{i=1}^k C_i$. For the experiments, we fix $C$ to some constant and set the noise so that the fraction $C_i/C$ allocated to each variable $X_i$ is drawn from the uniform distribution over the simplex.

Figure 2: The generative model used for synthetic experiments. Each observed variable, $X_i = Z_{pa(i)} + \varepsilon_i$, combines its parent, $Z_{pa(i)}$, with Gaussian noise.

Empirical Convergence Rates

We examine how quickly the objective converges by plotting the error at the $t$-th iteration. The error is defined as the difference between the TC at each iteration and the final TC, where the final value of TC is taken to be the value obtained once the magnitude of changes falls below $10^{-14}$. We set $C = 1$ for these experiments. In Fig. 3, we look at convergence for a few different settings of the generative model and see linear rates of convergence (with error plotted on a log scale, as is conventional for convergence plots), with a coefficient that seems to depend on problem details. The slowest convergence occurs for data in which each $X_i$ is generated from an independent normal distribution (i.e., there is no common information).

Figure 3: Empirical error plots on synthetic data show linear rates of convergence. We obtained similar results on real-world data.

Recover a Single Source from Common Information

As a first test of performance, we consider a simple version of the model in Fig. 2 with just a single source and $k$ observed variables that are noisy copies of that source. For this experiment we set the total capacity to $C = 4$. By varying $k$, we spread this capacity across a larger number of noisier variables. We use the sieve to recover a single latent factor, $Y$, that captures as much of the dependence as possible (Eq. 2), and then test how close this factor is to the true source, $Z$, using the Pearson correlation. We also compare to various other standard methods: PCA [Halko et al., 2011], ICA [Hyvärinen and Oja, 2000], Non-negative Matrix Factorization (NMF) [Lin, 2007], Factor Analysis (FA) [Cattell, 1952], Locally Linear Embedding (LLE) [Roweis and Saul, 2000], Isomap [Tenenbaum et al., 2000], Restricted Boltzmann Machines (RBMs) [Hinton and Salakhutdinov, 2006], and k-means [Sculley, 2010]. All methods were run using implementations in the scikit-learn library [Pedregosa et al., 2011].

Looking at the results in Fig. 4(a), we see that for a small number of variables almost any technique suffices to recover the source. As the number of variables rises, however, intuitively reasonable methods fail and only the sieve maintains high performance. The first component of PCA, for instance, is the projection with the largest variance, but by changing the scale of the noise in different directions this component can be made to point in any direction. Unlike PCA, the sieve is invariant under scale transformations of each variable. Error bars show the standard deviation of results over 10 randomly generated datasets; some error bars are smaller than the plot markers. Besides being the most accurate method, the sieve also has the smallest variance.

Figure 4: Each source is compared to the best match among the components returned by each method. The score is the average of the absolute Pearson correlations. Each point is a mean score over ten randomly generated datasets, with error bars representing standard deviation. (a) We attempt to recover a single hidden source variable from varying numbers of observed variables. We set C = 4 and use 500 samples. (b) We attempt blind source separation for ten independent, hidden source variables given varying numbers of observed variables per source. We set C = 12 and use 10000 samples.

5.1 Source Separation with Common Information

In the generative model of Fig. 2, we have $m$ independent sources that are each Gaussian distributed. We could imagine applying an orthonormal rotation, $R$, to the vector of sources and calling these $\tilde{Z}_j = \sum_k R_{jk} Z_k$. Because the original sources are Gaussian, $\tilde{Z}$ also represents $m$ independent Gaussian sources. We can write down an equivalent generative model for the $X_i$'s, but each $X_i$ now depends on all of the $\tilde{Z}$'s (i.e., $X_i = \sum_j R^{-1}_{ij} \tilde{Z}_j + \varepsilon_i$). From a generative model perspective, our original model is unidentifiable and therefore independent component analysis cannot recover it [Hyvärinen and Oja, 2000]. On the other hand, the original generating model is special because the common information about the $X_i$'s is localized in individual sources, while in the rotated model one needs to combine information from all the sources to predict any individual $X_i$.
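A quick numerical illustration of this unidentifiability (our own sketch, not from the paper): the observed covariance, and hence the full joint Gaussian distribution of $X$, is identical whether the data is generated from the original sources or from rotated ones.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
m, k = 3, 4                                   # m sources, k noisy children each
noise = rng.uniform(0.3, 1.0, size=m * k)

# Mixing matrix for the original model: each X_i has exactly one parent source
A = np.zeros((m * k, m))
A[np.arange(m * k), np.repeat(np.arange(m), k)] = 1.0

R = ortho_group.rvs(m, random_state=1)        # an arbitrary orthonormal rotation
cov_original = A @ A.T + np.diag(noise ** 2)                  # Cov[X] under sources Z
cov_rotated = (A @ R.T) @ (A @ R.T).T + np.diag(noise ** 2)   # Cov[X] under rotated sources
print(np.allclose(cov_original, cov_rotated))  # True: the two models are indistinguishable
```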
The sieve is able to uniquely recover the true sources because they represent the optimal way to sift out the common information. To measure our ability to recover the independent sources in our model, we consider a model with $m = 10$ sources and varying numbers of noisy observations. The results are shown in Fig. 4(b). We learn 10 layers of the sieve and check how well $Y_1, \dots, Y_{10}$ recover the true sources. We also specify 10 components for the other methods shown for comparison. As predicted, ICA does not recover the independent sources. While the generative model is in the class described by factor analysis (FA), there are many FA models that are equally good generative models of the data. In other words, FA suffers from an identifiability problem that makes it impossible to uniquely pick out the correct model [Shalizi, 2013]. In contrast, common information provides a simple and effective principle for uniquely identifying the true sources.

Exploratory Data Analysis

The first component of PCA explains the most variance in the data, and the weights of the first component are often used in exploratory analysis to understand the semantics of discovered factors. Analogously, the first component of the sieve extracts the largest source of common information. In Fig. 5 we compare the top components learned by the sieve on the Olivetti faces dataset to those learned by PCA. The sieve may be more practical for extracting components when data is high-dimensional, since its complexity is linear in the number of variables while PCA's is quadratic.

Figure 5: (Left) The top 6 components for the Olivetti faces dataset using the information sieve (top) and PCA (bottom). Red and blue correspond to negative and positive pixel weights, respectively. (Right) We take Olivetti faces (middle row) and try to reconstruct them using the top 20 components from the sieve (top row) or PCA (bottom row).

Like PCA, we can also use the sieve to reconstruct data from a small number of learned factors. Note that the sieve transform is invertible, so that $X_i = X_i^1 + \langle X_i^0 Y_1 \rangle / \langle Y_1^2 \rangle \, Y_1$. If we have a sieve transformation with $r$ layers, we can continue this expansion as follows.
$$X_i = X_i^r + \sum_{k=1}^r \frac{\langle X_i^{k-1} Y_k \rangle}{\langle Y_k^2 \rangle} Y_k$$
If we knew the remainder information, $X_i^r$, this reconstruction would be perfect. Instead, we can simply set $X_i^r = 0$ and obtain a prediction for $X_i$ based only on the learned factors, $Y$, as in Fig. 5.
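A sketch of this reconstruction (our own, with hypothetical argument names): it assumes the per-layer factors $y_k$ and the coefficients $\langle X_i^{k-1} Y_k\rangle/\langle Y_k^2\rangle$ for the $n$ original columns were recorded while fitting.

```python
import numpy as np

def reconstruct(ys, coeffs, shape):
    """Approximate the centered data from sieve factors alone (remainder set to zero).

    ys     -- list of factor arrays Y_1..Y_r, each of shape (N,)
    coeffs -- list of coefficient arrays; coeffs[k][i] = <X_i^{k-1} Y_k> / <Y_k^2>,
              recorded for the n original columns while fitting
    shape  -- (N, n) shape of the data being reconstructed
    """
    x_hat = np.zeros(shape)
    for y_k, a_k in zip(ys, coeffs):
        x_hat += np.outer(y_k, a_k)   # add back the part of X explained by Y_k
    return x_hat
```

Adding the true remainder $X^r$ to the output would recover the data exactly, which is just the invertibility property used above.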
Therefore, we apply various DR methods to two Gaussian kernel to compare the classification accuracy after dimensionality reduction. The two datasets we studied were GISETTE and MADELON and consist of 5000 and 500 dimensions respectively. For each method and dataset, we learn a low-dimensional representation on training data and then transform held-out test data and report the classification accuracy on that. The results are summarized in Fig. 7. For the GISETTE dataset, we see factor analysis, the sieve, and PCA performing the best, producing low dimensional representations with similar quality using a relatively small number of dimensions. For the MADELON dataset, the sieve representation gives the best accuracy with factor analysis and PCA resulting in accuracy drops of about five and ten percent respectively. Interestingly, all three techniques peak at five dimensions, which was intended to be the correct number of latent factors embedded in this dataset [Guyon et al., 2004]. 6 Related Work Although the sieve is linear, the information objective that is optimized is nonlinear so the sieve substantially differs from methods like PCA. Superficially, the sieve might seem related to methods like Canonical Correlation Analysis (CCA) that seek to find a Y that makes X and Z independent, but that method requires some set of labels, Z. One possibility would be to make Z a copy of X, so that Y is reducing dependence between X and a copy of itself [Wang et al., 2010]. However, this objective differs from common information as can be seen by considering the case where X consists of independent Figure 4: Each source is compared to the best match of the components returned by each method. The score is the average of the absolute Pearson correlations. Each point is a mean score over ten randomly generated datasets, with error bars representing standard deviation. (a) We attempt to recover a single hidden source variable from varying numbers of observed variables. We set C = 4 and use 500 samples. (b) We attempt blind source separation for ten independent, hidden source variables given varying numbers of observed variables per source. We set C = 12 and use 10000 samples. Pixel weight Figure 5: (Left) The top 6 components for the Olivetti faces dataset using the information sieve (top) and PCA (bottom). Red and blue correspond to negative and positive weights respectively. (Right) We take Olivetti faces (middle row) and then try to reconstruct them using the top 20 components from the sieve (top row) or PCA (bottom row). standard datasets and use a Support Vector Machine with a G i k l t th l ifi ti f Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) Figure 7: (a) Validation accuracy for GISETTE dataset (b) Validation accuracy for MADELON dataset. All the scores are averaged by running 20 trials. variables. In that case the common information within X is zero, but X and its copy still have dependence. The concept of common information has largely remained restricted to information-theoretic contexts [Xu et al., 2013; Wyner, 1975; Kumar et al., 2014; Op t Veld and Gastpar, 2016a; Op t Veld and Gastpar, 2016b]. The common information in X that is about some variable, Z, is called intersection information and is also an active area of research [Griffith et al., 2014]. 
6 Related Work

Although the sieve is linear, the information objective that is optimized is nonlinear, so the sieve differs substantially from methods like PCA. Superficially, the sieve might seem related to methods like Canonical Correlation Analysis (CCA) that seek a $Y$ making $X$ and $Z$ independent, but that approach requires some set of labels, $Z$. One possibility would be to make $Z$ a copy of $X$, so that $Y$ reduces the dependence between $X$ and a copy of itself [Wang et al., 2010]. However, this objective differs from common information, as can be seen by considering the case where $X$ consists of independent variables: the common information within $X$ is then zero, but $X$ and its copy still have dependence. The concept of common information has largely remained restricted to information-theoretic contexts [Xu et al., 2013; Wyner, 1975; Kumar et al., 2014; Op 't Veld and Gastpar, 2016a; Op 't Veld and Gastpar, 2016b]. The common information in $X$ that is about some variable, $Z$, is called intersection information and is also an active area of research [Griffith et al., 2014].

Insofar as the sieve reduces the dependence in the data, it can be seen as an alternative approach to independent component analysis [Comon, 1994] that is more directly comparable to least-dependent component analysis [Stögbauer et al., 2004]. As an information-theoretic learning framework, the sieve can be compared to the information bottleneck [Tishby et al., 2000], which also has an interesting Gaussian counterpart [Chechik et al., 2005]. The bottleneck requires labeled data to define its objective. In contrast, the sieve relies on an unsupervised objective that fits more closely into a recent program for decomposing information in high-dimensional data [Ver Steeg and Galstyan, 2014; Ver Steeg and Galstyan, 2015; Ver Steeg and Galstyan, 2016], except that that work focused on discrete latent factors. The sieve could also be viewed as a new objective for projection pursuit [Friedman, 1987] based on common information. It stands out from standard pursuit algorithms in two ways. First, an information-based orthogonality criterion for subsequent projections emerges naturally, and second, new factors may depend on factors learned at previous layers (note that in Fig. 1 each learned latent factor is included in the remainder information that is optimized over at the next step). More broadly, the sieve can be viewed as a new approach to unsupervised deep representation learning [Bengio et al., 2013; Hinton and Salakhutdinov, 2006]. In particular, our setup can be viewed directly as an auto-encoder with a novel objective [Bengio et al., 2007]. From that point of view, it is clear that the sieve can also be leveraged directly for unsupervised density estimation [Dinh et al., 2014].

7 Conclusion

We introduced a new scheme for incrementally extracting common information from high-dimensional data. The foundation of our approach is an efficient information-theoretic optimization that finds latent factors capturing as much information about the multivariate dependence in the data as possible. With a practical method for extracting common information from high-dimensional data, we were able to explore new applications of common information in machine learning. Besides promising applications in exploratory data analysis and dimensionality reduction, common information appears to provide a compelling approach to blind source separation. While the results here relied on assumptions of linearity and Gaussianity, the invariance of the objective under nonlinear marginal transforms, a common ingredient in deep learning schemes, suggests a straightforward path to generalization that we leave to future work. The greedy nature of the sieve construction may be a limitation, so another potential direction would be to jointly optimize several latent factors at once. Sifting out common information in high-dimensional data provides a practical and distinctive new principle for unsupervised learning.

Acknowledgments

GV thanks Sanjoy Dasgupta, Lawrence Saul, and Yoav Freund for encouraging exploration of the linear, Gaussian case for decomposing multivariate information. This work was supported in part by DARPA grant W911NF-12-1-0034 and IARPA grant FA8750-15-C-0071.

References

[Allen et al., 2012] Elena A. Allen, Erik B. Erhardt, Yonghua Wei, Tom Eichele, and Vince D. Calhoun. Capturing inter-subject variability with group independent component analysis of fMRI data: a simulation study. NeuroImage, 59(4), 2012.
[Baldi and Sadowski, 2015] Pierre Baldi and Peter Sadowski. The ebb and flow of deep learning: a theory of local learning. arXiv preprint arXiv:1506.06472, 2015.
[Bengio et al., 2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.
[Bengio et al., 2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[Cattell, 1952] Raymond B. Cattell. Factor Analysis: An Introduction and Manual for the Psychologist and Social Scientist. 1952.
[Chechik et al., 2005] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, pages 165-188, 2005.
[Comon, 1994] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.
[Cover and Thomas, 2006] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
[Dinh et al., 2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[Erhardt et al., 2012] Erik B. Erhardt, Elena A. Allen, Yonghua Wei, Tom Eichele, and Vince D. Calhoun. SimTB, a simulation toolbox for fMRI data under a model of spatiotemporal separability. NeuroImage, 59(4):4160-4167, 2012.
[Foster and Grassberger, 2011] David V. Foster and Peter Grassberger. Lower bounds on mutual information. Physical Review E, 83(1):010101, 2011.
[Friedman, 1987] Jerome H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 1987.
[Goerg, 2014] Georg M. Goerg. The Lambert way to Gaussianize heavy-tailed data with the inverse of Tukey's h transformation as a special case. The Scientific World Journal: Special Issue on Probability and Statistics with Applications in Finance and Economics, 2014.
[Griffith et al., 2014] Virgil Griffith, Edwin K. P. Chong, Ryan G. James, Christopher J. Ellison, and James P. Crutchfield. Intersection information based on common randomness. Entropy, 16(4):1985-2000, 2014.
[Guyon et al., 2004] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2004.
[Halko et al., 2011] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.
[Hinton and Salakhutdinov, 2006] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[Hyvärinen and Oja, 2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411-430, 2000.
[Kumar et al., 2014] G. Ramesh Kumar, Cheuk Ting Li, and Abbas El Gamal. Exact common information. In Information Theory (ISIT), 2014 IEEE International Symposium on. IEEE, 2014.
[Lin, 2007] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.
[Op 't Veld and Gastpar, 2016a] Giel Op 't Veld and Michael C. Gastpar. Caching Gaussians: Minimizing total correlation on the Gray-Wyner network. In 50th Annual Conference on Information Sciences and Systems (CISS), 2016.
[Op 't Veld and Gastpar, 2016b] Giel J. Op 't Veld and Michael C. Gastpar. Total correlation of Gaussian vector sources on the Gray-Wyner network. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 385-392. IEEE, 2016.
[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[Roweis and Saul, 2000] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
[Sculley, 2010] David Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177-1178. ACM, 2010.
[Shalizi, 2013] Cosma Shalizi. Advanced Data Analysis from an Elementary Point of View, 2013.
[Singh and Póczos, 2017] Shashank Singh and Barnabás Póczos. Nonparanormal information estimation. arXiv preprint arXiv:1702.07803, 2017.
[Stögbauer et al., 2004] Harald Stögbauer, Alexander Kraskov, Sergey A. Astakhov, and Peter Grassberger. Least-dependent-component analysis based on mutual information. Physical Review E, 70(6):066123, 2004.
[Tenenbaum et al., 2000] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[Tishby et al., 2000] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. arXiv:physics/0004057, 2000.
[Van der Waerden, 1952] B. L. van der Waerden. Order tests for the two-sample problem and their power. In Indagationes Mathematicae (Proceedings), volume 55, pages 453-458. Elsevier, 1952.
[Ver Steeg and Galstyan, 2014] Greg Ver Steeg and Aram Galstyan. Discovering structure in high-dimensional data through correlation explanation. Advances in Neural Information Processing Systems (NIPS), 2014.
[Ver Steeg and Galstyan, 2015] Greg Ver Steeg and Aram Galstyan. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics (AISTATS), 2015.
[Ver Steeg and Galstyan, 2016] Greg Ver Steeg and Aram Galstyan. The information sieve. In International Conference on Machine Learning (ICML), 2016.
[Ver Steeg, 2016] Greg Ver Steeg. Linear information sieve code. http://github.com/gregversteeg/LinearSieve, 2016.
[Wang et al., 2010] Meihong Wang, Fei Sha, and Michael I. Jordan. Unsupervised kernel dimension reduction. In Advances in Neural Information Processing Systems, pages 2379-2387, 2010.
[Watanabe, 1960] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66-82, 1960.
[Wyner, 1975] Aaron D. Wyner. The common information of two dependent random variables. IEEE Transactions on Information Theory, 21(2):163-179, 1975.
[Xu et al., 2013] Ge Xu, Wei Liu, and Biao Chen. Wyner's common information: Generalizations and a new lossy source coding interpretation. arXiv preprint arXiv:1301.2237, 2013.