# On Variational Bounds of Mutual Information

Ben Poole¹, Sherjil Ozair¹ ², Aäron van den Oord³, Alexander A. Alemi¹, George Tucker¹

¹Google Brain ²MILA ³DeepMind. Correspondence to: Ben Poole.

Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.

## 1. Introduction

Estimating the relationship between pairs of variables is a fundamental problem in science and engineering. Quantifying the degree of the relationship requires a metric that captures a notion of dependency. Here, we focus on mutual information (MI), denoted $I(X; Y)$, which is a reparameterization-invariant measure of dependency:

$$I(X; Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x|y)}{p(x)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y|x)}{p(y)}\right].$$

Mutual information estimators are used in computational neuroscience (Palmer et al., 2015), Bayesian optimal experimental design (Ryan et al., 2016; Foster et al., 2018), understanding neural networks (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Gabrié et al., 2018), and more. In practice, estimating MI is challenging as we typically have access to
samples but not the underlying distributions (Paninski, 2003; McAllester & Stratos, 2018). Existing sample-based estimators are brittle, with the hyperparameters of the estimator impacting the scientific conclusions (Saxe et al., 2018).

Figure 1. Schematic of variational bounds of mutual information presented in this paper. Nodes are colored based on their tractability for estimation and optimization: green bounds can be used for both, yellow for optimization but not estimation, and red for neither. Children are derived from their parents by introducing new approximations or assumptions.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Beyond estimation, many methods use upper bounds on MI to limit the capacity or contents of representations. For example, in the information bottleneck method (Tishby et al., 2000; Alemi et al., 2016), the representation is optimized to solve a downstream task while being constrained to contain as little information as possible about the input. These techniques have proven useful in a variety of domains, from restricting the capacity of discriminators in GANs (Peng et al., 2018) to preventing representations from containing information about protected attributes (Moyer et al., 2018).

Lastly, there is a growing set of methods in representation learning that maximize the mutual information between a learned representation and an aspect of the data. Specifically, given samples from a data distribution, $x \sim p(x)$, the goal is to learn a stochastic representation of the data $p_\theta(y|x)$ that has maximal MI with $X$ subject to constraints on the mapping (e.g. Bell & Sejnowski, 1995; Krause et al., 2010; Hu et al., 2017; van den Oord et al., 2018; Hjelm et al., 2018; Alemi et al., 2017).
To maximize MI, we can compute gradients of a lower bound on MI with respect to the parameters $\theta$ of the stochastic encoder $p_\theta(y|x)$, which may not require directly estimating MI.

While many parametric and non-parametric (Nemenman et al., 2004; Kraskov et al., 2004; Reshef et al., 2011; Gao et al., 2015) techniques have been proposed to address MI estimation and optimization problems, few of them scale up to the dataset size and dimensionality encountered in modern machine learning problems.

To overcome these scaling difficulties, recent work combines variational bounds (Blei et al., 2017; Donsker & Varadhan, 1983; Barber & Agakov, 2003; Nguyen et al., 2010; Foster et al., 2018) with deep learning (Alemi et al., 2016; 2017; van den Oord et al., 2018; Hjelm et al., 2018; Belghazi et al., 2018) to enable differentiable and tractable estimation of mutual information. These papers introduce flexible parametric distributions or critics parameterized by neural networks that are used to approximate unknown densities ($p(y)$, $p(y|x)$) or density ratios ($\frac{p(x|y)}{p(x)} = \frac{p(y|x)}{p(y)}$).

In spite of their effectiveness, the properties of existing variational estimators of MI are not well understood. In this paper, we introduce several results that begin to demystify these approaches and present novel bounds with improved properties (see Fig. 1 for a schematic):

- We provide a review of existing estimators, discussing their relationships and tradeoffs, including the first proof that the noise contrastive loss in van den Oord et al. (2018) is a lower bound on MI, and that the heuristic bias-corrected gradients in Belghazi et al. (2018) can be justified as unbiased estimates of the gradients of a different lower bound on MI.
- We derive a new continuum of multi-sample lower bounds that can flexibly trade off bias and variance, generalizing the bounds of Nguyen et al. (2010) and van den Oord et al. (2018).
- We show how to leverage known conditional structure, yielding simple lower and upper bounds that sandwich MI in the representation learning context when $p_\theta(y|x)$ is tractable.
- We systematically evaluate the bias and variance of MI estimators and their gradients on controlled high-dimensional problems.
- We demonstrate the utility of our variational upper and lower bounds in the context of decoder-free disentangled representation learning on dSprites (Matthey et al., 2017).

## 2. Variational bounds of MI

Here, we review existing variational bounds on MI in a unified framework, and present several new bounds that trade off bias and variance and naturally leverage known conditional densities when they are available. A schematic of the bounds we consider is presented in Fig. 1. We begin by reviewing the classic upper and lower bounds of Barber & Agakov (2003) and then show how to derive the lower bounds of Donsker & Varadhan (1983); Nguyen et al. (2010); Belghazi et al. (2018) from an unnormalized variational distribution. Generalizing the unnormalized bounds to the multi-sample setting yields the bound proposed in van den Oord et al. (2018), and provides the basis for our interpolated bound.

### 2.1. Normalized upper and lower bounds

Upper bounding MI is challenging, but is possible when the conditional distribution $p(y|x)$ is known (e.g. in deep representation learning where $y$ is the stochastic representation). We can build a tractable variational upper bound by introducing a variational approximation $q(y)$ to the intractable marginal $p(y) = \int dx\, p(x)p(y|x)$. By multiplying and dividing the integrand in MI by $q(y)$ and dropping a negative KL term, we get a tractable variational upper bound (Barber & Agakov, 2003):

$$I(X; Y) \equiv \mathbb{E}_{p(x,y)}\left[\log \frac{p(y|x)}{p(y)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y|x)\,q(y)}{q(y)\,p(y)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y|x)}{q(y)}\right] - \mathrm{KL}(p(y)\,\|\,q(y)) \le \mathbb{E}_{p(x)}\left[\mathrm{KL}(p(y|x)\,\|\,q(y))\right] \triangleq R, \tag{1}$$

which is often referred to as the rate in generative models (Alemi et al., 2017). This bound is tight when $q(y) = p(y)$, and requires that computing $\log q(y)$ is tractable.
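As a rough numerical illustration of Eq. 1, the following sketch estimates the rate $R$ by Monte Carlo for a toy pair that is an assumption of this example rather than an experiment from the paper: $x \sim \mathcal{N}(0, 1)$ and $y|x \sim \mathcal{N}(\rho x, 1 - \rho^2)$, with a zero-mean Gaussian $q(y)$ of variance `q_var`. The function names are hypothetical. For this pair the true marginal is $p(y) = \mathcal{N}(0, 1)$ and $I(X; Y) = -\frac{1}{2}\log(1 - \rho^2)$, so the bound should be close to tight at `q_var = 1` and loose otherwise.

```python
import math
import random


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def rate_upper_bound(rho, q_var, num_samples=20000, seed=0):
    """Monte Carlo estimate of R = E_{p(x)}[KL(p(y|x) || q(y))] for the toy pair.

    Assumed setup (illustrative only): x ~ N(0, 1), y|x ~ N(rho * x, 1 - rho^2),
    and q(y) = N(0, q_var).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        total += gaussian_kl(rho * x, 1.0 - rho ** 2, 0.0, q_var)
    return total / num_samples


def true_mi(rho):
    """Closed-form MI of the bivariate Gaussian toy pair."""
    return -0.5 * math.log(1.0 - rho ** 2)
```

Calling `rate_upper_bound(0.8, 1.0)` should give a value near `true_mi(0.8)`, while values of `q_var` away from 1 give larger values; the gap corresponds to the $\mathrm{KL}(p(y)\,\|\,q(y))$ term dropped in Eq. 1.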
This variational upper bound is often used as a regularizer to limit the capacity of a stochastic representation (e.g. Rezende et al., 2014; Kingma & Welling, 2013; Burgess et al., 2018). In Alemi et al. (2016), this upper bound is used to prevent the representation from carrying information about the input that is irrelevant for the downstream classification task.

Unlike the upper bound, most variational lower bounds on mutual information do not require direct knowledge of any conditional densities. To establish an initial lower bound on mutual information, we factor MI in the opposite direction to the upper bound, and replace the intractable conditional distribution $p(x|y)$ with a tractable optimization problem over a variational distribution $q(x|y)$. As shown in Barber & Agakov (2003), this yields a lower bound on MI due to the non-negativity of the KL divergence:

$$I(X; Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{q(x|y)}{p(x)}\right] + \mathbb{E}_{p(y)}\left[\mathrm{KL}(p(x|y)\,\|\,q(x|y))\right] \ge \mathbb{E}_{p(x,y)}\left[\log q(x|y)\right] + h(X) \triangleq I_{\text{BA}}, \tag{2}$$

where $h(X)$ is the differential entropy of $X$. The bound is tight when $q(x|y) = p(x|y)$, in which case the first term equals the negative conditional entropy $-h(X|Y)$.

Unfortunately, evaluating this objective is generally intractable as the differential entropy of $X$ is often unknown. If $h(X)$ is known, this provides a tractable estimate of a lower bound on MI. Otherwise, one can still compare the amount of information different variables (e.g., $Y_1$ and $Y_2$) carry about $X$.

In the representation learning context where $X$ is data and $Y$ is a learned stochastic representation, the first term of $I_{\text{BA}}$ can be thought of as negative reconstruction error or distortion, and the gradient of $I_{\text{BA}}$ with respect to the encoder $p(y|x)$ and variational decoder $q(x|y)$ is tractable. Thus we can use this objective to learn an encoder $p(y|x)$ that maximizes $I(X; Y)$ as in Alemi et al. (2017).
However, this approach to representation learning requires building a tractable decoder $q(x|y)$, which is challenging when $X$ is high-dimensional and $h(X|Y)$ is large, for example in video representation learning (van den Oord et al., 2016).

### 2.2. Unnormalized lower bounds

To derive tractable lower bounds that do not require a tractable decoder, we turn to unnormalized distributions for the variational family of $q(x|y)$, and show how this recovers the estimators of Donsker & Varadhan (1983); Nguyen et al. (2010).

We choose an energy-based variational family that uses a critic $f(x, y)$ and is scaled by the data density $p(x)$:

$$q(x|y) = \frac{p(x)}{Z(y)} e^{f(x,y)}, \quad \text{where } Z(y) = \mathbb{E}_{p(x)}\left[e^{f(x,y)}\right]. \tag{3}$$

Substituting this distribution into $I_{\text{BA}}$ (Eq. 2) gives a lower bound on MI which we refer to as $I_{\text{UBA}}$ for the Unnormalized version of the Barber and Agakov bound:

$$\mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \mathbb{E}_{p(y)}\left[\log Z(y)\right] \triangleq I_{\text{UBA}}. \tag{4}$$

This bound is tight when $f(x, y) = \log p(y|x) + c(y)$, where $c(y)$ is solely a function of $y$ (and not $x$). Note that by scaling $q(x|y)$ by $p(x)$, the intractable differential entropy term in $I_{\text{BA}}$ cancels, but we are still left with an intractable log partition function, $\log Z(y)$, that prevents evaluation or gradient computation. If we apply Jensen's inequality to $\mathbb{E}_{p(y)}[\log Z(y)]$, we can lower bound Eq. 4 to recover the bound of Donsker & Varadhan (1983):

$$I_{\text{UBA}} \ge \mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \log \mathbb{E}_{p(y)}\left[Z(y)\right] \triangleq I_{\text{DV}}. \tag{5}$$

However, this objective is still intractable. Applying Jensen's inequality in the other direction by replacing $\log Z(y) = \log \mathbb{E}_{p(x)}\left[e^{f(x,y)}\right]$ with $\mathbb{E}_{p(x)}[f(x, y)]$ results in a tractable objective, but produces an upper bound on Eq. 4 (which is itself a lower bound on mutual information). Thus evaluating $I_{\text{DV}}$ using a Monte-Carlo approximation of the expectations as in MINE (Belghazi et al., 2018) produces estimates that are neither an upper nor a lower bound on MI.
Recent work has studied the convergence and asymptotic consistency of such nested Monte-Carlo estimators, but does not address the problem of building bounds that hold with finite samples (Rainforth et al., 2018; Mathieu et al., 2018).

To form a tractable bound, we can upper bound the log partition function using the inequality $\log(x) \le \frac{x}{a} + \log(a) - 1$ for all $x, a > 0$. Applying this inequality to the second term of Eq. 4 gives:

$$\log Z(y) \le \frac{Z(y)}{a(y)} + \log(a(y)) - 1,$$

which is tight when $a(y) = Z(y)$. This results in a Tractable Unnormalized version of the Barber and Agakov (TUBA) lower bound on MI that admits unbiased estimates and gradients:

$$I \ge I_{\text{UBA}} \ge \mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[e^{f(x,y)}\right]}{a(y)} + \log(a(y)) - 1\right] \triangleq I_{\text{TUBA}}.$$

To tighten this lower bound, we maximize with respect to the variational parameters $a(y)$ and $f$. In the InfoMax setting, we can maximize the bound with respect to the stochastic encoder $p_\theta(y|x)$ to increase $I(X; Y)$. Unlike the min-max objective of GANs, all parameters are optimized towards the same objective.

This bound holds for any choice of $a(y) > 0$, with simplifications recovering existing bounds. Letting $a(y)$ be the constant $e$ recovers the bound of Nguyen, Wainwright, and Jordan (Nguyen et al., 2010), also known as f-GAN KL (Nowozin et al., 2016) and MINE-f (Belghazi et al., 2018)¹:

$$\mathbb{E}_{p(x,y)}\left[f(x, y)\right] - e^{-1}\,\mathbb{E}_{p(y)}\left[Z(y)\right] \triangleq I_{\text{NWJ}}. \tag{7}$$

This tractable bound no longer requires learning $a(y)$, but now $f(x, y)$ must learn to self-normalize, yielding a unique optimal critic $f^*(x, y) = 1 + \log \frac{p(x|y)}{p(x)}$. This requirement of self-normalization is a common choice when learning log-linear models and empirically has been shown not to negatively impact performance (Mnih & Teh, 2012).

Finally, we can set $a(y)$ to be the scalar exponential moving average (EMA) of $e^{f(x,y)}$ across minibatches. This pushes the normalization constant to be independent of $y$, but it no longer has to exactly self-normalize.
With this choice of $a(y)$, the gradients of $I_{\text{TUBA}}$ exactly yield the improved MINE gradient estimator from Belghazi et al. (2018). This provides sound justification for the heuristic optimization procedure proposed by Belghazi et al. (2018). However, instead of using the critic in the $I_{\text{DV}}$ bound to get an estimate that is not a bound on MI as in Belghazi et al. (2018), one can compute an estimate with $I_{\text{TUBA}}$ which results in a valid lower bound.

¹ $I_{\text{TUBA}}$ can also be derived in the opposite direction by plugging the critic $f'(x, y) = f(x, y) - \log a(y) + 1$ into $I_{\text{NWJ}}$.

To summarize, these unnormalized bounds are attractive because they provide tractable estimators which become tight with the optimal critic. However, in practice they exhibit high variance due to their reliance on high-variance upper bounds on the log partition function.

### 2.3. Multi-sample unnormalized lower bounds

To reduce variance, we extend the unnormalized bounds to depend on multiple samples, and show how to recover the low-variance but high-bias MI estimator proposed by van den Oord et al. (2018).

Our goal is to estimate $I(X_1; Y)$ given samples from $p(x_1)p(y|x_1)$ and access to $K - 1$ additional samples $x_{2:K} \sim r^{K-1}(x_{2:K})$ (potentially from a different distribution than $X_1$). For any random variable $Z$ independent from $X$ and $Y$, $I(X, Z; Y) = I(X; Y)$, therefore:

$$I(X_1; Y) = \mathbb{E}_{r^{K-1}(x_{2:K})}\left[I(X_1; Y)\right] = I(X_1, X_{2:K}; Y).$$

This multi-sample mutual information can be estimated using any of the previous bounds, and has the same optimal critic as for $I(X_1; Y)$. For $I_{\text{NWJ}}$, we have that the optimal critic is $f^*(x_{1:K}, y) = 1 + \log \frac{p(y|x_{1:K})}{p(y)} = 1 + \log \frac{p(y|x_1)}{p(y)}$. However, the critic can now also depend on the additional samples $x_{2:K}$.
In particular, setting the critic to $1 + \log \frac{e^{f(x_1,y)}}{a(y; x_{1:K})}$ and $r^{K-1}(x_{2:K}) = \prod_{j=2}^{K} p(x_j)$, $I_{\text{NWJ}}$ becomes:

$$I(X_1; Y) \ge 1 + \mathbb{E}_{p(x_{1:K})p(y|x_1)}\left[\log \frac{e^{f(x_1,y)}}{a(y; x_{1:K})}\right] - \mathbb{E}_{p(x_{1:K})p(y)}\left[\frac{e^{f(x_1,y)}}{a(y; x_{1:K})}\right], \tag{8}$$

where we have written the critic using parameters $a(y; x_{1:K})$ to highlight the close connection to the variational parameters in $I_{\text{TUBA}}$. One way to leverage these additional samples from $p(x)$ is to build a Monte-Carlo estimate of the partition function $Z(y)$:

$$a(y; x_{1:K}) = m(y; x_{1:K}) = \frac{1}{K}\sum_{i=1}^{K} e^{f(x_i,y)}.$$

Intriguingly, with this choice, the high-variance term in $I_{\text{NWJ}}$ that estimates an upper bound on $\log Z(y)$ is now upper bounded by $\log K$, as $e^{f(x_1,y)}$ appears in the numerator and also in the denominator (scaled by $\frac{1}{K}$). If we average the bound over $K$ replicates, reindexing $x_1$ as $x_i$ for each term, then the last term in Eq. 8 becomes the constant 1:

$$\mathbb{E}_{p(x_{1:K})p(y)}\left[\frac{e^{f(x_1,y)}}{m(y; x_{1:K})}\right] = \frac{1}{K}\sum_{i=1}^{K} \mathbb{E}\left[\frac{e^{f(x_i,y)}}{m(y; x_{1:K})}\right] = \mathbb{E}_{p(x_{1:K})p(y)}\left[\frac{\frac{1}{K}\sum_{i=1}^{K} e^{f(x_i,y)}}{m(y; x_{1:K})}\right] = 1,$$

and we exactly recover the lower bound on MI proposed by van den Oord et al. (2018):

$$I(X; Y) \ge \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{e^{f(x_i,y_i)}}{\frac{1}{K}\sum_{j=1}^{K} e^{f(x_i,y_j)}}\right] \triangleq I_{\text{NCE}},$$

where the expectation is over $K$ independent samples from the joint distribution: $\prod_j p(x_j, y_j)$. This provides a proof² that $I_{\text{NCE}}$ is a lower bound on MI. Unlike $I_{\text{NWJ}}$, where the optimal critic depends on both the conditional and marginal densities, the optimal critic for $I_{\text{NCE}}$ is $f(x, y) = \log p(y|x) + c(y)$, where $c(y)$ is any function that depends on $y$ but not $x$ (Ma & Collins, 2018). Thus the critic only has to learn the conditional density and not the marginal density $p(y)$.

As pointed out in van den Oord et al. (2018), $I_{\text{NCE}}$ is upper bounded by $\log K$, meaning that this bound will be loose when $I(X; Y) > \log K$. Although the optimal critic does not depend on the batch size and can be fit with smaller mini-batches, accurately estimating mutual information still needs a large batch size at test time if the mutual information is high.

### 2.4. Nonlinearly interpolated lower bounds

The multi-sample perspective on $I_{\text{NWJ}}$ allows us to make other choices for the functional form of the critic.
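Before changing the form of the critic, it may help to see the $I_{\text{NCE}}$ estimate from Section 2.3 computed on a single minibatch. The sketch below is illustrative only: the function names are hypothetical, and it assumes the critic has already been evaluated into a $K \times K$ matrix with `scores[i][j]` $= f(x_i, y_j)$, so that the paired samples sit on the diagonal. A log-mean-exp is used for numerical stability, which is an implementation choice of this example.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def infonce(scores):
    """Minibatch InfoNCE estimate from a K x K matrix of critic scores.

    scores[i][j] is assumed to hold f(x_i, y_j); the diagonal entries
    correspond to samples drawn from the joint distribution.
    """
    k = len(scores)
    total = 0.0
    for i in range(k):
        # log of e^{f(x_i, y_i)} divided by the row-wise mean of e^{f(x_i, y_j)}
        total += scores[i][i] - log_mean_exp(scores[i])
    return total / k
```

Because each diagonal term also appears inside its own row mean, every summand is at most $\log K$, so the estimate saturates at $\log K$ however confident the critic is. An uninformative critic with constant scores gives an estimate of zero.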
Here we propose one simple form for a critic that allows us to nonlinearly interpolate between $I_{\text{NWJ}}$ and $I_{\text{NCE}}$, effectively bridging the gap between the low-bias, high-variance $I_{\text{NWJ}}$ estimator and the high-bias, low-variance $I_{\text{NCE}}$ estimator. Similarly to Eq. 8, we set the critic to $1 + \log \frac{e^{f(x_1,y)}}{\alpha m(y; x_{1:K}) + (1 - \alpha) q(y)}$ with $\alpha \in [0, 1]$ to get a continuum of lower bounds:

$$1 + \mathbb{E}_{p(x_{1:K})p(y|x_1)}\left[\log \frac{e^{f(x_1,y)}}{\alpha m(y; x_{1:K}) + (1 - \alpha) q(y)}\right] - \mathbb{E}_{p(x_{1:K})p(y)}\left[\frac{e^{f(x_1,y)}}{\alpha m(y; x_{1:K}) + (1 - \alpha) q(y)}\right] \triangleq I_\alpha.$$

By interpolating between $q(y)$ and $m(y; x_{1:K})$, we can recover $I_{\text{NWJ}}$ ($\alpha = 0$) or $I_{\text{NCE}}$ ($\alpha = 1$). Unlike $I_{\text{NCE}}$, which is upper bounded by $\log K$, the interpolated bound is upper bounded by $\log \frac{K}{\alpha}$, allowing us to use $\alpha$ to tune the tradeoff between bias and variance. We can maximize this lower bound in terms of $q(y)$ and $f$. Note that unlike $I_{\text{NCE}}$, for $\alpha < 1$ the last term does not reduce to a constant, and we must sample $y \sim p(y)$ independently from $x_{1:K}$ to form a Monte Carlo approximation for that term. In practice we use a leave-one-out estimate, holding out an element from the minibatch for the independent $y \sim p(y)$ in the second term. We conjecture that the optimal critic for the interpolated bound is achieved when $f(x, y) = \log p(y|x)$ and $q(y) = p(y)$, and use this choice when evaluating the accuracy of the estimates and gradients of $I_\alpha$ with optimal critics.

² The derivation by van den Oord et al. (2018) relied on an approximation, which we show is unnecessary.

### 2.5. Structured bounds with tractable encoders

In the previous sections we presented one variational upper bound and several variational lower bounds. While these bounds are flexible and can make use of any architecture or parameterization for the variational families, we can additionally take into account known problem structure. Here we present several special cases of the previous bounds that can be leveraged when the conditional distribution $p(y|x)$ is known.
This case is common in representation learning where $x$ is data and $y$ is a learned stochastic representation.

InfoNCE with a tractable conditional. An optimal critic for $I_{\text{NCE}}$ is given by $f(x, y) = \log p(y|x)$, so we can simply use $p(y|x)$ when it is known. This gives us a lower bound on MI without additional variational parameters:

$$I(X; Y) \ge \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{p(y_i|x_i)}{\frac{1}{K}\sum_{j=1}^{K} p(y_i|x_j)}\right], \tag{12}$$

where the expectation is over $\prod_j p(x_j, y_j)$.

Leave one out upper bound. Recall that the variational upper bound (Eq. 1) is minimized when our variational $q(y)$ matches the true marginal distribution $p(y) = \int dx\, p(x)p(y|x)$. Given a minibatch of $K$ $(x_i, y_i)$ pairs, we can approximate $p(y) \approx \frac{1}{K}\sum_i p(y|x_i)$ (Chen et al., 2018). For each example $x_i$ in the minibatch, we can approximate $p(y)$ with the mixture over all other elements: $q_i(y) = \frac{1}{K-1}\sum_{j \ne i} p(y|x_j)$. With this choice of variational distribution, the variational upper bound is:

$$I(X; Y) \le \mathbb{E}\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{p(y_i|x_i)}{\frac{1}{K-1}\sum_{j \ne i} p(y_i|x_j)}\right], \tag{13}$$

where the expectation is over $\prod_i p(x_i, y_i)$. Combining Eq. 12 and Eq. 13, we can sandwich MI without introducing learned variational distributions. Note that the only difference between these bounds is whether $p(y_i|x_i)$ is included in the denominator. Similar mixture distributions have been used in prior work but they require additional parameters (Tomczak & Welling, 2018; Kolchinsky et al., 2017).

Reparameterizing critics. For $I_{\text{NWJ}}$, the optimal critic is given by $1 + \log \frac{p(y|x)}{p(y)}$, so it is possible to use a critic $f(x, y) = 1 + \log \frac{p(y|x)}{q(y)}$ and optimize only over $q(y)$ when $p(y|x)$ is known. The resulting bound resembles the variational upper bound (Eq. 1) with a correction term to make it a lower bound:

$$I(X; Y) \ge \mathbb{E}_{p(x,y)}\left[\log \frac{p(y|x)}{q(y)}\right] + 1 - \mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[p(y|x)\right]}{q(y)}\right] = R + 1 - \mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[p(y|x)\right]}{q(y)}\right].$$

This bound is valid for any choice of $q(y)$, including unnormalized $q$. Similarly, for the interpolated bounds we can use $f(x, y) = \log p(y|x)$ and only optimize over the $q(y)$ in the denominator.
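Returning to the sandwich formed by Eq. 12 and Eq. 13, the following sketch computes both minibatch estimates for a toy encoder that is an assumption of this example, not a model from the paper: $x \sim \mathcal{N}(0, 1)$ and $p(y|x) = \mathcal{N}(\rho x, 1 - \rho^2)$, for which $I(X; Y) = -\frac{1}{2}\log(1 - \rho^2)$. All function names are hypothetical. The two estimates differ only in whether $p(y_i|x_i)$ is kept in the mixture in the denominator.

```python
import math
import random


def log_gauss(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def sandwich_bounds(xs, ys, rho):
    """Return (lower, upper) minibatch estimates in the spirit of Eq. 12 and Eq. 13."""
    k = len(xs)
    var = 1.0 - rho ** 2
    lower = 0.0
    upper = 0.0
    for i in range(k):
        # log p(y_i | x_j) for every x_j in the minibatch.
        logp = [log_gauss(ys[i], rho * xs[j], var) for j in range(k)]
        others = logp[:i] + logp[i + 1:]
        lower += logp[i] - log_mean_exp(logp)    # p(y_i | x_i) kept in the mixture
        upper += logp[i] - log_mean_exp(others)  # p(y_i | x_i) left out
    return lower / k, upper / k


def average_bounds(rho, k=64, num_batches=100, seed=0):
    """Average both estimates over several independent minibatches."""
    rng = random.Random(seed)
    std = math.sqrt(1.0 - rho ** 2)
    lower_sum = 0.0
    upper_sum = 0.0
    for _ in range(num_batches):
        xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        ys = [rho * x + std * rng.gauss(0.0, 1.0) for x in xs]
        lower, upper = sandwich_bounds(xs, ys, rho)
        lower_sum += lower
        upper_sum += upper
    return lower_sum / num_batches, upper_sum / num_batches
```

The inequalities hold in expectation rather than for every minibatch, which is why the sketch averages over batches before comparing against the closed-form value. For moderate `rho` and a batch size of 64 the two averages should bracket the true MI closely; as `rho` approaches 1 the lower estimate saturates at $\log K$.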
In practice, we find reparameterizing the critic to be beneficial as the critic no longer needs to learn the mapping between $x$ and $y$, and instead only has to learn an approximate marginal $q(y)$ in the typically lower-dimensional representation space.

Upper bounding total correlation. Minimizing statistical dependency in representations is a common goal in disentangled representation learning. Prior work has focused on two approaches that both minimize lower bounds: (1) using adversarial learning (Kim & Mnih, 2018; Hjelm et al., 2018), or (2) using minibatch approximations where again a lower bound is minimized (Chen et al., 2018). To measure and minimize statistical dependency, we would like an upper bound, not a lower bound. In the case of a mean field encoder $p(y|x) = \prod_i p(y_i|x)$, we can factor the total correlation into two information terms, and form a tractable upper bound. First, we can write the total correlation as $TC(Y) = \sum_i I(X; Y_i) - I(X; Y)$. We can then use either the standard (Eq. 1) or the leave one out upper bound (Eq. 13) for each term in the summation, and any of the lower bounds for $I(X; Y)$. If $I(X; Y)$ is small, we can use the leave one out upper bound (Eq. 13) and $I_{\text{NCE}}$ (Eq. 12) for the lower bound and get a tractable upper bound on total correlation without any variational distributions or critics. Broadly, we can convert lower bounds on mutual information into upper bounds on KL divergences when the conditional distribution is tractable.

### 2.6. From density ratio estimators to bounds

Note that the optimal critics for both $I_{\text{NWJ}}$ and $I_{\text{NCE}}$ are functions of the log density ratio $\log \frac{p(y|x)}{p(y)}$. So, given a log density ratio estimator, we can estimate the optimal critic and form a lower bound on MI. In practice, we find that

Figure 2. Performance of bounds at estimating mutual information. Top: The dataset $p(x, y; \rho)$ is a correlated Gaussian with the correlation $\rho$ stepping over time.
Bottom: the dataset is created by drawing $x, y \sim p(x, y; \rho)$ and then transforming $y$ to get $(Wy)^3$ where $W_{ij} \sim \mathcal{N}(0, 1)$ and the cubing is elementwise. Critics are trained to maximize each lower bound on MI, and the objective (light) and smoothed objective (dark) are plotted for each technique and critic type. The single-sample bounds ($I_{\text{NWJ}}$ and $I_{\text{JS}}$) have higher variance than $I_{\text{NCE}}$ and $I_\alpha$, but achieve competitive estimates on both datasets. While $I_{\text{NCE}}$ is a poor estimator of MI with the small training batch size of 64, the interpolated bounds are able to provide less biased estimates than $I_{\text{NCE}}$ with less variance than $I_{\text{NWJ}}$. For the more challenging nonlinear relationship in the bottom set of panels, the best estimates of MI are with $\alpha = 0.01$. Using a joint critic (orange) outperforms a separable critic (blue) for $I_{\text{NWJ}}$ and $I_{\text{JS}}$, while the multi-sample bounds are more robust to the choice of critic architecture.

training a critic using the Jensen-Shannon divergence (as in Nowozin et al. (2016); Hjelm et al. (2018)) yields an estimate of the log density ratio that is lower variance and as accurate as training with $I_{\text{NWJ}}$. Empirically we find that training the critic using gradients of $I_{\text{NWJ}}$ can be unstable due to the exp from the upper bound on the log partition function in the $I_{\text{NWJ}}$ objective. Instead, one can train a log density ratio estimator to maximize a lower bound on the Jensen-Shannon (JS) divergence, and use the density ratio estimate in $I_{\text{NWJ}}$ (see Appendix D for details). We call this approach $I_{\text{JS}}$ as we update the critic using the JS as in Hjelm et al. (2018), but still compute a MI lower bound with $I_{\text{NWJ}}$. This approach is similar to Poole et al. (2016) and Mescheder et al. (2017), but results in a bound instead of an unbounded estimate based on a Monte-Carlo approximation of the f-divergence.

## 3. Experiments

First, we evaluate the performance of MI bounds on two simple tractable toy problems.
Then, we conduct a more thorough analysis of the bias/variance tradeoffs in MI estimates and gradient estimates given the optimal critic. Our goal in these experiments was to verify the theoretical results in Section 2, and show that the interpolated bounds can achieve better estimates of MI when the relationship between the variables is nonlinear. Finally, we highlight the utility of these bounds for disentangled representation learning on the dSprites dataset.

Comparing estimates across different lower bounds. We applied our estimators to two different toy problems: (1) a correlated Gaussian problem taken from Belghazi et al. (2018) where $(x, y)$ are drawn from a 20-d Gaussian distribution with correlation $\rho$ (see Appendix B for details), and we vary $\rho$ over time, and (2) the same as in (1) but we apply a random linear transformation followed by a cubic nonlinearity to $y$ to get samples $(x, (Wy)^3)$. As long as the linear transformation is full rank, $I(X; Y) = I(X; (WY)^3)$. We find that the single-sample unnormalized critic estimates of MI exhibit high variance, and are challenging to tune for even these problems. In contrast, the multi-sample estimates of $I_{\text{NCE}}$ are low variance, but have estimates that saturate at log(batch size). The interpolated bounds trade off bias for variance, and achieve the best estimates of MI for the second problem. None of the estimators exhibit low variance and good estimates of MI at high rates, supporting the theoretical findings of McAllester & Stratos (2018).

Efficiency-accuracy tradeoffs for critic architectures. One major difference between the approaches of van den Oord et al. (2018) and Belghazi et al. (2018) is the structure of the critic architecture. van den Oord et al. (2018) use a separable critic $f(x, y) = h(x)^\top g(y)$ which requires only $2N$ forward passes through a neural network for a batch size of $N$. However, Belghazi et al.
(2018) use a joint critic, where $x, y$ are concatenated and fed as input to one network, thus requiring $N^2$ forward passes. For both toy problems, we found that separable critics (orange) increased the variance of the estimator and generally performed worse than joint critics (blue) when using $I_{\text{NWJ}}$ or $I_{\text{JS}}$ (Fig. 2). However, joint critics scale poorly with batch size, and it is possible that separable critics require larger neural networks to get similar performance.

Figure 3. Bias and variance of MI estimates with the optimal critic. While $I_{\text{NWJ}}$ is unbiased when given the optimal critic, $I_{\text{NCE}}$ can exhibit large bias that grows linearly with MI. The $I_\alpha$ bounds trade off bias and variance to recover more accurate bounds in terms of MSE in certain regimes.

Bias-variance tradeoff for optimal critics. To better understand the behavior of different estimators, we analyzed the bias and variance of each estimator as a function of batch size given the optimal critic (Fig. 3). We again evaluated the estimators on the 20-d correlated Gaussian distribution and varied $\rho$ to achieve different values of MI. While $I_{\text{NWJ}}$ is an unbiased estimator of MI, it exhibits high variance when the MI is large and the batch size is small. As noted in van den Oord et al. (2018), the $I_{\text{NCE}}$ estimate is upper bounded by log(batch size). This results in high bias but low variance when the batch size is small and the MI is large. In this regime, the absolute value of the bias grows linearly with MI because the objective saturates to a constant while the MI continues to grow linearly. In contrast, the $I_\alpha$ bounds are less biased than $I_{\text{NCE}}$ and lower variance than $I_{\text{NWJ}}$, resulting in a mean squared error (MSE) that can be smaller than either $I_{\text{NWJ}}$ or $I_{\text{NCE}}$. We can also see that the leave one out upper bound (Eq. 13) has large bias and variance when the batch size is too small.

Bias-variance tradeoffs for representation learning.
To better understand whether the bias and variance of the estimated MI impact representation learning, we looked at the accuracy of the gradients of the estimates with respect to a stochastic encoder $p(y|x)$ versus the true gradient of MI with respect to the encoder. In order to have access to ground truth gradients, we restrict our model to $p_\rho(y_i|x_i) = \mathcal{N}(\rho_i x, \sqrt{1 - \rho_i^2})$ where we have a separate correlation parameter for each dimension $i$, and look at the gradient of MI with respect to the vector of parameters $\rho$. We evaluate the accuracy of the gradients by computing the MSE between the true and approximate gradients. For different settings of the parameters $\rho$, we identify which $\alpha$ performs best as a function of batch size and mutual information. In Fig. 4, we show that the optimal $\alpha$ for the interpolated bounds depends strongly on batch size and the true mutual information. For smaller batch sizes and MIs, $\alpha$ close to 1 ($I_{\text{NCE}}$) is preferred, while for larger batch sizes and MIs, $\alpha$ closer to 0 ($I_{\text{NWJ}}$) is preferred. The reduced gradient MSE of the $I_\alpha$ bounds points to their utility as an objective for training encoders in the InfoMax setting.

Figure 4. Gradient accuracy of MI estimators. Left: MSE between the true encoder gradients and approximate gradients as a function of mutual information and batch size (colors the same as in Fig. 3). Right: For each mutual information and batch size, we evaluated the $I_\alpha$ bound with different values of $\alpha$ and found the $\alpha$ that had the smallest gradient MSE. For small MI and small batch size, $I_{\text{NCE}}$-like objectives are preferred, while for large MI and large batch size, $I_{\text{NWJ}}$-like objectives are preferred.

### 3.1. Decoder-free representation learning on dSprites

Many recent papers in representation learning have focused on learning latent representations in a generative model that correspond to human-interpretable or disentangled concepts (Higgins et al., 2016; Burgess et al., 2018; Chen et al., 2018; Kumar et al., 2017).
While the exact definition of disentangling remains elusive (Locatello et al., 2018; Higgins et al., 2018; Mathieu et al., 2018), many papers have focused on reducing statistical dependency between latent variables as a proxy (Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017). Here we show how a decoder-free information maximization approach subject to smoothness and independence constraints can retain much of the representation learning capabilities of latent-variable generative models on the dSprites dataset (a 2d dataset of white shapes on a black background with varying shape, rotation, scale, and position from Matthey et al. (2017)).

To estimate and maximize the information contained in the representation $Y$ about the input $X$, we use the $I_{\text{JS}}$ lower bound, with a structured critic that leverages the known stochastic encoder $p(y|x)$ but learns an unnormalized variational approximation $q(y)$ to the prior. To encourage independence, we form an upper bound on the total correlation of the representation, $TC(Y)$, by leveraging our novel variational bounds. In particular, we reuse the $I_{\text{JS}}$ lower bound of $I(X; Y)$, and use the leave one out upper bounds (Eq. 13) for each $I(X; Y_i)$. Unlike prior work in this area with VAEs (Kim & Mnih, 2018; Chen et al., 2018; Hjelm et al., 2018; Kumar et al., 2017), this approach tractably estimates and removes statistical dependency in the representation without resorting to adversarial techniques, moment matching, or minibatch lower bounds in the wrong direction.

As demonstrated in Krause et al. (2010), information maximization alone is ineffective at learning useful representations from finite data. Furthermore, minimizing statistical dependency is also insufficient, as we can always find an invertible function that maintains the same amount of information and correlation structure, but scrambles the representation (Locatello et al., 2018).
We can avoid the failure modes of both pure information maximization and pure dependency minimization by introducing additional inductive biases into the representation learning problem. In particular, here we add a simple smoothness regularizer that forces nearby points in $x$ space to be mapped to similar regions in $y$ space: $R(\theta) = KL(p_\theta(y|x) \,\|\, p_\theta(y|x + \epsilon))$, where $\epsilon \sim \mathcal{N}(0, 0.5)$. The resulting regularized InfoMax objective we optimize is:

$$
\begin{aligned}
\underset{p(y|x)}{\text{maximize}} \quad & I(X; Y) \\
\text{subject to} \quad & TC(Y) = \sum_{i} I(X; Y_i) - I(X; Y) \le \delta, \\
& \mathbb{E}_{p(x)p(\epsilon)}\left[ KL(p(y|x) \,\|\, p(y|x + \epsilon)) \right] \le \gamma
\end{aligned}
\tag{15}
$$

We use the convolutional encoder architecture from Burgess et al. (2018) and Locatello et al. (2018) for $p(y|x)$, and a fully-connected neural network with two hidden layers to parameterize the unnormalized variational marginal $q(y)$ used by $I_{\text{JS}}$.

Empirically, we find that this variational regularized InfoMax objective is able to learn x and y position, and scale, but not rotation (Fig. 5, see Chen et al. (2018) for more details on the visualization). To the best of our knowledge, the only other decoder-free representation learning result on dSprites is Pfau & Burgess (2018), which recovers shape and rotation but not scale on a simplified version of the dSprites dataset with one shape.

Figure 5. Feature selectivity on dSprites. The representation learned with our regularized InfoMax objective exhibits disentangled features for position and scale, but not rotation. Each row corresponds to a different active latent dimension. The first column depicts the position tuning of the latent variable, where the x and y axes correspond to x/y position, and the color corresponds to the average activation of the latent variable in response to an input at that position (red is high, blue is low). The scale and rotation columns show the average value of the latent on the y axis, and the value of the ground truth factor (scale or rotation) on the x axis.

4. Discussion

In this work, we reviewed existing bounds on mutual information and presented several new ones. We showed that our new interpolated bounds are able to trade off bias for variance to yield better estimates of MI. However, none of the approaches we considered here are capable of providing low-variance, low-bias estimates when the MI is large and the batch size is small. Future work should identify whether such estimators are impossible (McAllester & Stratos, 2018), or whether certain distributional assumptions or neural network inductive biases can be leveraged to build tractable estimators. Alternatively, it may be easier to estimate gradients of MI than to estimate MI itself. For example, maximizing $I_{\text{BA}}$ is feasible even though we do not have access to the constant data entropy. There may be better approaches in this setting when we do not care about MI estimation and only care about computing gradients of MI for minimization or maximization.

A limitation of our analysis and experiments is that they focus on the regime where the dataset is infinite and there is no overfitting. In this setting, we do not have to worry about differences in MI on training vs. held-out data, nor do we have to tackle biases of finite samples. Addressing and understanding the finite-data regime is an important area for future work.

Another open question is whether mutual information maximization is a more useful objective for representation learning than other unsupervised or self-supervised approaches (Noroozi & Favaro, 2016; Doersch et al., 2015; Dosovitskiy et al., 2014). While deviating from mutual information maximization loses a number of connections to information theory, it may provide other mechanisms for learning features that are useful for downstream tasks. In future work, we hope to evaluate these estimators on larger-scale representation learning tasks to address these questions.

References

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R.
A., and Murphy, K. Fixing a broken ELBO, 2017.
Barber, D. and Agakov, F. The IM algorithm: A variational approach to information maximization. In NIPS, pp. 201-208. MIT Press, 2003.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, D., and Courville, A. Mutual information neural estimation. In International Conference on Machine Learning, pp. 530-539, 2018.
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.
Donsker, M. D. and Varadhan, S. S. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183-212, 1983.
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766-774, 2014.
Foster, A., Jankowiak, M., Bingham, E., Teh, Y. W., Rainforth, T., and Goodman, N. Variational optimal experiment design: Efficient automation of adaptive experiments. In NeurIPS Bayesian Deep Learning Workshop, 2018.
Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785, 2018.
Gao, S., Ver Steeg, G., and Galstyan, A. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pp. 277-286, 2015.
Higgins, I., Matthey, L., Glorot, X., Pal, A., Uria, B., Blundell, C., Mohamed, S., and Lerchner, A. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
Higgins, I., Amos, D., Pfau, D., Racanière, S., Matthey, L., Rezende, D. J., and Lerchner, A. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720, 2017.
Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2013.
Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
Krause, A., Perona, P., and Gomes, R. G. Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems, pp. 775-783, 2010.
Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
Ma, Z. and Collins, M. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812, 2018.
Mathieu, E., Rainforth, T., Siddharth, N., and Teh, Y. W. Disentangling disentanglement in variational auto-encoders, 2018.
Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
McAllester, D. and Stratos, K. Formal limitations on the measurement of mutual information, 2018.
Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
Moyer, D., Gao, S., Brekelmans, R., Galstyan, A., and Ver Steeg, G. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pp. 9102-9111, 2018.
Nemenman, I., Bialek, W., and van Steveninck, R. d. R. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69-84. Springer, 2016.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.
Palmer, S. E., Marre, O., Berry, M. J., and Bialek, W. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908-6913, 2015.
Paninski, L. Estimation of entropy and mutual information. Neural Computation, 15(6):1191-1253, 2003.
Peng, X. B., Kanazawa, A., Toyer, S., Abbeel, P., and Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.
Pfau, D. and Burgess, C. P. Minimally redundant Laplacian eigenmaps. 2018.
Poole, B., Alemi, A. A., Sohl-Dickstein, J., and Angelova, A. Improved generator objectives for GANs. arXiv preprint arXiv:1612.02780, 2016.
Rainforth, T., Cornish, R., Yang, H., and Warrington, A. On nesting Monte Carlo estimators. In International Conference on Machine Learning, pp. 4264-4273, 2018.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. Detecting novel associations in large data sets. Science, 334(6062):1518-1524, 2011.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278-1286, 2014.
Ryan, E. G., Drovandi, C. C., McGree, J. M., and Pettitt, A. N. A review of modern computational algorithms for Bayesian optimal design. International Statistical Review, 84(1):128-154, 2016.
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1-5. IEEE, 2015.
Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
Tomczak, J. and Welling, M. VAE with a VampPrior. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1214-1223, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790-4798, 2016.
van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.