# Improved Mutual Information Estimation

Youssef Mroueh, Igor Melnyk, Pierre Dognin, Jarret Ross, Tom Sercu*

IBM Research AI

*Tom Sercu is now with FAIR.

**Abstract.** We propose to estimate the KL divergence using a relaxed likelihood ratio estimation in a Reproducing Kernel Hilbert Space (RKHS). We show that the dual of our ratio estimator for KL, in the particular case of Mutual Information estimation, corresponds to a lower bound on the MI that is related to the so-called Donsker-Varadhan lower bound. In this dual form, MI is estimated via learning a witness function discriminating between the joint density and the product of marginals, together with an auxiliary scalar variable that enforces a normalization constraint on the likelihood ratio. By extending the function space to neural networks, we propose an efficient neural MI estimator and validate its performance on synthetic examples, showing an advantage over existing baselines. We demonstrate its strength in large-scale self-supervised representation learning through MI maximization.

## 1 Introduction

Mutual information (MI) is a ubiquitous measure of dependency between a pair of random variables, and is one of the cornerstones of information theory. In machine learning, the information maximization principle for learning representations from unlabeled data through self-supervision (Bell and Sejnowski 1995) motivated the development of many MI estimators and applications (Hjelm et al. 2019; Noroozi and Favaro 2016; Kolesnikov, Zhai, and Beyer 2019; Doersch, Gupta, and Efros 2015; Oord, Li, and Vinyals 2018; Hu et al. 2017). The information bottleneck (Tishby, Pereira, and Bialek 1999; Kolchinsky, Tracey, and Wolpert 2017) is another principle that triggered recent interest in mutual information estimation. MI is also used to understand the information flow in neural networks, in learning clusters (Krause, Perona, and Gomes 2010), and in regularizing the training of Generative Adversarial Networks (GANs) (Chen et al. 2016).

In many of these machine learning applications and other scientific fields, one has to estimate MI given samples from the joint distribution of high-dimensional random variables. Since MI is defined as the Kullback-Leibler (KL) divergence between the joint distribution and the product of marginals, one can leverage non-parametric estimators of f-divergences (Nguyen, Wainwright, and Jordan 2008; Nowozin, Cseke, and Tomioka 2016; Sriperumbudur et al. 2009). Of specific interest to us is the Donsker-Varadhan (DV) representation of the KL divergence (Donsker and Varadhan 1976), which was recently used with neural network estimators (Belghazi et al. 2018; Poole et al. 2019). Other approaches to estimating the MI include lower bounds from variational Bayesian methods (Alemi et al. 2017, 2018; Barber and Agakov 2003; Blei, Kucukelbir, and McAuliffe 2017), and geometric methods such as binning (Kraskov, Stögbauer, and Grassberger 2004a), k-nearest neighbors (Kraskov, Stögbauer, and Grassberger 2004b), kernel density estimation (Kandasamy et al. 2015; Han et al. 2017), ensemble estimation (Moon, Sricharan, and Hero 2017), the jackknife approach (Zeng, Xia, and Tong 2018), and the Gaussian copula (Singh and Póczos 2017), to name a few. In this paper, we propose a new estimator of MI that can be used in direct MI maximization or as a regularizer, thanks to its unbiased gradients.
Our starting point is the Donsker-Varadhan (DV) lower bound of the KL divergence, which we represent equivalently via a joint optimization, called η-DV, over a witness function f and an auxiliary variable η in Section 2. In Section 3, we show that when the witness function f is learned in a Reproducing Kernel Hilbert Space (RKHS), the η-DV problem is jointly convex in f and η. The dual of this problem sheds light on this estimator as a constrained ratio estimation, where η plays the role of a Lagrange multiplier that ensures proper normalization of the likelihood ratio. We also show how the witness function can be estimated as a neural network, akin to (Belghazi et al. 2018). We specialize our estimator to MI in Section 4 and show how it compares to alternatives in the literature (Nguyen, Wainwright, and Jordan 2008; Belghazi et al. 2018; Poole et al. 2019). Experiments are presented in Section 5. On synthetic data, we validate our estimators by estimating MI between Gaussian variables and by regularizing GAN training as in (Chen et al. 2016). On real data, we explore our estimator in deep MI maximization for learning representations from unlabeled data. Figure 1 shows an overview of all the bounds and related MI estimators discussed in this paper.

Figure 1: Overview of the paper. Top row shows several KL divergence lower bounds for two probability distributions P and Q. By substituting $P = p_{xy}$, $Q = p_x p_y$, and defining f as a neural network, we obtain the corresponding MI estimators. η-DV and η-MINE are the proposed bound and its derived estimator.

## 2 Lower Bounds on KL Divergences and MI

Consider two probability distributions P and Q, where P is absolutely continuous w.r.t. Q. Let p and q be their respective densities, defined on $\mathcal{X} \subseteq \mathbb{R}^d$. Their KL divergence is defined as

$$\mathrm{KL}(P, Q) = \mathbb{E}_{x \sim P} \log \frac{p(x)}{q(x)} = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$

We are interested in the MI between two random variables X, Y, where X is defined on $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and Y on $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$. Let $p_{xy}$ be their joint density and $p_x$, $p_y$ the marginals of X and Y, respectively. The MI is defined as follows:

$$I(X; Y) = \mathrm{KL}(p_{xy}, p_x p_y), \qquad (1)$$

which is the KL divergence between the joint density and the product of marginals. Non-parametric estimation of MI from samples is an important problem in science and machine learning. In what follows, we review variational lower bounds on the KL divergence that enable such estimation.

**Variational Characterization of the KL divergence.** Let $\mathcal{H}$ be any function space mapping $\mathcal{X}$ to $\mathbb{R}$. The first variational characterization of the KL divergence goes back to (Donsker and Varadhan 1976):

$$\mathrm{KL}(P, Q) \geq D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = \sup_{f \in \mathcal{H}} \left\{ \mathbb{E}_{x \sim P} f(x) - \log\left(\mathbb{E}_{x \sim Q}\, e^{f(x)}\right) \right\}, \qquad (2)$$

where the equality holds if and only if $f^* = \log(p/q) \in \mathcal{H}$. We refer to this bound as the DV bound. The second variational representation was introduced in (Nguyen, Wainwright, and Jordan 2008; Nowozin, Cseke, and Tomioka 2016), derived through convex duality, and is stated as follows:

$$\mathrm{KL}(P, Q) \geq D^{\mathcal{H}}_{f}(P, Q) = 1 + \sup_{f \in \mathcal{H}} \left\{ \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{x \sim Q}\, e^{f(x)} \right\}, \qquad (3)$$

with equality if and only if $f^* = \log(p/q) \in \mathcal{H}$. We call this bound the f-div bound (as in f-divergence). From Eq. 2 and Eq. 3, we see that the variational bounds attempt to estimate the log-likelihood ratio $f^* = \log(p/q)$, and the tightness of the bound depends on the representation power of $\mathcal{H}$.
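To make the two bounds concrete, here is a minimal numpy sketch (ours, not from the paper): for two 1-D Gaussians the optimal witness $f^* = \log(p/q)$ is available in closed form, and plugging it into Eq. 2 and Eq. 3 recovers the true KL up to Monte Carlo error.

```python
# Minimal check of Eq. 2 (DV) and Eq. 3 (f-div) on P = N(mu, 1), Q = N(0, 1),
# where f*(x) = log p(x)/q(x) = mu*x - mu^2/2 and KL(P, Q) = mu^2/2.
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0
kl_true = 0.5 * mu**2                 # closed-form KL between the two Gaussians

x_p = rng.normal(mu, 1.0, 100_000)    # samples from P
x_q = rng.normal(0.0, 1.0, 100_000)   # samples from Q

def f_star(x):                        # optimal witness: log p(x)/q(x)
    return mu * x - 0.5 * mu**2

dv    = f_star(x_p).mean() - np.log(np.exp(f_star(x_q)).mean())   # Eq. 2
f_div = 1.0 + f_star(x_p).mean() - np.exp(f_star(x_q)).mean()     # Eq. 3
print(kl_true, dv, f_div)             # all three are approximately 0.5
```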
In order to compare these two lower bounds, observe that $\log(t) \leq t - 1$ for $t > 0$. Therefore $\log \mathbb{E}_{x \sim Q}\, e^{f(x)} \leq \mathbb{E}_{x \sim Q}\, e^{f(x)} - 1$, which means that for any function space $\mathcal{H}$ we have:

$$\mathrm{KL}(P, Q) \geq D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) \geq D^{\mathcal{H}}_{f}(P, Q), \qquad (4)$$

from which we conclude that the DV bound is tighter than the f-div bound. Now, given samples $\{x_i,\, i = 1 \dots N,\, x_i \sim P\}$ and $\{y_i,\, i = 1 \dots N,\, y_i \sim Q\}$, the KL divergence can be estimated by computing the variational bound via Monte Carlo simulation. Specifically, for the DV bound we have the following estimator:

$$\hat{D}^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = \sup_{f \in \mathcal{H}} \left( \frac{1}{N}\sum_{i=1}^{N} f(x_i) - \log\left(\frac{1}{N}\sum_{i=1}^{N} e^{f(y_i)}\right) \right). \qquad (5)$$

The Mutual Information Neural Estimator (MINE) (Belghazi et al. 2018) considered Eq. 5 with the hypothesis class $\mathcal{H}$ being a neural network. For the f-div bound we have a similar estimator (for which a convex version with an RKHS $\mathcal{H}$ was introduced by (Nguyen, Wainwright, and Jordan 2008)):

$$\hat{D}^{\mathcal{H}}_{f}(P, Q) = 1 + \sup_{f \in \mathcal{H}} \left( \frac{1}{N}\sum_{i=1}^{N} f(x_i) - \frac{1}{N}\sum_{i=1}^{N} e^{f(y_i)} \right). \qquad (6)$$

While the DV bound is tighter than the f-div bound, f-div is a better fit for learning the function f with stochastic gradient optimization, because its cost function is linear in the expectation, whereas in the DV bound a log non-linearity is applied to the expectation. This non-linearity introduces bias into the mini-batch estimation of the cost function, as noted in (Belghazi et al. 2018). In the following, we show how to alleviate this problem and remove the non-linearity, at the price of an additional auxiliary variable, enabling better estimation of the DV bound.

**An η-trick for the DV Bound.** We start with the following elementary lemma, which gives a variational characterization of the log. All proofs are given in Appendix A.

**Lemma 1.** Let $x > 0$. Then $\log(x) = \min_{\eta \in \mathbb{R}} \left\{ e^{-\eta} x + \eta - 1 \right\}$.

Using Lemma 1, we can now linearize the log in the DV bound, Eq. 5.

**Lemma 2 (η-Donsker-Varadhan).** Let $\mathcal{H}$ be any function space mapping $\mathcal{X}$ to $\mathbb{R}$:

$$\mathrm{KL}(P, Q) \geq D^{\mathcal{H}}_{\eta\text{-DV}}(P, Q) = -\inf \{ L(f, \eta) : f \in \mathcal{H},\ \eta \in \mathbb{R} \}, \qquad (7)$$

$$L(f, \eta) = e^{-\eta}\, \mathbb{E}_{x \sim Q}\, e^{f(x)} - \mathbb{E}_{x \sim P} f(x) + \eta - 1. \qquad (8)$$

We refer to this bound as the η-DV bound. Note that for η = 0 we recover the f-div bound. Using Lemma 2, we can now rewrite the estimator of the DV bound in Eq. 5 as follows:

$$\hat{D}^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = -\inf_{f, \eta} \hat{L}(f, \eta) = -\inf_{f, \eta} \left( e^{-\eta} \frac{1}{N}\sum_{i=1}^{N} e^{f(y_i)} - \frac{1}{N}\sum_{i=1}^{N} f(x_i) + \eta - 1 \right), \qquad (9)$$

which enables unbiased stochastic gradient optimization of the function f. We note that similar variational linearizations of non-linearities have been devised in (Argyriou, Evgeniou, and Pontil 2008; Bach, Jenatton, and Mairal 2011).
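Since the whole construction rests on Lemma 1, a quick numerical check (ours, purely illustrative) of the η-trick:

```python
# Lemma 1: for x > 0, g(eta) = exp(-eta) * x + eta - 1 is convex in eta,
# minimized at eta* = log(x), where it attains the value log(x).
import numpy as np

x = 3.7
etas = np.linspace(-5, 5, 100_001)
g = np.exp(-etas) * x + etas - 1.0
i = g.argmin()
print(np.log(x))       # 1.3083...
print(etas[i], g[i])   # eta* ~ log(x), minimum value ~ log(x)
```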
## 3 What Do KALE Learn?

KALE, from (Gretton, Sutherland, and Jitkrittum 2019), refers to the KL Approximate Lower-bound Estimators introduced earlier (DV, η-DV, and f-div). In this section we show that KALE estimators whose witness functions are learned in an RKHS perform likelihood ratio estimation in the maximum mean discrepancy (MMD) sense. Considering the dual of the KL lower bounds optimized in an RKHS, we establish that the likelihood ratio r appears naturally as the dual variable of the witness function f. See Figure 2 for an illustration. For simplicity, we consider an RKHS with a finite-dimensional feature map, i.e., $\mathcal{H} = \{ f \mid f(x) = \langle w, \Phi(x) \rangle,\ \Phi : \mathcal{X} \to \mathbb{R}^m,\ w \in \mathbb{R}^m \}$. Now, for $f \in \mathcal{H}$, the loss given in Eq. 8 for η-DV can be rewritten as follows:

$$L(f, \eta) = L(w, \eta) = e^{-\eta}\, \mathbb{E}_{x \sim Q}\, e^{\langle w, \Phi(x) \rangle} - \langle w, \mathbb{E}_{x \sim P} \Phi(x) \rangle + \eta - 1.$$

Following (Nguyen, Wainwright, and Jordan 2008), we consider the regularized loss $L_\Omega(w, \eta) = L(w, \eta) + \Omega(w)$ and the corresponding sample-based formulation $\hat L_\Omega(w, \eta) = \hat L(w, \eta) + \Omega(w)$, where

$$\hat L(w, \eta) = e^{-\eta} \frac{1}{N}\sum_{i=1}^{N} e^{\langle w, \Phi(y_i)\rangle} - \frac{1}{N}\sum_{i=1}^{N} \langle w, \Phi(x_i)\rangle + \eta - 1,$$

and $\Omega(w)$ is a convex regularizer, e.g., $\Omega(w) = \frac{\lambda}{2}\|w\|_2^2$. The η-DV primal is defined as:

$$\eta\text{-DV P}: \quad \min_{w, \eta}\; L(w, \eta) + \Omega(w). \qquad (10)$$

Let $(w^*, \eta^*) \in \arg\min_{w,\eta} \hat L(w, \eta) + \Omega(w)$ be the empirical estimator. The KALE estimate is $\hat D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = -\hat L(w^*, \eta^*)$. In the following, we show that $L_\Omega(w, \eta)$ is jointly convex in (w, η) and derive its dual, which sheds light on the nature of the likelihood ratio estimate and the role of η.

**Convex Estimate in RKHS.** In Lemma 3 we first establish the convexity of the η-DV loss function.

**Lemma 3.** $L_\Omega$ and $\hat L_\Omega$ are jointly convex in w and η.

**η-DV Dual is a Constrained Likelihood Ratio Estimation.** Given a ratio estimate $\hat r$ of p/q, the KL divergence is computed as $\mathbb{E}_Q\, \hat r \log(\hat r)$ (Mohamed and Lakshminarayanan 2016), which can be easily estimated using samples from Q. In Theorem 1, proven in Appendix A, we show that the dual problem (denoted D), corresponding to the primal minimization problem (denoted P) of η-DV, reduces to a constrained likelihood ratio estimation problem (denoted C).

**Theorem 1.** Let $\Omega^*(\cdot)$ be the Fenchel conjugate of $\Omega(\cdot)$. The η-DV bound restricted to an RKHS amounts to the regularized convex minimization problem P: $\min_{w,\eta} L(w, \eta) + \Omega(w)$, with dual form (D):

$$\min_{r : \mathcal{X} \to \mathbb{R}_+}\; \max_{\eta}\; \mathbb{E}_Q\, r \log(r) + (\eta - 1)\left(\mathbb{E}_Q r - 1\right) + \Omega^*(\Delta(r)), \quad \text{where } \Delta(r) = \mathbb{E}_{x \sim P}\Phi(x) - \mathbb{E}_{y \sim Q}\, r(y)\Phi(y). \qquad (11)$$

Noticing that η − 1 plays the role of a Lagrange multiplier, this is equivalent to the likelihood ratio estimation problem (C):

$$\min_{r > 0}\; \mathbb{E}_Q\, r(y) \log(r(y)) + \Omega^*\big(\mathbb{E}_{x \sim P}\Phi(x) - \mathbb{E}_{y \sim Q}\, r(y)\Phi(y)\big) \quad \text{such that } \mathbb{E}_{y \sim Q}\, r(y) = 1. \qquad (12)$$

Therefore, we have $P^* = -D^* = -C^*$. Let $(w^*, \eta^*)$ be an optimizer of P and $r^*$ an optimizer of D; then the KL estimate is $D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = \mathbb{E}_Q\, r^* \log(r^*) = -L(w^*, \eta^*)$, and the KALE witness function $f^*(x') = \langle w^*, \Phi(x')\rangle$ satisfies

$$f^*(x') \propto \big\langle \mathbb{E}_{x \sim P}\Phi(x) - \mathbb{E}_{y \sim Q}\, r^*(y)\Phi(y),\; \Phi(x') \big\rangle.$$

The regularizer $\Omega(w) = \frac{\lambda}{2}\|w\|^2$ can now be given the following interpretation based on the results of Theorem 1. Recall the definition of the MMD (Gretton et al. 2012) between distributions, using an RKHS:

$$\mathrm{MMD}_\Phi(P, Q) = \big\| \mathbb{E}_{x \sim P}\Phi(x) - \mathbb{E}_{y \sim Q}\Phi(y) \big\|. \qquad (13)$$

Replacing the Fenchel conjugate $\Omega^*(\cdot)$ by its expression in Theorem 1, we see that η-DV is equivalent to the following dual (η-DV D):

$$\min_{r > 0}\; \max_\eta\; \mathbb{E}_Q\, r \log(r) + (\eta - 1)\left(\mathbb{E}_Q r - 1\right) + \frac{1}{2\lambda}\mathrm{MMD}^2_\Phi(rq, p), \qquad (14)$$

or to the constrained ratio estimation problem (η-DV C):

$$\min_{r > 0}\; \mathbb{E}_Q\, r \log(r) + \frac{1}{2\lambda}\mathrm{MMD}^2_\Phi(rq, p) \quad \text{s.t. } \mathbb{E}_{y \sim Q}\, r(y) = 1. \qquad (15)$$

Hence, it is now clear that the η-DV optimization problem is equivalent to the constrained likelihood ratio estimation problem in Eq. 15, where the ratio is estimated using the MMD distance in the RKHS between rq and p. It is also easy to see that p/q is a feasible point, and for a universal feature map, $\mathrm{MMD}_\Phi(rq, p) = 0$ if and only if r = p/q; therefore, for a universal kernel, r = p/q is optimal and we recover the KL divergence. When comparing η-DV D (Eq. 14) and η-DV C (Eq. 15), we see that η − 1 plays the role of a Lagrange multiplier ensuring that rq is indeed a normalized distribution.

Figure 2: Comparison of the dual formulations of η-DV and f-div in the RKHS. Here, r is an estimator of the density ratio p/q, and $\mathrm{MMD}_\Phi(P, Q) = \|\mathbb{E}_{x \sim P}\Phi(x) - \mathbb{E}_{y \sim Q}\Phi(y)\|$. Both duals have a relative entropy term, but the f-div bound does not impose a normalization constraint on the ratio, which biases the estimate, while η-DV uses η as a Lagrange multiplier to impose the constraint $\mathbb{E}_Q[r] = 1$ and ensure density normalization.
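To see the feature-matching term $\mathrm{MMD}^2_\Phi(rq, p)$ of Eq. 14 in action, here is a small numpy sketch (ours; the random Fourier feature map and the Gaussian example are illustrative assumptions): reweighting Q's samples by the true ratio r = p/q drives the mean-embedding discrepancy of Eq. 13 to zero and satisfies the constraint $\mathbb{E}_Q r = 1$.

```python
# MMD_Phi(rq, p) = || E_{x~P} Phi(x) - E_{y~Q} r(y) Phi(y) ||: the ratio r
# reweights Q's features so that their mean embedding matches P's.
import numpy as np

rng = np.random.default_rng(1)
N, m = 200_000, 64
W = rng.normal(size=(1, m))                            # random Fourier frequencies
b = rng.uniform(0, 2 * np.pi, m)
phi = lambda x: np.sqrt(2.0 / m) * np.cos(x @ W + b)   # feature map Phi

mu = 1.0
x_p = rng.normal(mu, 1.0, size=(N, 1))                 # P = N(mu, 1)
y_q = rng.normal(0.0, 1.0, size=(N, 1))                # Q = N(0, 1)
r = np.exp(mu * y_q[:, 0] - 0.5 * mu**2)               # true ratio p/q at Q's samples

mmd_unweighted = np.linalg.norm(phi(x_p).mean(0) - phi(y_q).mean(0))
mmd_reweighted = np.linalg.norm(phi(x_p).mean(0) - (r[:, None] * phi(y_q)).mean(0))
print(mmd_unweighted, mmd_reweighted)                  # second value is ~ 0
print(r.mean())                                        # ~ 1: the constraint E_Q[r] = 1
```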
In practice, we solve the primal problem (P), while the dual problem (D) and its equivalent constrained form (C) explain why this formulation estimates the KL divergence and reveal η as a Lagrange multiplier enforcing a normalization constraint. Let $r^*$ be the solution; then:

$$D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = \mathbb{E}_Q\, r^* \log(r^*) = -L(w^*, \eta^*). \qquad (16)$$

KALE's witness function $f^*$ in Theorem 1 shares similarity with the MMD witness function: it is the MMD witness function between P and the distribution Q rescaled by the likelihood ratio $r^*$. Ratio estimation via MMD matching for covariate shift appeared in (Gretton et al. 2009; Sugiyama et al. 2008).

**Comparison to f-div.** We also restrict the f-div bound to an RKHS, which is equivalent to the following ratio estimation problem (this follows from the proof of Theorem 1 by eliminating the max over η and setting η = 0), and is consistent with the results of (Nguyen, Wainwright, and Jordan 2008) (see Eq. 51 therein):

$$\text{f-div}: \quad \min_{r > 0}\; \mathbb{E}_Q\, r \log(r) + (1 - \mathbb{E}_Q r) + \frac{1}{2\lambda}\mathrm{MMD}^2_\Phi(rq, p).$$

Let $r^\circ$ be the optimum; then the KL divergence can be estimated as follows:

$$D^{\mathcal{H}}_{f}(P, Q) = \mathbb{E}_Q\, r^\circ \log(r^\circ) + (1 - \mathbb{E}_Q r^\circ). \qquad (17)$$

Comparing $D^{\mathcal{H}}_{\mathrm{DV}}(P, Q)$ and $D^{\mathcal{H}}_{f}(P, Q)$, we see that both have a relative entropy term, but the f-div bound does not impose the normalization constraint on the ratio, which biases the estimate.

**Empirical Estimation.** Note that in Theorem 1, if we replace the loss L by its empirical counterpart $\hat L$ from Eq. 9, the equivalent dual takes the following form:

$$\min_{r_i > 0}\; \frac{1}{N}\sum_{i=1}^{N} r_i \log(r_i) + \Omega^*\!\left(\frac{1}{N}\sum_{i=1}^{N} \Phi(x_i) - \frac{1}{N}\sum_{i=1}^{N} r_i \Phi(y_i)\right) \quad \text{subject to } \frac{1}{N}\sum_{i=1}^{N} r_i = 1, \qquad (18)$$

and the KALE estimate is given by $\hat D^{\mathcal{H}}_{\mathrm{DV}}(P, Q) = \frac{1}{N}\sum_{i=1}^{N} r_i^* \log(r_i^*) = -\hat L(w^*, \eta^*)$.

**From RKHS to Neural Estimation.** One shortcoming of the RKHS approach is that the method depends on the choice of the feature map Φ. We propose to learn Φ as a deep neural network, as in MINE (Belghazi et al. 2018). Given samples $x_i$ from P and $y_i$ from Q, the KL estimation becomes:

$$\hat D^{\mathrm{NN}}_{\mathrm{DV}}(P, Q) = -\min_{\Phi, w, \eta}\; \hat L_\Phi(w, \eta), \qquad (19)$$

which can be solved using block coordinate descent (BCD) on (w, η, Φ). Note that if Φ(·) is known and fixed, the optimization problem in Eq. 19 becomes convex in η and w. We refer to this bound as η-DV convex.

**Algorithm 1: η-MINE (Stochastic BCD)**
- Inputs: dataset $X \in \mathbb{R}^{N \times d_x}$, $Y \in \mathbb{R}^{N \times d_y}$, such that $(x_i = X_{i,\cdot},\, y_i = Y_{i,\cdot}) \sim p_{xy}$
- Hyperparameters: $\alpha_\eta$, $\alpha_\theta$ (learning rates), $n_c$ (number of critic updates)
- Initialize η and the parameters θ of the neural network $f_\theta$
- for i = 1 ... Maxiter do
  - for j = 1 ... $n_c$ do
    - Fetch a minibatch of size N: $(x_i, y_i) \sim p_{xy}$
    - Fetch a minibatch of size N: $(\bar x_i, \bar y_i) \sim p_x p_y$
    - Evaluate $\hat L(f_\theta, \eta)$
    - Stochastic gradient step on θ: $\theta \leftarrow \theta - \alpha_\theta\, \partial \hat L(f_\theta, \eta)/\partial \theta$
  - end for
  - Update η: $\eta \leftarrow \eta - \alpha_\eta\, \partial \hat L(f_\theta, \eta)/\partial \eta$
- end for
- Output: $f_\theta$, η, and $\hat I^{\mathcal{H}}_{\eta\text{-DV}}(X, Y) = \frac{1}{N}\sum_{i=1}^{N} f_\theta(x_i, y_i) - e^{-\eta}\frac{1}{N}\sum_{i=1}^{N} e^{f_\theta(\bar x_i, \bar y_i)} - \eta + 1$

**Observations about what Neural KL/MI Estimators Learn.** 1) Ratio estimation via feature matching: KL divergence estimation with variational bounds boils down to ratio estimation using a form of MMD matching. 2) Choice of feature/architecture: the choice of the feature space or the architecture of the network introduces bias in the ratio estimation, as also observed in (Tschannen et al. 2019). 3) Ratio normalization: the η-DV bound introduces a normalization constraint on the ratio, ensuring a better estimate.
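As a concrete companion to Algorithm 1, below is a minimal PyTorch sketch (ours; the critic architecture, learning rates, and the toy sampler are illustrative assumptions). The product-of-marginals minibatch is formed by shuffling y within the batch, and the MI estimate is read off as $-\hat L$, matching Algorithm 1's output.

```python
# A minimal eta-MINE training loop in the spirit of Algorithm 1 (sketch, ours).
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, dx, dy, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dx + dy, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def eta_mine_loss(f, eta, x, y):
    # hat{L}(f_theta, eta) = e^{-eta} * mean(e^{f(xbar, ybar)}) - mean(f(x, y)) + eta - 1
    y_shuffled = y[torch.randperm(y.size(0))]   # approximate samples from p_x p_y
    joint = f(x, y).mean()
    marg = torch.exp(f(x, y_shuffled)).mean()
    return torch.exp(-eta) * marg - joint + eta - 1.0

dx, dy, N = 5, 5, 512
f = Critic(dx, dy)
eta = torch.zeros(1, requires_grad=True)
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)
opt_eta = torch.optim.Adam([eta], lr=1e-3)

for step in range(2000):
    x = torch.randn(N, dx)                      # stand-in sampler for p_xy;
    y = x + 0.5 * torch.randn(N, dy)            # replace with real data
    for _ in range(5):                          # n_c critic updates on theta
        loss = eta_mine_loss(f, eta, x, y)
        opt_f.zero_grad(); loss.backward(); opt_f.step()
    loss = eta_mine_loss(f, eta, x, y)          # one update on eta
    opt_eta.zero_grad(); loss.backward(); opt_eta.step()

print("MI estimate:", -eta_mine_loss(f, eta, x, y).item())   # hat{I} = -hat{L}
```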
| MI estimator (bound) | Loss to minimize $\hat L$ | Constraints |
|---|---|---|
| DV-MINE (DV) | $\log\big(\tfrac{1}{N}\sum_{i=1}^N e^{f(\bar x_i,\bar y_i)}\big) - \tfrac{1}{N}\sum_{i=1}^N f(x_i,y_i)$ | $f$ is a DNN |
| Unbiased DV-MINE (DV) | $\tfrac{1}{m}\cdot\tfrac{1}{N}\sum_{i=1}^N e^{f(\bar x_i,\bar y_i)} - \tfrac{1}{N}\sum_{i=1}^N f(x_i,y_i)$ | $f$ is a DNN; $m$ is a running avg. of $\tfrac{1}{N}\sum_{i=1}^N e^{f(\bar x_i,\bar y_i)}$ |
| f-MINE (f-div) | $\tfrac{1}{N}\sum_{i=1}^N e^{f(\bar x_i,\bar y_i)} - \tfrac{1}{N}\sum_{i=1}^N f(x_i,y_i) - 1$ | $f$ in RKHS (convex) or $f$ is a DNN |
| a-MINE (DV) | $\tfrac{1}{N}\sum_{i=1}^N \big[\tfrac{e^{f(\bar x_i,\bar y_i)}}{a(\bar y_i)} + \log a(\bar y_i)\big] - \tfrac{1}{N}\sum_{i=1}^N f(x_i,y_i) - 1$ | $a$ is a DNN, $a > 0$; $f$ is a DNN |
| InfoNCE (-) | $\tfrac{1}{N}\sum_{i=1}^N \big[\log\tfrac{1}{N}\sum_{j=1}^N e^{f(x_i,y_j)} - f(x_i,y_i)\big]$ | $f$ is a DNN |
| η-MINE (ours: η-DV) | $e^{-\eta}\tfrac{1}{N}\sum_{i=1}^N e^{f(\bar x_i,\bar y_i)} - \tfrac{1}{N}\sum_{i=1}^N f(x_i,y_i) + \eta - 1$ | $f$ in RKHS (convex) or $f$ is a DNN; $\eta \in \mathbb{R}$ |
| η-MINE-convex (ours: η-DV) | $e^{-\eta}\tfrac{1}{N}\sum_{i=1}^N e^{\langle w,\Phi(\bar x_i,\bar y_i)\rangle} - \tfrac{1}{N}\sum_{i=1}^N \langle w,\Phi(x_i,y_i)\rangle + \eta - 1$ | $\Phi(\cdot)$ is fixed; $w \in \mathbb{R}^{\dim(\Phi)}$, $\eta \in \mathbb{R}$ |

Table 1: Given iid samples $\bar x_i \sim p_x$, $\bar y_i \sim p_y$ from the marginals and samples $(x_i, y_i) \sim p_{xy}$ from the joint, we list some MI estimators, the corresponding variational bounds, associated losses, and constraints on the function space. MI estimators include the biased and unbiased DV-MINE from DV (Belghazi et al. 2018), f-MINE from f-div (Nguyen, Wainwright, and Jordan 2008; Nowozin, Cseke, and Tomioka 2016), InfoNCE (Oord, Li, and Vinyals 2018), a-MINE from DV (Poole et al. 2019), and η-MINE, η-MINE-convex from η-DV (ours).

## 4 η-DV Mutual Information Estimation

We now specialize the KL estimators of Section 3 to the task of MI estimation. Given a function space $\mathcal{H}$ defined on $\mathcal{X} \times \mathcal{Y}$,

$$I(X; Y) \geq I^{\mathcal{H}}_{\mathrm{DV}}(X; Y) \geq I^{\mathcal{H}}_{f\text{-div}}(X; Y) = 1 + \sup_{f \in \mathcal{H}}\; \mathbb{E}_{p_{xy}} f(x, y) - \mathbb{E}_{p_x}\mathbb{E}_{p_y}\, e^{f(x,y)}, \qquad (20)$$

where $I^{\mathcal{H}}_{\mathrm{DV}} = \sup_{f \in \mathcal{H}} \big\{\mathbb{E}_{p_{xy}} f(x, y) - \log \mathbb{E}_{p_x}\mathbb{E}_{p_y}\, e^{f(x,y)}\big\}$. Equivalently, with the η-trick we can rewrite

$$I^{\mathcal{H}}_{\mathrm{DV}} = -\inf_{f \in \mathcal{H},\, \eta}\; e^{-\eta}\,\mathbb{E}_{p_x p_y}\, e^{f(x,y)} - \mathbb{E}_{p_{xy}} f(x, y) + \eta - 1.$$

Now, given iid samples $\bar x_i \sim p_x$ and $\bar y_i \sim p_y$ from the marginals, for $i = 1 \dots N$, and samples $(x_i, y_i) \sim p_{xy}$ from the joint, for $i = 1 \dots N$, we can estimate the MI via $\hat D^{\mathcal{H}}_{\mathrm{DV}}(p_{xy}, p_x p_y)$, given by the expression below:

$$-\inf_{f \in \mathcal{H},\, \eta}\; e^{-\eta} \frac{1}{N}\sum_{i=1}^{N} e^{f(\bar x_i, \bar y_i)} - \frac{1}{N}\sum_{i=1}^{N} f(x_i, y_i) + \eta - 1.$$

When $\mathcal{H}$ is an RKHS, η can be seen as a Lagrange multiplier ensuring that the estimate $r(x, y)$ of the ratio $\frac{p_{xy}}{p_x p_y}$ is normalized to one when integrated against $p_x p_y$, i.e., η is the Lagrange multiplier associated with the constraint $\int_{\mathcal{X} \times \mathcal{Y}} r(x, y)\, p_x(x) p_y(y)\, dx\, dy = 1$. Table 1 reviews other variational bounds for MI based on the f-div, DV, and η-DV bounds; a-MINE from (Poole et al. 2019) is discussed next. We establish in Appendix C that the DV bound can be made tighter, arriving at the a-MINE estimator of (Poole et al. 2019), which we refer to as $I^{\mathcal{H}}_{a\text{-DV}}(X; Y)$. We then derive the following hierarchy of lower bounds:

$$I(X; Y) \geq I^{\mathcal{H}}_{a\text{-DV}}(X; Y) \geq I^{\mathcal{H}}_{\eta\text{-DV}}(X; Y) \geq I^{\mathcal{H}}_{f\text{-div}}(X; Y).$$

We discuss in Appendix C that, while this suggests a tighter bound than η-DV, a-MINE is prone to higher estimation errors since it estimates a function a(y) (see Table 1), whereas η-DV estimates a scalar.

**Sample Complexity.** We discuss in Appendix B the sample complexity of η-MINE and show that, by a reduction of the DV bound to the f-div bound, the convergence results of (Nguyen, Wainwright, and Jordan 2008) apply, yielding a sample complexity of $O(1/\sqrt{N})$. Algorithm 1 outlines the steps of η-MINE for MI estimation.

## 5 Experiments

The proposed η-MINE MI estimator is compared to existing baselines on several applications using synthetic and real data.

**MI estimation.** We compared different MI estimators on three synthetically generated Gaussian datasets [5K training and 1K testing samples]. Each evaluated MI estimator was run 10 times, and the average performance (± standard deviation) is shown at the top of Fig. 3. Clearly, MI estimation in high dimensions is a challenging task, where the estimation error of all methods increases as the data dimensionality grows [the red line shows the true MI value]. Nevertheless, the proposed η-MINE achieves on average more accurate results than the existing baselines.
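For reference, the true MI [red line] in these Gaussian experiments has a closed form in terms of covariance determinants; a short sketch (ours; the random SPD covariance construction is illustrative) of how such a ground-truth value can be computed:

```python
# For jointly Gaussian (X, Y): I(X; Y) = 0.5 * (log det S_xx + log det S_yy - log det S).
import numpy as np

rng = np.random.default_rng(2)
dx = dy = 10
A = rng.normal(size=(dx + dy, dx + dy))
Sigma = A @ A.T + (dx + dy) * np.eye(dx + dy)   # random SPD joint covariance

Sxx = Sigma[:dx, :dx]                           # marginal covariance of X
Syy = Sigma[dx:, dx:]                           # marginal covariance of Y
mi = 0.5 * (np.linalg.slogdet(Sxx)[1] + np.linalg.slogdet(Syy)[1]
            - np.linalg.slogdet(Sigma)[1])
print("I(X; Y) =", mi, "nats")
```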
Moreover, the convex formulation of η-MINE has overall better performance and a faster convergence rate. This estimator has a linear witness function, defined as $f(\cdot) = \langle w, \Phi(\cdot)\rangle$, using a pre-trained, fixed feature map $\Phi(\cdot)$ from regular η-MINE. In the experiments that follow, we compare the proposed estimator η-MINE to DV-MINE as the main baseline approach.

**MI-regularized GAN.** We investigate GAN training improvements with MI, especially the reduction of mode collapse, as addressed in Section 5.1 of (Belghazi et al. 2018), which uses a 25-Gaussian dataset to show improvements in GAN clustering by using an MI objective for regularization. As in InfoGAN (Chen et al. 2016), the conditional generator G is supplied with random codes c along with the noise z; we maximize the mutual information I(G(z, c); c) using the η-MINE estimator. In Fig. 3, we establish that η-MINE recovers all modes within fewer steps than DV-MINE and with more stable training.

Figure 3: Performance of different MI estimators on synthetic Gaussian data. Left: MI estimation. Data was sampled from 2-, 10-, and 20-dimensional Gaussian distributions with random means and random symmetric, positive-definite covariance matrices (i.e., random dependencies, which is a difficult scenario). As we increase data complexity, the difference between estimators decreases, although we observed that η-MINE (or its convex extension) on average performed better than baseline methods, converging to the true MI [red line] faster. Right: MI for GAN regularization. Top row: the unconditional GAN baseline fails at capturing all 25 modes in the Gaussian mixture. Middle row: the MI-regularized conditional GAN using DV-MINE (Belghazi et al. 2018) converges after 140K generator steps; we found this estimator to be sensitive to hyper-parameters and unstable. Bottom row: the MI-regularized conditional GAN using η-MINE converges in 50K steps.

**Self-supervised: Deep InfoMax.** In unsupervised or self-supervised learning, the objective is to build a model without relying on labeled data, using an auxiliary task to learn informative features that can be useful for various downstream tasks. Here, we evaluate the effectiveness of the proposed η-MINE estimator for unsupervised feature learning using the recently proposed Deep InfoMax method of (Hjelm et al. 2019). For feature representation we used an encoder similar to DCGAN (Radford, Metz, and Chintala 2015), shown in Fig. 4.a, and evaluated the results on the CIFAR10 and STL10 datasets (STL10 images were scaled down to match the CIFAR10 resolution). The encoder is trained by maximizing the MI I(x′; y′) between features from a shallow layer l (x′ = E_l(x)) and a deeper layer k (y′ = E_k(x)). We examined different layer combinations and found that an encoder composed of only the first two convolutional layers gives the best performance on the downstream tasks. As shown in Fig. 4.b, the encoder features are passed through additional trainable neural network layers, whose job is to build a classifier f(x′, y′) discriminating cases where x′ and y′ come from the same image from cases where x′ and y′ are unrelated. Finally, we attach a linear layer to the pre-trained and now fixed encoder (see Fig. 4.c) to perform supervised training. Table 2 presents the results for the two MI estimators η-MINE and DV-MINE, whose loss functions are listed in Table 1.
As can be seen, η-MINE-based pre-training performs competitively with DV-MINE, achieving overall better results on both datasets and showing the practical benefits of the proposed approach.

| Train | DV (C10 eval) | η (C10 eval) | Sup. (C10 eval) | DV (S10 eval) | η (S10 eval) | Sup. (S10 eval) |
|---|---|---|---|---|---|---|
| C10 | 77.5 | 74.8 | 84.2 | 55.1 | 56.4 | - |
| S10 | 67.5 | 68.3 | - | 61.8 | 63.3 | 61.3 |

Table 2: Classification accuracy (top1) results on CIFAR10 (C10) and STL10 (S10) for the unsupervised pre-training task with DV-MINE and η-MINE, using the encoder in Fig. 4.b. For reference, we list results for supervised (CE) training using the full encoder in Fig. 4.a. η-MINE-based pre-training achieves better results, outperforming the supervised model on S10.

Figure 4: (a) Encoder architecture and classification accuracy (top1) on the CIFAR-10 dataset for different pre-trained encoders. Each number represents the test accuracy of the system trained by maximizing MI between features from the layers pointed to by the corresponding arrows. Interestingly, the highest accuracy was obtained by pre-training an encoder composed of just the first two convolutional layers (see (b) and (c) for details of this process). (b) Model pre-training by maximizing MI between features from different layers [additionally transformed by a neural net aimed at constructing the witness function f(·)]. (c) After pre-training, we fix the encoder and attach a trainable linear layer to perform supervised tasks.

**Self-Supervised: Jigsaws with MI.** The self-supervised Jigsaw pretext task (Noroozi and Favaro 2016) aims at solving an image jigsaw puzzle by predicting a scrambling permutation π. From an image X with 3×3 jigsaw patches $x = \{x_1, \dots, x_9\}$ and a permutation $y = \pi_i : [1, 9] \to [1, 9]$, the scrambled patches $x_{\pi_i} = \{x_{\pi_i(1)}, \dots, x_{\pi_i(9)}\}$ generate the puzzle. Each patch is fed to an encoder E, which must learn meaningful representations so that a classifier $C_J$ can solve the puzzle by predicting the scrambling order $\pi_i$ (Noroozi and Favaro 2016) (see Fig. 5.a). While this standard formulation relies on a CE-based classification of the permutation, we propose MI Jigsaw, where an encoder E is trained to maximize $I(E(x_{\pi_i}); \pi_i)$ using the DV- and η-MINE estimators, as seen in Fig. 5.b. A patch preprocessing similar to (Kolesnikov, Zhai, and Beyer 2019) avoids shortcuts based on color aberration, patch edge matching, etc.; for details of our implementation, see Appendix E. All models are built on a 10% subset of ImageNet (128K training images, 5K validation images, 1K classes) as proposed by (Kolesnikov, Zhai, and Beyer 2019); this is a larger set than the Tiny ImageNet (200 classes) used in many publications. E is a ResNet50 for all our experiments. In our ImageNet target classification task, the encoders E from the CE and MI Jigsaw trainings are frozen (at 200 epochs) and followed by a linear classifier C (Fig. 5.c), an adequate setup for comparing encoders, as argued by (Kolesnikov, Zhai, and Beyer 2019).

Figure 5: (a) Jigsaw CE training. (b) Jigsaw MI training. (c) ImageNet classification CE training.

| | DV-MINE | η-MINE | CE |
|---|---|---|---|
| top1 | 8.5 ± 1.3 | 11.0 ± 1.1 | 12.9 ± 0.3 |
| top5 | 20.0 ± 2.5 | 24.1 ± 2.2 | 28.1 ± 0.6 |
| top10 | 27.8 ± 3.1 | 32.7 ± 2.2 | 37.7 ± 0.6 |

Table 3: ImageNet (10% subset) classification accuracies (in %). DV-MINE and η-MINE use the fixed encoder from MI training. CE uses a CE Jigsaw encoder. Means ± std. deviations over 8 models from different initializations are reported.
Table 3 reports the best accuracies for all models on the target task, with classifiers C trained for exactly 200 epochs. For all results, E is trained on the Jigsaw task (CE or MI) and frozen, with only C trained, as in (Kolesnikov, Zhai, and Beyer 2019). DV- and η-MINE share the same f architecture. η-MINE gives better accuracy than DV-MINE. The CE model, trained from a Jigsaw-supervised encoder E, provides an upper bound on supervised performance for E from the Jigsaw task. Although CE is better than η- and DV-MINE, η-MINE does a respectable job at learning a representation space for the target task, better than DV-MINE. Details of the Jigsaw experiments are given in Appendix E.

## 6 Conclusion

In this paper, we introduced a new lower bound on the KL divergence and demonstrated how it can improve the estimation of MI using neural networks. Theoretically, we proved that the dual of our η-DV formulation reduces to a constrained likelihood ratio estimation. In practice, the stability of η-MINE is due to its unbiased gradients. We tested our estimator η-MINE on synthetic data and applied it to various real-world tasks where MI can be used as a regularizer or as an objective in self-supervised learning. We used η-MINE in unsupervised learning of representations through MI maximization and by solving Jigsaw puzzles.

## References

Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2017. Deep Variational Information Bottleneck. In ICLR.

Alemi, A. A.; Poole, B.; Fischer, I.; Dillon, J. V.; Saurous, R. A.; and Murphy, K. 2018. Fixing a Broken ELBO. In International Conference on Machine Learning.

Argyriou, A.; Evgeniou, T.; and Pontil, M. 2008. Convex Multi-task Feature Learning. Machine Learning.

Bach, F.; Jenatton, R.; and Mairal, J. 2011. Optimization with Sparsity-Inducing Penalties. Foundations and Trends in Machine Learning. Hanover, MA, USA: Now Publishers Inc.

Barber, D.; and Agakov, F. V. 2003. The IM Algorithm: A Variational Approach to Information Maximization. In NIPS.

Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; and Hjelm, D. 2018. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, 531-540.

Bell, A. J.; and Sejnowski, T. J. 1995. An Information-maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation.

Blei, D. M.; Kucukelbir, A.; and McAuliffe, J. D. 2017. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association.

Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS.

Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised Visual Representation Learning by Context Prediction. In 2015 IEEE International Conference on Computer Vision (ICCV).

Donsker, M.; and Varadhan, S. 1976. Asymptotic evaluation of certain Markov process expectations for large time. Communications on Pure and Applied Mathematics.

Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A Kernel Two-sample Test. JMLR.

Gretton, A.; Smola, A.; Huang, J.; Schmittfull, M.; Borgwardt, K.; and Schölkopf, B. 2009. Covariate Shift by Kernel Mean Matching. In Dataset Shift in Machine Learning. MIT Press.

Gretton, A.; Sutherland, D.; and Jitkrittum, W. 2019. Interpretable comparison of distributions and models. In NIPS.

Han, Y.; Jiao, J.; Weissman, T.; and Wu, Y. 2017. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141.
Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.

Hu, W.; Miyato, T.; Tokui, S.; Matsumoto, E.; and Sugiyama, M. 2017. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning.

Kandasamy, K.; Krishnamurthy, A.; Poczos, B.; Wasserman, L.; et al. 2015. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, 397-405.

Kolchinsky, A.; Tracey, B. D.; and Wolpert, D. H. 2017. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.

Kolesnikov, A.; Zhai, X.; and Beyer, L. 2019. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1920-1929.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004a. Estimating mutual information. Physical Review E.

Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004b. Estimating mutual information. Physical Review E 69(6).

Krause, A.; Perona, P.; and Gomes, R. G. 2010. Discriminative Clustering by Regularized Information Maximization. In Advances in Neural Information Processing Systems 23.

Mohamed, S.; and Lakshminarayanan, B. 2016. Learning in Implicit Generative Models. arXiv:1610.03483.

Moon, K. R.; Sricharan, K.; and Hero, A. O. 2017. Ensemble estimation of mutual information. In IEEE International Symposium on Information Theory, 3030-3034.

Nguyen, X.; Wainwright, M. J.; and Jordan, M. I. 2008. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS.

Noroozi, M.; and Favaro, P. 2016. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. Lecture Notes in Computer Science, 69-84.

Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In NIPS.

Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; and Tucker, G. 2019. On variational bounds of mutual information. In International Conference on Machine Learning, 5171-5180. PMLR.

Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.

Singh, S.; and Póczos, B. 2017. Nonparanormal information estimation. In Proceedings of the International Conference on Machine Learning, 3210-3219.

Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Schölkopf, B.; and Lanckriet, G. R. G. 2009. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698.

Sugiyama, M.; Nakajima, S.; Kashima, H.; Buenau, P. V.; and Kawanabe, M. 2008. Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation. In Advances in Neural Information Processing Systems 20, 1433-1440.

Tishby, N.; Pereira, F. C.; and Bialek, W. 1999. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 368-377.
Tschannen, M.; Djolonga, J.; Rubenstein, P. K.; Gelly, S.; and Lucic, M. 2019. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.

Zeng, X.; Xia, Y.; and Tong, H. 2018. Jackknife approach to the estimation of mutual information. Proceedings of the National Academy of Sciences 115(40): 9956-9961.