# Max-Sliced Mutual Information

Dor Tsur (Ben-Gurion University), Ziv Goldfeld (Cornell University), Kristjan Greenewald (MIT-IBM Watson AI Lab)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Quantifying dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several tasks, encompassing independence testing, multi-view representation learning, algorithmic fairness, and generative modeling. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.

1 Introduction

Dependence measures between random variables are fundamental in statistics and machine learning for tasks spanning independence testing [1–3], clustering [4, 5], representation learning [6, 7], and self-supervised learning [8–10]. There is a myriad of measures quantifying different notions of dependence, with varying statistical and computational complexities. The simplest is the Pearson correlation coefficient [11], which only captures linear dependencies. At the other extreme is Shannon's mutual information [12], which is a universal dependence measure that is able to identify arbitrarily intricate dependencies. Despite its universality and favorable properties, accurately estimating mutual information from data is infeasible in high-dimensional settings. First, mutual information estimation suffers from the curse of dimensionality, whereby convergence rates deteriorate exponentially with dimension [13]. Additionally, computing mutual information requires integrating a log-likelihood ratio over a high-dimensional space, which is generally intractable. Between these two extremes is the popular canonical correlation analysis (CCA) [14], which identifies maximally correlated linear projections of the variables. Nevertheless, classical CCA still only captures linear dependence, which has inspired nonlinear extensions such as Hirschfeld–Gebelein–Rényi (HGR) maximum correlation [15–17], kernel CCA [18, 19], deep CCA [20, 7], and various other generalizations [21–24]. However, HGR is computationally infeasible, while kernel and deep CCA can be burdensome in high dimensions, as they require optimization over reproducing kernel Hilbert spaces or deep neural networks, respectively.
To overcome these shortcomings, this work proposes max-sliced mutual information (mSMI), a scalable information-theoretic extension of CCA that captures the full dependence structure while only requiring optimization over linear projections.

1.1 Contributions

The mSMI is defined as the maximal mutual information between linear projections of the variables. Namely, the k-dimensional mSMI between X and Y with values in ℝ^{d_x} and ℝ^{d_y}, respectively, is¹

$$\mathrm{SI}_k(X;Y) := \sup_{(A,B)\in \mathrm{St}(k,d_x)\times \mathrm{St}(k,d_y)} I(A^\top X; B^\top Y),$$

where St(k, d) is the Stiefel manifold of d × k matrices with orthonormal columns.

¹The parameter k is fixed and small compared to the ambient dimensions d_x, d_y, often simply set as k = 1.

Unlike the nonlinear CCA variants that use nonlinear feature extractors in the high-dimensional ambient spaces, mSMI retains the linear projections of CCA and captures nonlinear structures in the low-dimensional feature space. This is done by using the mutual information between the projected variables, rather than correlation, as the optimization objective. Beyond being considerably simpler from a computational standpoint, this crucial difference allows mSMI to identify the full dependence structure, akin to classical mutual information. mSMI can also be viewed as the maximized version of the average-sliced mutual information (aSMI) [25, 26], which averages I(A⊤X; B⊤Y) with respect to (w.r.t.) the Haar measure over St(k, d_x) × St(k, d_y). However, we demonstrate that compared to aSMI, mSMI benefits from improved neural estimation error bounds and a clearer interpretation.

We show that mSMI inherits important properties of mutual information, including identification of independence, tensorization, and variational forms. For jointly Gaussian (X, Y), the optimal mSMI projections coincide with those of k-dimensional CCA [27], posing mSMI as a natural information-theoretic generalization. Beyond the Gaussian case, the solutions differ and mSMI may yield more effective representations for downstream tasks due to the intricate dependencies captured by mutual information. We demonstrate this superiority empirically for multi-view representation learning.

For efficient computation, we propose an mSMI neural estimator based on the Donsker-Varadhan (DV) variational form [28]. Neural estimators have seen a surge in interest due to their scalability and compatibility with gradient-based optimization [29–36]. Our estimator employs a single model that composes the projections with the neural network approximation of the DV critic, and then jointly optimizes them. This results in both the estimated mSMI value and the optimal projection matrices. Building on recent analysis of neural estimation of f-divergences [37, 38], we establish non-asymptotic error bounds that scale as O(k^{1/2}(ℓ^{-1/2} + k n^{-1/2})), where ℓ and n are the numbers of neurons and (X, Y) samples, respectively. Equating ℓ and n results in the (minimax optimal) parametric estimation rate, which highlights the scalability of mSMI and its compatibility with modern learning settings.

In our empirical investigation, we first demonstrate that our mSMI neural estimator converges orders of magnitude faster than that of aSMI [26]. This is because the latter requires (parallel) training of many neural estimators corresponding to different projection directions, while the mSMI estimator optimizes a single combined model. Notwithstanding the reduction in computational overhead, we show that mSMI outperforms average-slicing for independence testing.
Next, we compare mSMI with deep CCA [20, 7] by examining downstream classification accuracy based on representations obtained from both methods in a multi-view learning setting. Remarkably, we observe that even the linear mSMI projections outperform nonlinear representations obtained from deep CCA. We also consider an application to algorithmic fairness under the infomin framework [39]. Replacing their generalized Pearson correlation objective with mSMI, we again observe superior performance in the form of fairer representations whose utility remains on par with the fairness-agnostic model. Lastly, we devise a max-sliced version of the InfoGAN by replacing the classic mutual information regularizer with its max-sliced analog. We show that despite the low-dimensional projections, the max-sliced InfoGAN successfully learns to disentangle the latent space and generates quality samples.

2 Background and Preliminaries

Notation. For a, b ∈ ℝ, we use the notation a ∧ b = min{a, b} and a ∨ b = max{a, b}. For d ≥ 1, ‖·‖ is the Euclidean norm in ℝ^d. The Stiefel manifold of d × k matrices with orthonormal columns is denoted by St(k, d). For a d × k matrix A, we use p_A : ℝ^d → ℝ^k for the orthogonal projection onto the row space of A. For A ∈ ℝ^{d×k} with rank(A) = r ≤ k ∧ d, we write σ_1(A), ..., σ_r(A) for its non-zero singular values, and assume without loss of generality (w.l.o.g.) that they are arranged in descending order. Similarly, the eigenvalues of a square matrix Σ ∈ ℝ^{d×d} are denoted by λ_1(Σ), ..., λ_d(Σ). Let P(ℝ^d) denote the space of Borel probability measures on ℝ^d. For µ, ν ∈ P(ℝ^d), we use µ ⊗ ν to denote a product measure, while spt(µ) designates the support of µ. All random variables throughout are assumed to be continuous w.r.t. the Lebesgue measure. For a measurable map f, the pushforward of µ under f is denoted by f_♯µ = µ ∘ f^{-1}, i.e., if X ∼ µ then f(X) ∼ f_♯µ. For a jointly distributed pair (X, Y) ∼ µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}), we write Σ_X and Σ_XY for the covariance matrix of X and the cross-covariance matrix of (X, Y), respectively.

Canonical correlation analysis. CCA is a classical method for devising maximally correlated linear projections of a pair of random variables (X, Y) ∼ µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}) via [14]

$$(\theta_{\mathrm{CCA}},\phi_{\mathrm{CCA}}) = \operatorname*{argmax}_{(\theta,\phi)\in\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}} \frac{\theta^\top\Sigma_{XY}\phi}{\sqrt{\theta^\top\Sigma_X\theta\,\phi^\top\Sigma_Y\phi}} = \operatorname*{argmax}_{\substack{(\theta,\phi)\in\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}:\\ \theta^\top\Sigma_X\theta=\phi^\top\Sigma_Y\phi=1}} \theta^\top\Sigma_{XY}\phi, \tag{1}$$

where the former objective is the correlation coefficient ρ(θ⊤X, φ⊤Y) between the projected variables and the equality follows from invariance of ρ to scaling. The global optimum has an analytic form as (θ_CCA, φ_CCA) = (Σ_X^{-1/2} θ_1, Σ_Y^{-1/2} φ_1), where (θ_1, φ_1) is the (unit-length) top left- and right-singular vector pair associated with the largest singular value of T_XY := Σ_X^{-1/2} Σ_XY Σ_Y^{-1/2} ∈ ℝ^{d_x×d_y}. This solution is efficiently computable in O((d_x ∨ d_y)³) time, given that the population correlation matrices are known. CCA extends to k-dimensional projections via the optimization [27]

$$\max_{\substack{(A,B)\in\mathbb{R}^{d_x\times k}\times\mathbb{R}^{d_y\times k}:\\ A^\top\Sigma_X A = B^\top\Sigma_Y B = I_k}} \operatorname{tr}\big(A^\top\Sigma_{XY} B\big), \tag{2}$$

with the optimal CCA matrices being (A_CCA, B_CCA) = (Σ_X^{-1/2} U_k, Σ_Y^{-1/2} V_k), where U_k and V_k are the matrices of the first k left- and right-singular vectors of T_XY. The optimal objective value then becomes the sum of the top k singular values of T_XY (namely, its Ky Fan k-norm).
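For concreteness, the following is a short numpy sketch of the population-level k-dimensional CCA recipe above (the function names are ours and the covariances are assumed known and nonsingular; this is an illustration rather than a reference implementation):

```python
import numpy as np

def inv_sqrt(S):
    # inverse square root of a symmetric positive-definite matrix via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_k(Sigma_X, Sigma_XY, Sigma_Y, k):
    """k-dimensional CCA via the SVD of the whitened cross-covariance T_XY, cf. Eq. (2)."""
    Sx, Sy = inv_sqrt(Sigma_X), inv_sqrt(Sigma_Y)
    T = Sx @ Sigma_XY @ Sy
    U, s, Vt = np.linalg.svd(T)
    A = Sx @ U[:, :k]          # A_CCA: columns span the optimal X-side subspace
    B = Sy @ Vt.T[:, :k]       # B_CCA
    return A, B, s[:k]         # projections and the top-k canonical correlations
```

The optimal objective value in (2) is then simply `s[:k].sum()`, the Ky Fan k-norm of T_XY.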
Divergences and information measures. Let µ, ν ∈ P(ℝ^d) satisfy µ ≪ ν, i.e., µ is absolutely continuous w.r.t. ν. The Kullback-Leibler (KL) divergence is defined as D(µ‖ν) := ∫_{ℝ^d} log(dµ/dν) dµ. We have D(µ‖ν) ≥ 0, with equality if and only if (iff) µ = ν. Mutual information and differential entropy are defined from the KL divergence as follows. Let (X, Y) ∼ µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}) and denote the corresponding marginal distributions by µ_X and µ_Y. The mutual information between X and Y is given by I(X; Y) := D(µ_XY ‖ µ_X ⊗ µ_Y) and serves as a measure of dependence between those random variables. The differential entropy of X is defined as h(X) = h(µ_X) := −D(µ_X ‖ Leb). Mutual information between (jointly) continuous variables and differential entropy are related via I(X; Y) = h(X) + h(Y) − h(X, Y); decompositions in terms of conditional entropies are also available [40].

3 Max-Sliced Mutual Information

We now define the k-dimensional mSMI, establish structural properties thereof, and explore the Gaussian setting and its connections to CCA. We focus here on the case of (linear) k-dimensional projections and discuss extensions to nonlinear slicing in Section 3.3.

Definition 1 (Max-sliced mutual information). For 1 ≤ k ≤ d_x ∧ d_y, the k-dimensional mSMI between (X, Y) ∼ µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}) is

$$\mathrm{SI}_k(X;Y) := \sup_{(A,B)\in \mathrm{St}(k,d_x)\times \mathrm{St}(k,d_y)} I(A^\top X; B^\top Y), \tag{3}$$

where St(k, d) is the Stiefel manifold of d × k matrices with orthonormal columns.

The mSMI measures Shannon's mutual information between the most informative k-dimensional projections of X and Y. It can be viewed as a maximized version of the aSMI, denoted aSMI_k(X; Y), from [25, 26], defined as the integral of I(A⊤X; B⊤Y) w.r.t. the Haar measure over St(k, d_x) × St(k, d_y). For d = d_x = d_y, we have aSMI_d(X; Y) = SI_d(X; Y) = I(X; Y) due to the invariance of mutual information to bijections. The supremum in mSMI is achieved since the Stiefel manifold is compact and the function (A, B) ↦ I(A⊤X; B⊤Y) is Lipschitz and thus continuous (Lemma 2 of [26]).

Remark 1 (Multivariate and conditional mSMI). The mSMI definition above extends to the multivariate and conditional cases as follows. Let (X, Y, Z) ∼ µ_XYZ ∈ P(ℝ^{d_x} × ℝ^{d_y} × ℝ^{d_z}). The k-dimensional multivariate and conditional mSMI functionals are, respectively, SI_k(X, Y; Z) := max_{A,B,C} I(A⊤X, B⊤Y; C⊤Z) and SI_k(X; Y | Z) := max_{A,B,C} I(A⊤X; B⊤Y | C⊤Z). Connections between SI_k(X; Y) and its multivariate and conditional versions are given in the proposition to follow. We also note that one may generalize the definition of SI_k(X; Y) to allow for projections into feature spaces of different dimensions, i.e., A ∈ St(k_x, d_x) and B ∈ St(k_y, d_y), for k_x ≠ k_y. We expect our theory to extend to that case, but leave further exploration for future work.

In the spirit of mSMI, we define the max-sliced differential entropy.

Definition 2 (Max-sliced entropy). The k-dimensional max-sliced (differential) entropy of X ∼ µ_X ∈ P(ℝ^d) is

$$\mathrm{sh}_k(X) := \mathrm{sh}_k(\mu_X) := \sup_{A\in\mathrm{St}(k,d)} h(A^\top X).$$

An important property of classical differential entropy is the maximum entropy principle [40], which finds the highest-entropy distribution within a given class. In Appendix B, we study the max-sliced entropy maximizing distribution in several common scenarios. For instance, we show that sh_k is maximized by the Gaussian distribution under a fixed (mean and) covariance constraint. Namely, letting P_1(m, Σ) := {µ ∈ P(ℝ^d) : spt(µ) = ℝ^d, E_µ[X] = m, E_µ[(X − m)(X − m)⊤] = Σ}, we have argmax_{µ ∈ P_1(m, Σ)} sh_k(µ) = N(m, Σ). An intimate connection between max-sliced entropy and PCA is established in the sequel, under the Gaussian setting.
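To see why the Gaussian attains the maximum in the covariance-constrained case, here is a brief derivation sketch (ours, not the argument of Appendix B), using only the classical Gaussian maximum-entropy bound:

```latex
\begin{align*}
% For any \mu \in \mathcal{P}_1(m,\Sigma) and any fixed A \in \mathrm{St}(k,d), the slice
% A^\top X has mean A^\top m and covariance A^\top \Sigma A, hence
h(A^\top X) &\le \tfrac{1}{2}\log\!\big((2\pi e)^k \det(A^\top \Sigma A)\big),
\qquad \text{with equality iff } A^\top X \text{ is Gaussian.}\\
% If \mu = N(m,\Sigma), every slice A^\top X is Gaussian, so the bound is attained
% simultaneously for all A; taking suprema over A \in \mathrm{St}(k,d) gives
\mathrm{sh}_k(\mu) &\le \sup_{A\in\mathrm{St}(k,d)} \tfrac{1}{2}\log\!\big((2\pi e)^k \det(A^\top \Sigma A)\big)
= \mathrm{sh}_k\big(N(m,\Sigma)\big).
\end{align*}
```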
Remark 2 (Sliced divergences). The slicing technique originated as a means to address scalability issues concerning statistical divergences. Significant attention was devoted to sliced Wasserstein distances as discrepancy measures between probability distributions [41–47]. As such, the sliced Wasserstein distance differs from mutual information and its sliced variants, which quantify dependence between random variables, rather than discrepancy per se. Additionally, as Wasserstein distances are rooted in optimal transport theory, they heavily depend on the geometry of the underlying data space. Mutual information, on the other hand, is induced by the KL divergence, which only depends on the log-likelihood of the considered distributions and overlooks geometry.

3.1 Structural Properties

The following proposition lists useful properties of the mSMI, which are similar to those of the average-sliced variant (cf. [26, Proposition 1]) as well as Shannon's mutual information itself.

Proposition 1 (Structural properties). The following properties hold:

1. Bounds: For any integers k_1 < k_2: aSMI_{k_1}(X; Y) ≤ SI_{k_1}(X; Y) ≤ SI_{k_2}(X; Y) ≤ I(X; Y).

2. Identification of independence: SI_k(X; Y) ≥ 0, with equality iff (X, Y) are independent.

3. KL divergence representation: We have
$$\mathrm{SI}_k(X;Y) = \sup_{(A,B)\in \mathrm{St}(k,d_x)\times \mathrm{St}(k,d_y)} D\big((p_A, p_B)_\sharp \mu_{XY} \,\big\|\, (p_A, p_B)_\sharp (\mu_X \otimes \mu_Y)\big).$$

4. Sub-chain rule: For any random variables X_1, ..., X_n, Y, we have
$$\mathrm{SI}_k(X_1,\dots,X_n;Y) \le \mathrm{SI}_k(X_1;Y) + \sum_{i=2}^{n} \mathrm{SI}_k(X_i;Y \,|\, X_1,\dots,X_{i-1}).$$

5. Tensorization: For mutually independent {(X_i, Y_i)}_{i=1}^n,
$$\mathrm{SI}_k\big(\{X_i\}_{i=1}^n;\{Y_i\}_{i=1}^n\big) = \sum_{i=1}^{n} \mathrm{SI}_k(X_i;Y_i).$$

The proof follows by similar arguments to those in the average-sliced case, but is given for completeness in Supplement A.1. Of particular importance are Properties 2 and 3. The former renders mSMI sufficient for independence testing despite being significantly less complex than the classical mutual information between the high-dimensional variables. The latter, which represents mSMI as a supremized KL divergence, is the basis for the neural estimation techniques explored in Section 4.

Remark 3 (Relation to average-SMI). Beyond the inequality relationship in Property 1 above, Proposition 4 in [25] (paraphrased) shows that for matrices W_x, W_y and vectors b_x, b_y of appropriate dimensions, we have
$$\sup_{W_x, W_y, b_x, b_y} \mathrm{aSMI}_1\big(W_x^\top X + b_x;\, W_y^\top Y + b_y\big) = \mathrm{SI}_1(X;Y),$$
and the relation readily extends to projection dimension k > 1. In words, optimizing the aSMI over linear transformations of the high-dimensional data vectors coincides with the max-sliced version. This further justifies the interpretation of SI_k(X; Y) as the information between the two most informative representations of X, Y in a k-dimensional feature space. It also suggests that mSMI is suitable for feature extraction tasks, as explored in Section 5.3 ahead.

3.2 Gaussian Max-SMI versus CCA

The mSMI is an information-theoretic extension of the CCA coefficient ρ_CCA(X, Y), which is able to capture higher-order dependencies. Interestingly, when (X, Y) are jointly Gaussian, the two notions coincide. We next state this relation and provide a closed-form expression for the Gaussian mSMI.

Proposition 2 (Gaussian mSMI). Let X ∼ N(m_X, Σ_X) and Y ∼ N(m_Y, Σ_Y) be d_x- and d_y-dimensional jointly Gaussian vectors with nonsingular covariance matrices and cross-covariance Σ_XY. For any k ≤ d_x ∧ d_y, we have

$$\mathrm{SI}_k(X;Y) = I\big(A_{\mathrm{CCA}}^\top X;\, B_{\mathrm{CCA}}^\top Y\big) = -\frac{1}{2}\sum_{i=1}^{k}\log\!\big(1-\sigma_i(T_{XY})^2\big), \tag{4}$$

where (A_CCA, B_CCA) are the CCA solutions from (2), T_XY = Σ_X^{-1/2} Σ_XY Σ_Y^{-1/2} ∈ ℝ^{d_x×d_y}, and σ_k(T_XY) ≤ ... ≤ σ_1(T_XY) ≤ 1 are the top k singular values of T_XY (ordered).

This proposition is proven in Supplement A.2. We first show that the optimization domain of SI_k(X; Y) can be switched from the product of Stiefel manifolds to the space of all matrices subject to a unit variance constraint (akin to (2)), without changing the mSMI value. This implies that the CCA solutions (A_CCA, B_CCA) from (2) are feasible for mSMI, and we establish their optimality using a generalization of the Poincaré separation theorem [48, Theorem 2.2].
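As a quick numerical illustration of Proposition 2 (a sketch under the assumption of jointly Gaussian data with known, nonsingular covariances; the helper names are ours), the closed form in (4) follows directly from the singular values of T_XY, and at k = d it recovers the full Gaussian mutual information −(1/2) log det(I − T_XY T_XY⊤):

```python
import numpy as np

def inv_sqrt(S):
    # inverse square root of a symmetric positive-definite matrix
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def gaussian_msmi(Sigma_X, Sigma_XY, Sigma_Y, k):
    """Closed-form k-dimensional mSMI for jointly Gaussian (X, Y), Eq. (4)."""
    T = inv_sqrt(Sigma_X) @ Sigma_XY @ inv_sqrt(Sigma_Y)
    s = np.linalg.svd(T, compute_uv=False)        # canonical correlations
    return -0.5 * np.sum(np.log(1.0 - s[:k] ** 2))

# toy model: X, Y in R^5 with identity marginals and cross-covariance rho * I
d, rho = 5, 0.6
Sigma = np.eye(d)
print([round(gaussian_msmi(Sigma, rho * Sigma, Sigma, k), 3) for k in range(1, d + 1)])
# at k = d the value matches the full Gaussian mutual information:
T = rho * np.eye(d)
print(round(-0.5 * np.linalg.slogdet(np.eye(d) - T @ T.T)[1], 3))
```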
Specializing Proposition 2 to one-dimensional projections, i.e., when k = 1, the mSMI is given in terms of the canonical correlation coefficient ρ_CCA(X, Y) := sup_{(θ,φ) ∈ ℝ^{d_x}×ℝ^{d_y}} ρ(θ⊤X, φ⊤Y). Namely, SI_1(X; Y) = I(θ_CCA⊤X; φ_CCA⊤Y) = −0.5 log(1 − ρ_CCA(X, Y)²), where (θ_CCA, φ_CCA) are the global optimizers of ρ_CCA(X, Y).

Remark 4 (Beyond Gaussian data). While the mSMI solution coincides with that of CCA in the Gaussian case, this is no longer expected to hold for non-Gaussian distributions. CCA is designed to maximize correlation, while mSMI has Shannon's mutual information between the projected variables as the optimization objective. Unlike correlation, mutual information captures higher-order dependencies between the variables, and hence the optimal mSMI matrices will not generally coincide with (A_CCA, B_CCA). Furthermore, the intricate dependencies captured by mutual information suggest that the optimal mSMI projections may yield representations that are more effective for downstream tasks. We empirically verify this observation in Section 5 on several tasks, including classification, multi-view representation learning, and algorithmic fairness.

Similarly to the above, the Gaussian max-sliced entropy is related to PCA [49, 14]. In Supplement A.3, we prove the following.

Proposition 3 (Gaussian max-sliced entropy). For a d-dimensional Gaussian variable X ∼ N(m, Σ), we have

$$\mathrm{sh}_k(X) = \sup_{A\in\mathrm{St}(k,d)} h(A^\top X) = h(A_{\mathrm{PCA}}^\top X) = \frac{1}{2}\sum_{i=1}^{k}\log\!\big(2\pi e\,\lambda_i(\Sigma)\big),$$

where A_PCA is the optimal PCA matrix and λ_1(Σ), ..., λ_k(Σ) are the top k eigenvalues of Σ.

Note that the eigenvalues λ_1(Σ), ..., λ_k(Σ) are non-negative since Σ is a covariance matrix. Extrapolating beyond the Gaussian case, this poses max-sliced entropy as an information-theoretic generalization of PCA for unsupervised dimensionality reduction. An analogous extension using the Rényi entropy of order 2 was previously considered in [50] for the purpose of binary classification. In that regard, sh_k(X) can be viewed as the α-Rényi variant when α → 1.

3.3 Generalizations Beyond Linear Slicing

The notion of mSMI readily generalizes beyond linear slicing. Fix d_x, d_y ≥ 1, k ≤ d_x ∧ d_y, and consider two (nonempty) function classes G ⊆ {g : ℝ^{d_x} → ℝ^k} and H ⊆ {h : ℝ^{d_y} → ℝ^k}.

Definition 3 (Generalized mSMI). The generalized mSMI between (X, Y) ∼ µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}) w.r.t. the classes G and H is SI_{G,H}(X; Y) := sup_{(g,h) ∈ G×H} I(g(X); h(Y)).

The generalized variant reduces back to SI_k(X; Y) by taking G = G_proj := {p_A : A ∈ St(k, d_x)} and H = H_proj := {p_B : B ∈ St(k, d_y)}, but otherwise allows more flexibility in the way (X, Y) are mapped into ℝ^k. We also have that if G ⊆ G′ and H ⊆ H′, then SI_{G,H}(X; Y) ≤ SI_{G′,H′}(X; Y) ≤ I(X; Y), which corresponds to Property 1 from Proposition 1. Further observations are as follows.

Proposition 4 (Properties). For any classes G, H, we have that SI_{G,H} always satisfies Properties 3–5 from Proposition 1. If further G_proj ⊆ G and H_proj ⊆ H, then SI_{G,H} also satisfies Property 2.

We omit the proof as it follows by the same argument as Proposition 1, up to replacing the linear projections with the functions (g, h) ∈ G × H.
In practice, the classes G and H are chosen to be parametric, typically realized by artificial neural networks. As discussed in Remark 5 ahead, this is well-suited to the neural estimation framework for mSMI (both standard and generalized). Lastly, note that SI_{G,H}(X; Y) corresponds to the objective of multi-view representation learning [51], which considers the maximization of the mutual information between NN-based representations of the considered variables. We further investigate this relation in Section 5.3.

4 Neural Estimation of Max-SMI

We study estimation of mSMI from data, seeking an efficiently computable and scalable approach subject to formal performance guarantees. Towards that end, we observe that the mSMI is compatible with neural estimation [29, 38] due to its convenient variational form. In what follows we derive the neural estimator, describe the algorithm to compute it, and provide non-asymptotic error bounds.

4.1 Estimator and Algorithm

Fix d_x, d_y ≥ 1, k ≤ d_x ∧ d_y, and µ_XY ∈ P(ℝ^{d_x} × ℝ^{d_y}); we suppress k, d_x, d_y from our notation of the considered function classes. Neural estimation is based on the DV variational form:²

$$I(X;Y) = \sup_{f\in\mathcal{F}} \mathsf{L}_{\mathrm{DV}}(f;\mu_{XY}), \qquad \mathsf{L}_{\mathrm{DV}}(f;\mu_{XY}) := \mathbb{E}[f(X,Y)] - \log \mathbb{E}\big[e^{f(\bar{X},\bar{Y})}\big],$$

where (X, Y) ∼ µ_XY, (X̄, Ȳ) ∼ µ_X ⊗ µ_Y, and F is the class of all measurable functions f : ℝ^{d_x} × ℝ^{d_y} → ℝ (often referred to as DV potentials) for which the expectations above are finite.

²One may instead use the form that stems from convex duality: I(U; V) = sup_f E[f(U, V)] − E[e^{f(Ū, V̄) − 1}].

As mSMI is the maximal mutual information between projections of X, Y, we have

$$\mathrm{SI}_k(X;Y) = \sup_{(A,B)\in\mathrm{St}(k,d_x)\times\mathrm{St}(k,d_y)}\ \sup_{f\in\mathcal{F}} \mathsf{L}_{\mathrm{DV}}\big(f;(p_A,p_B)_\sharp\mu_{XY}\big) = \sup_{f\in\mathcal{F}^{\mathrm{proj}}} \mathsf{L}_{\mathrm{DV}}(f;\mu_{XY}),$$

where F^proj := {f ∘ (p_A, p_B) : f ∈ F, (A, B) ∈ St(k, d_x) × St(k, d_y)}. The RHS above is given by optimizing the DV objective L_DV over the composed class F^proj, which first projects (X, Y) ↦ (A⊤X, B⊤Y) and then applies a DV potential f : ℝ^k × ℝ^k → ℝ to the projected variables.

Neural estimator. Neural estimators parametrize the DV potential by neural nets, approximate expectations by sample means, and optimize the resulting empirical objective over parameter space. Let F_nn be a class of feedforward networks with input space ℝ^k × ℝ^k and real-valued outputs.³ Given i.i.d. samples (X_1, Y_1), ..., (X_n, Y_n) from µ_XY, we first generate negative samples (i.e., from µ_X ⊗ µ_Y) by taking (X_1, Y_{σ(1)}), ..., (X_n, Y_{σ(n)}), where σ ∈ S_n is a permutation such that σ(i) ≠ i for all i = 1, ..., n. The neural estimator of SI_k(X; Y) is now given by

$$\widehat{\mathrm{SI}}{}^{\mathcal{F}_{\mathrm{nn}}}_{k}(X^n,Y^n) := \sup_{f\in\mathcal{F}^{\mathrm{proj}}_{\mathrm{nn}}} \frac{1}{n}\sum_{i=1}^{n} f(X_i,Y_i) - \log\!\left(\frac{1}{n}\sum_{i=1}^{n} e^{f(X_i,Y_{\sigma(i)})}\right), \tag{5}$$

where F^proj_nn := {f ∘ (p_A, p_B) : f ∈ F_nn, (A, B) ∈ St(k, d_x) × St(k, d_y)} is the composition of the neural network class with the projection maps.

³For now, we leave the architecture (number of layers/neurons, parameter bounds, nonlinearity) implicit to allow flexibility of implementation; we will specialize to a concrete class when providing theoretical guarantees.

The projection maps can be absorbed into the network architecture as a first linear layer that maps the (d_x + d_y)-dimensional input to a 2k-dimensional feature vector, which is then further processed by the original f ∈ F_nn network. Note that projection onto the Stiefel manifold can be efficiently and differentiably computed via QR decomposition. Hence, the Stiefel manifold constraint can be easily enforced by setting A, B to be the projections of unconstrained d × k matrices. Further details on the implementation are given in Supplement C.
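A minimal PyTorch sketch of this construction is given below (ours, for illustration; the module and function names are hypothetical and the architecture choices are arbitrary). Unconstrained d × k parameters are mapped onto the Stiefel manifold by QR on every forward pass, the projected pair is fed to an MLP critic, and the empirical DV objective from (5) is maximized; negatives are obtained by shuffling the Y batch.

```python
import torch
import torch.nn as nn

class MaxSlicedMI(nn.Module):
    """DV-based neural mSMI estimator with trainable Stiefel projections."""
    def __init__(self, dx, dy, k, hidden=64):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dx, k))   # unconstrained; orthonormalized by QR
        self.B = nn.Parameter(torch.randn(dy, k))
        self.critic = nn.Sequential(                # DV potential on the 2k-dim projections
            nn.Linear(2 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def dv_objective(self, x, y):
        A, _ = torch.linalg.qr(self.A)              # differentiable map onto St(k, dx)
        B, _ = torch.linalg.qr(self.B)
        u, v = x @ A, y @ B                         # k-dimensional slices
        joint = self.critic(torch.cat([u, v], dim=1)).squeeze(-1)
        # negatives via a random shuffle (the paper pairs X_i with Y_{sigma(i)}, sigma(i) != i)
        neg = self.critic(torch.cat([u, v[torch.randperm(len(v))]], dim=1)).squeeze(-1)
        log_mean_exp = torch.logsumexp(neg, dim=0) - torch.log(torch.tensor(float(len(neg))))
        return joint.mean() - log_mean_exp          # empirical DV objective, Eq. (5)

def estimate_msmi(x, y, k, epochs=200, lr=1e-3):
    """Jointly optimize projections and critic; returns the estimate and the model."""
    model = MaxSlicedMI(x.shape[1], y.shape[1], k)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        (-model.dv_objective(x, y)).backward()
        opt.step()
    with torch.no_grad():
        return model.dv_objective(x, y).item(), model
```

The two QR calls play the role of the first linear layer described above, so the projections and the critic are trained jointly through a single objective.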
Remark 5 (Nonlinear slicing). For learning tasks that may need more expressive representations of (X, Y), one may employ the nonlinear mSMI variant from Section 3.3. In practice, the classes G = {g_θ} and H = {h_φ} are taken to be parametric, realized by neural networks. These networks naturally compose with the DV critic f_ψ, resulting in a single compound model f_ψ ∘ (g_θ, h_φ).

4.2 Performance Guarantees

Neural estimation involves three sources of error: (i) function approximation of the DV potential; (ii) empirical estimation of the means; and (iii) optimization, which comes from employing suboptimal (e.g., gradient-based) routines. Our analysis provides sharp non-asymptotic bounds for errors of type (i) and (ii), leaving the account of the optimization error for future work. We focus on a class of ℓ-neuron shallow ReLU networks, although the ideas extend to other nonlinearities and deep architectures. Define F^ℓ_nn as the class of all f : ℝ^k × ℝ^k → ℝ of the form

$$f(z) = \sum_{i=1}^{\ell} \beta_i\,\varphi\big(\langle w_i, z\rangle + b_i\big) + \langle w_0, z\rangle + b_0,$$

whose parameters satisfy max_{1≤i≤ℓ} ‖w_i‖_1 ∨ |b_i| ≤ 1, max_{1≤i≤ℓ} |β_i| ≤ a_ℓ/(2ℓ), and |b_0|, ‖w_0‖_1 ≤ a_ℓ, where φ(z) = z ∨ 0 is the ReLU activation and a_ℓ = log log ℓ ∨ 1. Consider the neural mSMI estimator $\widehat{\mathrm{SI}}{}^{\,n,\ell}_{k} := \widehat{\mathrm{SI}}{}^{\,\mathcal{F}^{\ell}_{\mathrm{nn}}}_{k}(X^n, Y^n)$ (see (5)). We provide convergence rates for it over an appropriate distribution class, drawing upon the results of [37] for neural estimation of f-divergences.

For compact X ⊂ ℝ^{d_x} and Y ⊂ ℝ^{d_y}, let P_ac(X × Y) be the set of all Lebesgue absolutely continuous joint distributions µ_XY with spt(µ_XY) ⊆ X × Y. Denote the Lebesgue density of µ_XY by f_XY. The distribution class of interest is⁴

$$\mathcal{P}_k(M,b) := \Big\{\mu_{XY}\in\mathcal{P}_{\mathrm{ac}}(\mathcal{X}\times\mathcal{Y}) : r\in\mathcal{C}^{k+3}_{b}(\mathcal{U}) \text{ for some open set } \mathcal{U}\supset\mathcal{X}\times\mathcal{Y} \text{ s.t. } \log f_{XY} = r|_{\mathcal{X}\times\mathcal{Y}},\ I(X;Y)\le M\Big\}, \tag{6}$$

which, in particular, contains distributions whose densities are bounded from above and below on X × Y with a smooth extension to an open set covering X × Y. This includes uniform distributions, truncated Gaussians, truncated Cauchy distributions, etc.

⁴Here, C^s_b(U) := {f ∈ C^s(U) : max_{α: ‖α‖_1 ≤ s} ‖D^α f‖_{∞,U} ≤ b}, where D^α, α = (α_1, ..., α_d) ∈ ℤ^d_{≥0}, is the partial derivative operator of order Σ^d_{i=1} α_i. The restriction of f : ℝ^d → ℝ to X ⊂ ℝ^d is denoted f|_X.

The following theorem provides the convergence rate for the mSMI neural estimator, uniformly over P_k(M, b).

Theorem 1 (Neural estimation error). For any M, b > 0, we have

$$\sup_{\mu_{XY}\in\mathcal{P}_k(M,b)} \mathbb{E}\Big[\big|\mathrm{SI}_k(X;Y) - \widehat{\mathrm{SI}}{}^{\,n,\ell}_{k}\big|\Big] \le C\,k^{1/2}\big(\ell^{-1/2} + k\,n^{-1/2}\big),$$

where C depends on M, b, k, and the radius of the ambient space, ‖X × Y‖ := sup_{(x,y) ∈ X×Y} ‖(x, y)‖.

The theorem is proven in Supplement A.4 by adapting the error bound from [38, Proposition 2] to hold for I(A⊤X; B⊤Y), uniformly over (A, B) ∈ St(k, d_x) × St(k, d_y). To that end, we show that for any µ_XY ∈ P_k(M, b), the log-density of (A⊤X, B⊤Y) ∼ (p_A, p_B)_♯µ_XY admits an extension (to an open set containing the support) with k + 3 continuous and uniformly bounded derivatives.

Remark 6 (Parametric rate and optimality). Taking ℓ ≍ n, the resulting rate in Theorem 1 is parametric, and hence minimax optimal. This result implicitly assumes that M is known when picking the neural net parameters. This assumption can be relaxed to mere existence of (an unknown) M, resulting in an extra polylog(ℓ) factor multiplying the n^{-1/2} term.
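For concreteness, a small PyTorch sketch of such an ℓ-neuron shallow ReLU critic is shown below (our illustration; the constraint set is enforced by rescaling/clamping the parameters after each optimizer step, which is one convenient way to keep them in the stated ranges, and ℓ ≥ 2 is assumed so that a_ℓ is well-defined):

```python
import math
import torch
import torch.nn as nn

class ShallowReLUCritic(nn.Module):
    """f(z) = sum_i beta_i ReLU(<w_i, z> + b_i) + <w_0, z> + b_0 on z in R^{2k}."""
    def __init__(self, k, ell):
        super().__init__()
        assert ell >= 2
        self.hidden = nn.Linear(2 * k, ell)        # rows are the w_i, biases are the b_i
        self.beta = nn.Linear(ell, 1, bias=False)  # outer weights beta_i
        self.skip = nn.Linear(2 * k, 1)            # affine part <w_0, z> + b_0
        self.a_ell, self.ell = max(math.log(math.log(ell)), 1.0), ell

    def project_parameters(self):
        # call after each optimizer step to stay inside the class F^ell_nn
        with torch.no_grad():
            row = self.hidden.weight.abs().sum(dim=1, keepdim=True).clamp(min=1.0)
            self.hidden.weight.div_(row)                                  # ||w_i||_1 <= 1
            self.hidden.bias.clamp_(-1.0, 1.0)                            # |b_i| <= 1
            bound = self.a_ell / (2 * self.ell)
            self.beta.weight.clamp_(-bound, bound)                        # |beta_i| <= a_ell/(2 ell)
            scale = self.skip.weight.abs().sum().clamp(min=self.a_ell) / self.a_ell
            self.skip.weight.div_(scale)                                  # ||w_0||_1 <= a_ell
            self.skip.bias.clamp_(-self.a_ell, self.a_ell)                # |b_0| <= a_ell

    def forward(self, z):
        return (self.beta(torch.relu(self.hidden(z))) + self.skip(z)).squeeze(-1)
```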
Remark 7 (Comparison to average-SMI). Neural estimation of classic mutual information under the framework of [38] requires the density to have Hölder smoothness s ≥ (d_x + d_y)/2 + 3. For SI_k(X; Y), smoothness of k + 3 is sufficient (even though the ambient dimension is the same), which means it can be estimated over a larger distribution class. Similar gains in terms of smoothness levels were observed for aSMI in [26]. Nevertheless, we note that mSMI is more compatible with neural estimation than average-slicing [25, 26]. The mSMI neural estimator integrates the max-slicing into the neural network architecture and optimizes a single objective. The aSMI neural estimator from [26] requires an additional Monte Carlo integration step to approximate the integral over the Stiefel manifolds. This results in an extra k^{1/2} m^{-1/2} term in the error bound, where m is the number of Monte Carlo samples, introducing a burdensome computational overhead (see Section 5.1).

Remark 8 (Non-ReLU networks). Theorem 1 employs the neural estimation bound from [38], which relies on [52] to control the approximation error. As noted in [38], their bound extends to any other sigmoidal bounded activation with lim_{z→−∞} σ(z) = 0 and lim_{z→∞} σ(z) = 1 by appealing to the approximation bound from [53] instead. Doing so would allow relaxing the smoothness requirement on the extension to r ∈ C^{k+2}_b in (6), but at the expense of scaling the hidden layer parameters as ℓ^{1/2} log ℓ (as opposed to the ReLU-based bound, where the parameter scale is independent of ℓ).

5 Experiments

5.1 Neural Estimation

[Figure 1: Neural estimation performance with ρ = 0.5. Convergence vs. n in the upper panels and average epoch time vs. n in the lower panel.]

We compare the performance of neural estimation methods for mSMI and aSMI on a synthetic dataset of correlated Gaussians. Let X, Z ∼ N(0, I) be i.i.d. and set Y = ρX + √(1 − ρ²) Z, for ρ ∈ (0, 1). The goal is to estimate the k-dimensional mSMI and aSMI between (X, Y). We train our mSMI neural estimator and the aSMI neural estimator from [26, Section 4.2] based on n i.i.d. samples, and compare their performance as a function of n. Both average- and max-sliced algorithms converge at similar rates; however, aSMI has significantly higher time complexity due to the need to train multiple neural estimators (one for each projection direction). This is shown in Figure 1, where we compare the average epoch time for each algorithm against the dataset size. Implementation details are given in Supplement C.

5.2 Independence Testing

In this experiment, we compare mSMI and aSMI for independence testing. We follow the setting from [26, Section 5], generating d-dimensional samples correlated in a latent d′-dimensional subspace and estimating the information measure to determine dependence. We estimate the aSMI with the method from [26], using m = 1000 Monte Carlo samples and the Kozachenko-Leonenko estimator for the mutual information between the projected variables [54]. We then compute AUC-ROC over 100 trials, considering various ambient and projected dimensions.

[Figure 2: ROC-AUC comparison. Dashed and solid lines show results for aSMI [26] and mSMI (ours), respectively. Blue plots correspond to (d, d′) = (10, 4), while red correspond to (d, d′) = (20, 6).]

For mSMI, as we cannot differentiate through the Kozachenko-Leonenko estimator, we resort to gradient-free methods. We employ the LIPO algorithm from [55] with a stopping criterion of 1000 samples. This choice is motivated by the Lipschitzness of (A, B) ↦ I(A⊤X; B⊤Y) w.r.t. the Frobenius norm on St(k, d_x) × St(k, d_y) (cf. [26, Lemma 2]).
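To illustrate how such AUC-ROC numbers can be produced, here is a schematic sketch (ours; the latent-subspace construction below is a simplified stand-in for the data model of [26], and `score_fn` can be any dependence statistic, e.g., the neural mSMI estimate or the Monte Carlo aSMI estimate):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sample_pair(n, d, d_latent, dependent, rho=0.8):
    """Toy data: X and Y share a rho-correlated d_latent-dimensional latent signal."""
    U = np.random.randn(n, d_latent)
    V = rho * U + np.sqrt(1 - rho ** 2) * np.random.randn(n, d_latent) if dependent \
        else np.random.randn(n, d_latent)
    X = np.hstack([U, np.random.randn(n, d - d_latent)])
    Y = np.hstack([V, np.random.randn(n, d - d_latent)])
    return X, Y

def dependence_auc(score_fn, n, d, d_latent, trials=100):
    """AUC-ROC of a dependence statistic for separating dependent from independent pairs."""
    labels, scores = [], []
    for _ in range(trials):
        for dep in (True, False):
            X, Y = sample_pair(n, d, d_latent, dependent=dep)
            labels.append(int(dep))
            scores.append(score_fn(X, Y))   # e.g., estimated mSMI or aSMI
    return roc_auc_score(labels, scores)
```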
Figure 2 shows that when k > 2, mSMI captures independence better than aSMI, particularly in the lower sample regime. We hypothesize that this is due to the fact that the shared signal lies in a low-dimensional subspace, which mSMI can isolate and perhaps better exploit than aSMI, which averages over all subspaces. When k is much smaller than the shared signal dimension d′, mSMI fails to capture all the information and aSMI, which takes all slices into account, may be preferable. Results are averaged over 10 seeds. Further implementation details are in Supplement C.

5.3 Multi-View Representation Learning

| k | Linear CCA | Linear mSMI | MLP DCCA | MLP mSMI |
|---|---|---|---|---|
| 1 | 0.261 ± 0.03 | 0.274 ± 0.02 | 0.284 ± 0.03 | 0.291 ± 0.02 |
| 2 | 0.32 ± 0.02 | 0.346 ± 0.02 | 0.314 ± 0.03 | 0.417 ± 0.02 |
| 4 | 0.42 ± 0.01 | 0.478 ± 0.02 | 0.441 ± 0.04 | 0.546 ± 0.01 |
| 8 | 0.553 ± 0.03 | 0.666 ± 0.01 | 0.645 ± 0.02 | 0.665 ± 0.01 |
| 12 | 0.614 ± 0.02 | 0.751 ± 0.01 | 0.697 ± 0.01 | 0.753 ± 0.01 |
| 16 | 0.673 ± 0.02 | 0.775 ± 0.01 | 0.730 ± 0.02 | 0.779 ± 0.01 |
| 20 | 0.704 ± 0.007 | 0.79 ± 0.006 | 0.774 ± 0.01 | 0.798 ± 0.01 |

Table 1: Downstream classification accuracy from MNIST representations by CCA and mSMI.

We next explore mSMI as an information-theoretic generalization of CCA by examining its utility in multi-view representation learning, a popular CCA application. Without using class labels, we obtain mSMI-based k-dimensional representations of the top and bottom halves of MNIST images (considered as two separate views of the digit image). This is done by computing the k-dimensional mSMI between the views and using the maximizing projected variables as the representations. We compare to similarly obtained CCA-based representations, following the method of [20]. Both linear and nonlinear (parameterized by an MLP neural network) slicing models are optimized with similar initialization and data but different loss functions. Performance is evaluated via downstream 10-class classification accuracy, utilizing the learned top-half representations. Results are averaged over 10 seeds. As shown in Table 1, mSMI outperforms CCA for learning meaningful representations. Interestingly, linear representations learned by mSMI outperform nonlinear representations from the CCA methodology, demonstrating the potency of mSMI. Full implementation details and additional results are given in Supplements C and D, respectively. The aSMI is not considered for this experiment since it does not provide a concrete latent space representation (as it is an averaged quantity). Moreover, if one were to maximize aSMI as an objective to derive such representations, this would simply lead back to computing mSMI; cf. Remark 3.
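For concreteness, a minimal sketch of this evaluation protocol (ours; the projection matrix A is assumed to come from an mSMI optimizer such as the estimator sketched in Section 4, and the logistic-regression probe is a standard choice rather than necessarily the exact downstream classifier used here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def split_views(images):
    """Treat the top and bottom halves of 28x28 digit images as two views."""
    top = images[:, :14, :].reshape(len(images), -1).astype(np.float32)
    bottom = images[:, 14:, :].reshape(len(images), -1).astype(np.float32)
    return top, bottom

def downstream_accuracy(A, top_train, y_train, top_test, y_test):
    """Linear probe on the k-dimensional representation A^T x of the top-half view."""
    z_train, z_test = top_train @ A, top_test @ A
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return clf.score(z_test, y_test)
```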
5.4 Learning Fair Representations

| | N/A | Slice [39] | mSMI (ours), k = 1 | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 |
|---|---|---|---|---|---|---|---|---|---|
| ρ_HGR(Z, Y) | 0.949 | 0.967 | 0.955 | 0.958 | 0.952 | 0.942 | 0.940 | 0.957 | 0.933 |
| ρ_HGR(Z, A) | 0.795 | 0.116 | 0.220 | 0.099 | 0.067 | 0.048 | 0.029 | 0.026 | 0.047 |

Table 2: Learning a fair representation of the US Census Demographic dataset, following the setup of [39]. Results are shown as the median over 10 runs with random data splits. The fairest result is k = 6.

Another common application of dependence measures is learning fair representations of data. We seek a data transformation Z = f(X) that is useful for predicting some outcome or label Y, while being statistically independent of some sensitive attribute A (e.g., gender, race, or religion of the subject). In other words, a fair representation is one that is not affected by the subjects' protected attributes, so that downstream predictions are not biased against protected groups, even if the training data may have been biased. Following the setup of [39], we measure utility and fairness using the HGR maximal correlation ρ_HGR(·, ·) := sup_{h,g} ρ(h(·), g(·)), seeking large ρ_HGR(Z, Y) and small ρ_HGR(Z, A), where h and g are parameterized by neural networks. As solving this minimax problem directly is difficult in practice, following [39] we learn Z by optimizing the bottleneck objective ρ_HGR(Z, Y) − β SI_k(Z; A), where we use a neural estimator for the mSMI and β, k are hyperparameters.

Table 2 shows results on the US Census Demographic dataset extracted from the 2015 American Community Survey, which has 37 features collected over 74,000 census tracts. Here Y is the fraction of children below the poverty line in a tract, and A is the fraction of women in the tract. Following the same experimental setup as [39], the learned Z is 80-dimensional. As [39] showed that their Slice approach significantly outperformed all other baselines on this experiment under a computational constraint⁵, we apply the same computational constraint to our approach and compare only to Slice and to the N/A fairness-agnostic model trained on the bottleneck objective with β = 0. Note that for k > 1, mSMI learns a more fair representation Z (lower ρ_HGR(Z, A)) than Slice, while retaining a utility ρ_HGR(Z, Y) on par with the fairness-agnostic N/A model. We emphasize that due to the reasons outlined in Section 5.3, aSMI is not suitable for the considered task and is thus not included in the comparison. Results on the Adult dataset are shown in Supplement E.

⁵Runtime per iteration not to exceed the runtime of Slice per iteration. We used an NVIDIA V100 GPU.

5.5 Max-Sliced InfoGAN

We present an application of max-slicing to generative modeling under the InfoGAN framework [56]. The InfoGAN learns a disentangled latent space by maximizing the mutual information between a latent code variable and the generated data. We revisit this architecture but replace the classical mutual information regularizer in the InfoGAN objective with mSMI. Our max-sliced InfoGAN is tested on the MNIST and Fashion-MNIST datasets.

[Figure 3: MNIST images generated via the max-sliced InfoGAN. Panels: (a) Fashion-MNIST, (b) MNIST, k = 5, (c) MNIST, k = 10.]

Figure 3 presents the generated samples for several projection dimensions. We consider 3 latent codes (C_1, C_2, C_3), which automatically learn to encode different features of the data. We vary the values of C_1, which is a 10-state discrete variable, along the columns (and consider random values of (C_2, C_3) along the rows). Evidently, C_1 successfully disentangles the 10 class labels and the quality of generated samples is on par with past implementations [56, 26]. We stress that since mSMI relies on low-dimensional projections, the resulting InfoGAN mutual information estimator uses a reduced number of parameters (at the negligible cost of optimizing over linear projections). Additional details are given in Supplement C.
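To indicate how the regularizer swap can look in code, here is a heavily simplified sketch (ours; `msmi_model` is assumed to be a DV-critic-with-projection module such as the MaxSlicedMI sketch from Section 4, and the adversarial loss is computed elsewhere, so this only shows the max-sliced replacement of the InfoGAN mutual information term):

```python
import torch

def regularized_generator_loss(gen_loss, fake_images, codes, msmi_model, lam=1.0):
    """Generator loss with a max-sliced InfoGAN-style regularizer.

    gen_loss:    scalar adversarial generator loss (computed by the usual GAN machinery)
    fake_images: generator outputs G(z, c), shape (batch, ...)
    codes:       latent codes c fed to the generator, shape (batch, code_dim)
    msmi_model:  module exposing dv_objective(x, y), a neural mSMI lower bound
    """
    # encourage high mSMI between the latent code and the generated sample;
    # subtracting the (lower-bounded) mSMI term mirrors the InfoGAN objective
    msmi_term = msmi_model.dv_objective(fake_images.flatten(1), codes)
    return gen_loss - lam * msmi_term
```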
6 Conclusion

This paper proposed mSMI, an information-theoretic generalization of CCA. mSMI captures the full dependence structure between two high-dimensional random variables, while only requiring an optimized linear projection of the data. We showed that mSMI inherits important properties of Shannon's mutual information and that when the random variables are Gaussian, the mSMI optimal solutions coincide with classic k-dimensional CCA. Moving beyond Gaussian distributions, we present a neural estimator of mSMI and establish non-asymptotic error bounds. Through several experiments we demonstrate the utility of mSMI for tasks spanning independence testing, multi-view representation learning, algorithmic fairness, and generative modeling, showing it outperforms popular methodologies.

Possible future directions include an investigation of an operational meaning of mSMI, either in information-theoretic or physical terms, extension of the proposed formal guarantees to the nonlinear setting, and the extension of the neural estimation convergence guarantees to deeper networks. Additionally, mSMI can provide a mathematical foundation for mutual information-based representation learning, a popular area of self-supervised learning [10, 57]. In addition to the above, we plan to develop a rigorous theory for the choice of k, which is currently devised empirically and is treated as a hyperparameter. When the support of the distributions lies in some d′ < d dimensional subspace, the choice of k = d′ is sufficient to recover the classical mutual information, and therefore it characterizes the full dependence structure. Extrapolating from this point, we conjecture that the optimal value of k is related to the intrinsic dimension of the data distribution, even when it is not strictly supported on a low-dimensional subset.

References

[1] Sidney Siegel. Nonparametric statistics. The American Statistician, 11(3):13–19, 1957.
[2] Larry D. Haugh. Checking the independence of two covariance-stationary time series: a univariate residual cross-correlation approach. Journal of the American Statistical Association, 71(354):378–385, 1976.
[3] Thomas B. Berrett and Richard J. Samworth. Nonparametric independence testing via mutual information. Biometrika, 106(3):547–566, 2019.
[4] Atul J. Butte and Isaac S. Kohane. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Biocomputing 2000, pages 418–429. World Scientific, 1999.
[5] Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger. Hierarchical clustering using mutual information. Europhysics Letters, 70(2):278, 2005.
[6] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[7] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In International Conference on Machine Learning, pages 1083–1092. PMLR, 2015.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[9] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
[10] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[11] Karl Pearson. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58(347–352):240–242, 1895.
[12] Claude Elwood Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
[13] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, June 2003.
[14] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.
[15] Hermann O. Hirschfeld. A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4):520–524, 1935.
[16] Hans Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364–379, 1941.
[17] Alfréd Rényi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10(3–4):441–451, 1959.
[18] Shotaro Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.
[19] David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[20] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255. PMLR, 2013.
[21] Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical report, 2005.
[22] Tae-Kyun Kim, Josef Kittler, and Roberto Cipolla. Learning discriminative canonical correlations for object recognition with image sets. In Computer Vision – ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 2006, Proceedings, Part III, pages 251–262. Springer, 2006.
[23] Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1), 2009.
[24] Amichai Painsky, Meir Feder, and Naftali Tishby. Nonlinear canonical correlation analysis: A compressed representation approach. Entropy, 22(2):208, 2020.
[25] Ziv Goldfeld and Kristjan Greenewald. Sliced mutual information: A scalable measure of statistical dependence. Advances in Neural Information Processing Systems, 34:17567–17578, 2021.
[26] Ziv Goldfeld, Kristjan Greenewald, Theshani Nuradha, and Galen Reeves. k-sliced mutual information: A quantitative study of scalability with dimension. Advances in Neural Information Processing Systems, 35:15982–15995, 2022.
[27] Kantilal Varichand Mardia, John T. Kent, and John M. Bibby. Multivariate Analysis. Probability and Mathematical Statistics, 1979.
[28] Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
[29] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pages 531–540. PMLR, 2018.
[30] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A. Alemi, and George Tucker. On variational lower bounds of mutual information. In NeurIPS Workshop on Bayesian Deep Learning, 2018.
[31] Chung Chan, Ali Al-Bashabsheh, Hing Pang Huang, Michael Lim, Da Sun Handason Tam, and Chao Zhao. Neural entropic estimation: A faster path to mutual information estimation. arXiv preprint arXiv:1905.12957, 2019.
[32] Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[33] Jingjing Zhang, Osvaldo Simeone, Zoran Cvetkovic, Eugenio Abela, and Mark Richardson. ITENE: Intrinsic transfer entropy neural estimator. arXiv preprint arXiv:1912.07277, 2019.
[34] Sudipto Mukherjee, Himanshu Asnani, and Sreeram Kannan. CCMI: Classifier based conditional mutual information estimation. In Uncertainty in Artificial Intelligence, pages 1083–1093. PMLR, 2020.
[35] Dor Tsur, Ziv Aharoni, Ziv Goldfeld, and Haim Permuter. Neural estimation and optimization of directed information over continuous spaces. IEEE Transactions on Information Theory, 2023.
[36] Qing Guo, Junya Chen, Dong Wang, Yuewei Yang, Xinwei Deng, Jing Huang, Larry Carin, Fan Li, and Chenyang Tao. Tight mutual information estimation with contrastive Fenchel-Legendre optimization. Advances in Neural Information Processing Systems, 35:28319–28334, 2022.
[37] Sreejith Sreekumar, Zhengxin Zhang, and Ziv Goldfeld. Non-asymptotic performance guarantees for neural estimation of f-divergences. In International Conference on Artificial Intelligence and Statistics, pages 3322–3330. PMLR, 2021.
[38] Sreejith Sreekumar and Ziv Goldfeld. Neural estimation of statistical divergences. Journal of Machine Learning Research, 23(126):1–75, 2022.
[39] Yanzhi Chen, Weihao Sun, Yingzhen Li, and Adrian Weller. Scalable infomin learning. Advances in Neural Information Processing Systems, 35:2226–2239, 2022.
[40] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, 2nd edition, 2006.
[41] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision: Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29 – June 2, 2011, Revised Selected Papers, pages 435–446. Springer, 2012.
[42] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G. Schwing. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10648–10656, 2019.
[43] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.
[44] Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. Projection robust Wasserstein distance and Riemannian optimization. Advances in Neural Information Processing Systems, 33:9383–9397, 2020.
[45] Minhui Huang, Shiqian Ma, and Lifeng Lai. A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning, pages 4446–4455. PMLR, 2021.
[46] Tianyi Lin, Zeyu Zheng, Elynn Chen, Marco Cuturi, and Michael I. Jordan. On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics, pages 262–270. PMLR, 2021.
[47] Sloan Nietert, Ziv Goldfeld, Ritwik Sadhu, and Kengo Kato. Statistical, robustness, and computational guarantees for sliced Wasserstein distances. Advances in Neural Information Processing Systems, 35:28179–28193, 2022.
[48] C. Radhakrishna Rao. Separation theorems for singular values of matrices and their applications in multivariate analysis. Journal of Multivariate Analysis, 9(3):362–377, 1979.
[49] Karl Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[50] Wojciech Marian Czarnecki, Rafal Jozefowicz, and Jacek Tabor. Maximum entropy linear manifold for learning discriminative low-dimensional representation. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, September 7–11, 2015, Proceedings, Part I, pages 52–67. Springer, 2015.
[51] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[52] Jason M. Klusowski and Andrew R. Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with ℓ1 and ℓ0 controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
[53] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[54] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
[55] Cédric Malherbe and Nicolas Vayatis. Global optimization of Lipschitz functions. In International Conference on Machine Learning, pages 2314–2323. PMLR, 2017.
[56] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2016.
[57] Ravid Shwartz-Ziv and Yann LeCun. To compress or not to compress – self-supervised learning and information theory: A review. arXiv preprint arXiv:2304.09355, 2023.
[58] Yury Polyanskiy and Yihong Wu. Information Theory: From Coding to Learning. Cambridge University Press, 2022.
[59] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2019.
[60] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[61] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[62] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
[63] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.