Published in Transactions on Machine Learning Research (09/2022)

Identifiable Deep Generative Models via Sparse Decoding

Gemma E. Moran gm2918@columbia.edu
Columbia University

Dhanya Sridhar
Mila - Quebec AI Institute and Université de Montréal

Yixin Wang
University of Michigan

David M. Blei
Columbia University

Reviewed on OpenReview: https://openreview.net/forum?id=vd0onGWZbE

Abstract

We develop the sparse VAE for unsupervised representation learning on high-dimensional data. The sparse VAE learns a set of latent factors (representations) which summarize the associations in the observed data features. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. As examples: in ratings data, each movie is described by only a few genres; in text data, each word is applicable to only a few topics; in genomics, each gene is active in only a few biological processes. We prove such sparse deep generative models are identifiable: with infinite data, the true model parameters can be learned. (In contrast, most deep generative models are not identifiable.) We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods.

1 Introduction

In many domains, high-dimensional data exhibits variability that can be summarized by low-dimensional latent representations, or factors. These factors can be useful in a variety of tasks including prediction, transfer learning, and domain adaptation (Bengio et al., 2013). To learn such factors, many researchers fit deep generative models (DGMs) (Kingma and Welling, 2014; Rezende et al., 2014). A DGM models each data point with a latent low-dimensional representation (its factors), which is passed through a neural network to generate the observed features.
Given a dataset, a DGM can be fit with a variational autoencoder (VAE), a method that provides both the fitted parameters to the neural network and the approximate posterior factors for each data point. In this paper, we make two related contributions to the study of deep generative models.

First, we develop the sparse DGM. This model is a DGM where each observed feature only depends on a subset of the factors; that is, the mapping from factors to features is sparse. This notion of sparsity often reflects the type of underlying patterns that we anticipate finding in real-world data. In genomics, each gene is associated with a few biological processes; in text, each term is applicable to a few underlying topics; in movie ratings, each movie is associated with a few genres. In detail, the model implements sparsity through a per-feature masking variable. When generating a data point, this mask is applied to the latent representation before producing each feature. In practice, we learn the per-feature mask with the help of the Spike-and-Slab Lasso (SSL) prior (Ročková and George, 2018), a sparsity-inducing prior which enjoys good properties in more classical settings.

Figure 1: In a DGM, a feature x_ij depends on all factors z_ik. A sparse DGM is displayed where features x_i1, x_i2 depend only on z_i1; x_i3 depends on z_i1 and z_i2; and x_i4, x_i5 depend only on z_i2. The features are passed through the same neural network f_θ.

We fit this sparse DGM and SSL prior with amortized variational inference (Gershman and Goodman, 2014; Kingma and Welling, 2014). We call this procedure the sparse VAE.

Second, we study identifiability in sparse DGMs. A model is identifiable if each setting of its parameters produces a unique distribution of the observed data. With an unidentifiable DGM, even with infinite data from the model, we cannot distinguish the true factors.
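To make the masked generative step described above concrete, the following is a minimal NumPy sketch of how a sparse DGM produces data: each feature j applies its own mask w_j to the shared factors before decoding. The decoder `f_theta` here is an arbitrary stand-in nonlinearity, not the paper's trained network, and the particular W mirrors Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, G, K = 100, 5, 2

# Per-feature masks w_j: row j selects which factors feature j may use.
# Mirrors Figure 1: features 0-1 use factor 0, feature 2 uses both,
# features 3-4 use factor 1.
W = np.array([[1., 0.], [1., 0.], [1., 1.], [0., 1.], [0., 1.]])

def f_theta(v, j):
    # stand-in decoder: any nonlinear map; returns the j-th output coordinate
    return np.tanh(v).sum() + 0.1 * j

Z = rng.normal(size=(N, K))                      # z_i ~ N(0, I_K)
sigma = 0.5
X = np.array([[f_theta(W[j] * z, j) for j in range(G)] for z in Z])
X += sigma * rng.normal(size=(N, G))             # per-feature Gaussian noise
```

Because feature 0 is masked to factor 0 only, changing the second factor coordinate leaves its (noise-free) mean unchanged, which is exactly the sparse dependence the model encodes.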
Identifiable factor models are important for tasks such as domain adaptation (Locatello et al., 2020), transfer learning (Dittadi et al., 2021), and fair classification (Locatello et al., 2019a). Most DGMs are not identifiable (Locatello et al., 2019b).

Specifically, we prove that a sparse DGM is identifiable if each latent factor has at least two anchor features. An anchor feature is a dimension of the data which depends on only one latent factor. For example, in text data, an anchor feature is a word that is applicable to only one underlying topic (Arora et al., 2013). Note that the anchor features do not need to be known in advance for our results to hold; we only need to assume they exist. Further, we do not need to assume that the true factors are drawn independently; in many applications, such an independence assumption is unrealistic (Träuble et al., 2021).

The sparse VAE and the anchor assumption are related: the SSL prior will likely yield the kind of sparse mapping that satisfies the anchor assumption. Consequently, if the data comes from an identifiable DGM, i.e., one that satisfies the anchor assumption, then the SSL prior will encourage the sparse VAE to find a good solution more quickly. In this sense, this paper is in the spirit of research in Bayesian statistics, which establishes theoretical conditions on identifiability and then produces priors that favor identifiable distributions.

Empirically, we compare the sparse VAE to existing algorithms for fitting DGMs: the VAE (Kingma and Welling, 2014), β-VAE (Higgins et al., 2017), Variational Sparse Coding (VSC, Tonolini et al., 2020), and OI-VAE (Ainsworth et al., 2018).
We find that: (i) on synthetic data, the sparse VAE recovers ground truth factors, including when the true factors are correlated; (ii) on text and movie-ratings data, the sparse VAE achieves better heldout predictive performance; (iii) on semi-synthetic data, the sparse VAE has better heldout predictive performance on test data that is distributed differently from the training data; (iv) the sparse VAE finds interpretable structure in text, movie-ratings and genomics data.

In summary, the contributions of our paper are as follows:

- We develop a sparse DGM which has Spike-and-Slab Lasso priors (Ročková and George, 2018) and develop an algorithm for fitting this model, the sparse VAE.
- We prove that sparse DGMs are identified under an anchor feature assumption.
- We study the sparse VAE empirically. It outperforms existing methods on synthetic, semi-synthetic, and real datasets of text, ratings, and genomics.

Related Work. The sparse DGM provides a contribution to sparse methods in linear and nonlinear representation learning. The sparse DGM has a sparsity-inducing prior on the factor-to-feature mapping (i.e., the decoder). Similar priors have also been applied in linear factor analysis to induce sparsity in the factor-to-feature mapping (i.e., the loadings matrix); see, for example, Bernardo et al. (2003); Bhattacharya and Dunson (2011); Carvalho et al. (2008); Knowles and Ghahramani (2011); Ročková and George (2016). Beyond linearity, Barello et al. (2018) considers a hybrid linear/nonlinear VAE which uses a sparse linear decoder and a neural network encoder. In nonlinear representation learning, Tonolini et al. (2020) imposes sparsity-inducing priors directly on the latent factors. Instead, the sparse DGM of this paper involves a sparsity-inducing prior on the factor-to-feature mapping.
Rhodes and Lee (2021) indirectly induce sparsity in the factor-to-feature mapping of their DGM via L1 regularization of the mapping's Jacobian. The sparse VAE closely relates to the OI-VAE (Ainsworth et al., 2018), a method for modeling grouped features where the factor-to-group mapping is sparse. However, there are several differences. First, the OI-VAE uses a Group Lasso prior while the sparse VAE has a Spike-and-Slab Lasso prior (SSL, Ročková and George, 2018). The SSL prior mitigates common issues with the Lasso prior, including under-regularization of small coefficients and over-regularization of large coefficients (Ghosh et al., 2016), and a sub-optimal posterior contraction rate (Castillo et al., 2015). Second, we study the identifiability of sparse DGMs, a property that Ainsworth et al. (2018) do not investigate.

This paper also contributes to the literature that uses the anchor feature assumption for identifiability. In linear models, the anchor feature assumption has led to identifiability results for topic models (Arora et al., 2013), non-negative matrix factorization (Donoho and Stodden, 2003), and linear factor analysis (Bing et al., 2020). The key idea is that the anchor assumption removes the rotational invariance of the latent factors. This idea also arises in identifiable factor analysis (Rohe and Zeng, 2020) and independent component analysis, where rotational invariance is removed by assuming that the components are non-Gaussian (Eriksson and Koivunen, 2004).

Finally, this paper is related to methods in identifiable nonlinear representation learning. To achieve identifiability, most methods require weak supervision: Khemakhem et al. (2020); Mita et al. (2021); Zhou and Wei (2020) require auxiliary data; Locatello et al. (2020) requires paired samples; Kügelgen et al. (2021) relies on data augmentation; and Hälvä et al. (2021); Ahuja et al. (2022); Lachapelle et al. (2022) leverage known temporal or spatial dependencies among the samples.
In contrast, the sparse DGM does not need auxiliary information or weak supervision for identifiability; instead, the anchor feature assumption is sufficient. Relatedly, Horan et al. (2021) achieves identifiable representations under a local isometry assumption, meaning a small change in a factor corresponds to a small change in the features. In contrast to this paper, however, Horan et al. (2021) requires the factors to be independent; we do not need such an independence assumption.

2 The Sparse Variational Autoencoder

The observed data is a vector of G features x_i ∈ R^G for i ∈ {1, . . . , N} data points. The goal is to estimate a low-dimensional set of factors z_i ∈ R^K, where K ≪ G. In a standard DGM, the vector z_i is a latent variable that is fed through a neural network to reconstruct the distribution of x_i. The sparse DGM introduces an additional parameter w_j ∈ R^K, j ∈ {1, . . . , G}, a sparse per-feature vector that selects which of the K latent factors are used to produce the j-th feature of x_i. Further, the sparse DGM models the prior covariance between factors with the positive definite matrix Σ_z ∈ R^{K×K}. The sparse DGM is:

w_jk ~ Spike-and-Slab Lasso(λ_0, λ_1, a, b),   k = 1, . . . , K,
z_i ~ N_K(0, Σ_z),   i = 1, . . . , N,
x_ij ~ N((f_θ(w_j ⊙ z_i))_j, σ_j^2),   j = 1, . . . , G,   (1)

where f_θ : R^K → R^G is a neural network with weights θ, σ_j^2 is the per-feature noise variance, and ⊙ denotes element-wise multiplication. The Spike-and-Slab Lasso (Ročková and George, 2018) is a sparsity-inducing prior which we will describe below. (Any sparsity-inducing prior on w_jk would be appropriate for a sparse DGM; we choose SSL because it enjoys theoretical guarantees and fast computation in more classical settings.)

Algorithm 1: The sparse VAE
input: data X; hyperparameters λ_0, λ_1, a, b, Σ_z
output: factor distributions q_ϕ(z|x), selector matrix W, parameters θ
while not converged do
  for j = 1, . . . , G and k = 1, . . . , K do
    E[γ_jk | w_jk, η_k] = [1 + ((1 − η_k)/η_k) · (ψ_0(w_jk)/ψ_1(w_jk))]^{-1}
  for k = 1, . . . , K do
    η_k = (Σ_{j=1}^G E[γ_jk | w_jk, η_k] + a − 1) / (a + b + G − 2)
  Update θ, ϕ, W with SGD according to Eq. 8

In the data generating distribution in Eq. 1, the parameter w_jk controls whether the distribution of x_ij depends on factor k. If w_jk ≠ 0, then z_ik contributes to x_ij, while if w_jk = 0, z_ik cannot contribute to x_ij. If w_j is sparse, then x_ij depends on only a small set of factors. Such a sparse factor-to-feature relationship is shown in Figure 1.

Note that x_ij depends on the same set of factors for every sample i. This dependency only makes sense when each x_ij has a consistent meaning across samples. For example, in genomics data x_ij corresponds to the same gene for all i samples. (This dependency is not reasonable for image data, where pixel j has no consistent meaning across samples.)

Spike-and-Slab Lasso. The parameter w_jk has an SSL prior (Ročková and George, 2018), which is defined as a hierarchical model:

η_k ~ Beta(a, b),   (2)
γ_jk ~ Bernoulli(η_k),   (3)
w_jk ~ γ_jk ψ_1(w_jk) + (1 − γ_jk) ψ_0(w_jk).   (4)

In this prior, ψ_s(w) = (λ_s/2) exp(−λ_s|w|) is the Laplace density and λ_0 ≫ λ_1. The variable w_jk is drawn a priori from either a Laplacian spike parameterized by λ_0, and is consequently negligible, or a Laplacian slab parameterized by λ_1, and can be large. Further, the variable γ_jk is a binary indicator variable that determines whether w_jk is negligible. The Beta-Bernoulli prior on γ_jk allows for uncertainty in determining which factors contribute to each feature. Finally, the parameter η_k ∈ [0, 1] controls the proportion of features that depend on factor k. By allowing η_k to vary, the sparse DGM allows each factor to contribute to different numbers of features. In movie-ratings data, for example, a factor corresponding to the action/adventure genre may be associated with more movies than a more esoteric genre. Notice the prior on η_k helps to zero out extraneous factor dimensions and consequently estimate the number of factors K.
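For concreteness, the SSL hierarchy in Eqs. 2–4 can be sampled directly. The following NumPy sketch (the hyperparameter values are illustrative, not the paper's settings) draws a W matrix and checks the spike/slab separation: a Laplace(λ) draw has mean absolute value 1/λ, so slab entries (λ_1 = 1) are far larger than spike entries (λ_0 = 30) on average.

```python
import numpy as np

rng = np.random.default_rng(1)
G, K = 500, 10
lam0, lam1 = 30.0, 1.0    # spike (large rate) and slab (small rate)
a, b = 1.0, 1.0           # illustrative Beta hyperparameters

eta = rng.beta(a, b, size=K)                  # per-factor inclusion rates
gamma = rng.binomial(1, eta, size=(G, K))     # which entries come from the slab
lam = np.where(gamma == 1, lam1, lam0)        # per-entry Laplace rate
W = rng.laplace(loc=0.0, scale=1.0 / lam)     # Laplace(lam) has scale 1/lam

# slab draws are much larger in magnitude than spike draws on average
print(np.abs(W[gamma == 1]).mean(), np.abs(W[gamma == 0]).mean())
```

A finite-IBP-style choice would instead set a ∝ 1/K and b = 1, as discussed next.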
If the hyperparameters are set to a ∝ 1/K and b = 1, the Beta-Bernoulli prior corresponds to the finite Indian Buffet Process prior (Griffiths and Ghahramani, 2005).

2.1 Inference

In inference, we are given a dataset x_{1:N} and want to calculate the posterior p(θ, W, Γ, η, z_{1:N} | x_{1:N}). We will use approximate maximum a posteriori (MAP) estimation for W, θ and η, and amortized variational inference for z_{1:N}. The full procedure for approximating this posterior is called a sparse VAE.

Figure 2: (a–b) The sparse VAE estimates factors which recover the true generative process; the VAE does not. The observed data is plotted against the estimated factors. The true factor-feature relationship is the red line; the best-fit coefficients for the estimated factors are in the grey boxes (e.g., x_{·1} ≈ 1.0 z_1, x_{·2} ≈ 2.1 z_1, x_{·3} ≈ 4.5 z_1², x_{·4} ≈ 3.8 z_2, x_{·5} ≈ 4.7 z_2, x_{·6} ≈ 6.2 sin(z_2), x_{·7} ≈ 5.2 z_1 z_2). (c) The true W matrix. (d) The sparse VAE estimate of W. (The VAE has no W matrix.)

The exact MAP objective is

Σ_{i=1}^N log p_θ(x_i, W, η) = Σ_{i=1}^N log ∫ p_θ(x_i | z_i, W) p(z_i) dz_i + log ∫ p(W | Γ) p(Γ | η) p(η) dΓ,   (5)

where Γ = {γ_jk}_{j,k=1}^{G,K} denotes the binary indicators. We lower bound Eq. 5 with variational approximations of the posterior of z_{1:N} and the exact posterior of the latent Γ, which comes from the SSL prior. We approximate the posterior of the factors p_θ(z_i | x_i) by the variational family:

q_ϕ(z_i | x_i) = N_K(µ_ϕ(x_i), σ_ϕ^2(x_i)),   (6)

where µ_ϕ(·) and σ_ϕ^2(·) are neural networks with weights ϕ, as in a standard VAE (Kingma and Welling, 2014; Rezende et al., 2014). These approximations yield our final objective function:

L(θ, ϕ, W, η) = Σ_{i=1}^N { E_{q_ϕ(z_i|x_i)}[log p_θ(x_i | z_i, W)] − D_KL(q_ϕ(z_i | x_i) || p(z_i)) }   (7)
  + E_{Γ|W,η} [log p(W | Γ) p(Γ | η) p(η)].   (8)

To optimize L(θ, ϕ, W, η), we alternate between an expectation step and a maximization step.
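The E-step expectation and the η update in Algorithm 1 are both closed-form. Here is a NumPy transcription (a sketch; the λ and Beta values, and the tiny example W, are illustrative choices of ours):

```python
import numpy as np

def laplace_pdf(w, lam):
    # psi_s(w) = (lam_s / 2) * exp(-lam_s * |w|)
    return 0.5 * lam * np.exp(-lam * np.abs(w))

def e_step(W, eta, lam0=30.0, lam1=1.0):
    # E[gamma_jk | w_jk, eta_k]
    #   = [1 + ((1 - eta_k)/eta_k) * psi_0(w_jk)/psi_1(w_jk)]^{-1}
    ratio = laplace_pdf(W, lam0) / laplace_pdf(W, lam1)
    return 1.0 / (1.0 + (1.0 - eta) / eta * ratio)

def update_eta(E_gamma, a, b):
    # eta_k = (sum_j E[gamma_jk] + a - 1) / (a + b + G - 2)
    G = E_gamma.shape[0]
    return (E_gamma.sum(axis=0) + a - 1.0) / (a + b + G - 2.0)

W = np.array([[0.0, 2.0], [0.05, -1.5], [0.01, 0.9]])  # G = 3, K = 2
eta = np.array([0.5, 0.5])
E_gamma = e_step(W, eta)
print(E_gamma)                         # near 0 for small |w|, near 1 for large |w|
print(update_eta(E_gamma, a=1.5, b=1.0))
```

Entries of W near zero get posterior slab probability near zero (they are attributed to the spike), while large entries get probability near one; the η update then shrinks factors that few features use.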
In the E-step, we calculate the complete conditional posterior of Γ. In the M-step, we take gradient steps in model parameters and variational parameters with the help of reparameterization gradients (Kingma and Welling, 2014; Rezende et al., 2014). The sparse VAE procedure is in Algorithm 1.

The sparse VAE has a higher computational burden than a VAE. A VAE typically requires one forward pass through the neural network at each gradient step. In contrast, the sparse VAE requires G forward passes. This additional computation arises because for each of the G features, the sparse VAE uses a different decoder input w_j ⊙ z_i, for j = 1, . . . , G. (The VAE and sparse VAE optimize the same number of neural network parameters, however.)

2.2 Synthetic example

For intuition, consider a synthetic dataset. We generate N = 1000 samples from a factor model, each with G = 7 features and K = 2 latent factors. We take the true factors to be correlated and generate them as z_i ~ N(0, C), where the true factor covariance C has diagonal entries C_kk = 1 and off-diagonal entries C_kk' = 0.6 for k ≠ k'. Given the factors, the data is generated as:

x_i ~ N_G(f(z_i), σ^2 I_G) with σ^2 = 0.5.   (9)

The factor-to-feature mapping f : R^K → R^G is sparse:

f(z_i) = (z_i1, 2z_i1, 3z_i1^2, 4z_i2, 5z_i2, 6 sin(z_i2), 7z_i1 z_i2).

This mapping corresponds to the W matrix in Figure 2c. We compare the sparse VAE with a VAE. For both, we initialize with an overestimate of the true number of factors (K_0 = 5). The sparse VAE finds factors that reflect the true generative model (Figure 2a). Moreover, the sparse VAE correctly finds the true number of factors by setting three columns of W to zero (Figure 2d). In contrast, the VAE recovers the second factor fairly well, but doesn't recover the first factor (Figure 2b). We analyze this synthetic data further in Section 4, as well as text, movie-ratings and genomics data.
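This synthetic benchmark is short to reproduce. The sketch below follows Eq. 9 and the mapping f above; the seed and the function name are our own choices.

```python
import numpy as np

def make_synthetic(n=1000, rho=0.6, sigma2=0.5, seed=0):
    rng = np.random.default_rng(seed)
    C = np.array([[1.0, rho], [rho, 1.0]])       # correlated factor covariance
    Z = rng.multivariate_normal(np.zeros(2), C, size=n)
    z1, z2 = Z[:, 0], Z[:, 1]
    # sparse factor-to-feature mapping f (G = 7 features, K = 2 factors)
    F = np.column_stack([z1, 2 * z1, 3 * z1**2,
                         4 * z2, 5 * z2, 6 * np.sin(z2),
                         7 * z1 * z2])
    X = F + rng.normal(scale=np.sqrt(sigma2), size=F.shape)
    return X, Z

X, Z = make_synthetic()
```

Features 1–3 depend only on z_1, features 4–6 only on z_2, and feature 7 on both, matching the W matrix of Figure 2c.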
3 Identifiability of Sparse DGMs

For an identifiable model, given infinite data, it is possible to learn the true parameters. In both linear and nonlinear factor models, the factors are usually unidentifiable without additional constraints (Yalcin and Amemiya, 2001; Locatello et al., 2019b). This lack of identifiability means that the true factors z_i can have the same likelihood as another solution z̃_i. Even with infinite data, we cannot distinguish between these representations based on the likelihood alone. Further inductive biases or model constraints are necessary to narrow down the set of solutions. In this section, we demonstrate that sparsity is a useful inductive bias for model identifiability.

Consider the sparse deep generative model:

z_i ~ N_K(0, C),   i = 1, . . . , N,
x_ij ~ N((f_θ(w_j ⊙ z_i))_j, σ_j^2),   j = 1, . . . , G,   (10)

where C is the true covariance matrix of the factors. Informally, if the masking variable W is sparse in a particular way, then we can identify the latent factors, up to coordinate-wise transformation. (Note: we only make an assumption about the sparse support of W, not its distribution; the SSL prior in Section 2 is a tool to explore the space of sparse W in the process of fitting the model.) Specifically, we prove that for any solutions z and z̃ with equal likelihood, we have:

z_ik = g_k(z̃_ik),   i = 1, . . . , N,   (11)

for all factors k = 1, . . . , K (up to permutation of k), where g_k : R → R are invertible and differentiable functions. This definition of identifiability is weaker than the canonical definition, for which we would need the two solutions z_i, z̃_i to be exactly equal. We use this weaker notion of identifiability because our goal is to isolate the dimensions of z_i which drive the observed response, and not necessarily to find their exact values; i.e., we want to avoid mixing the dimensions of z_i. For example, we want to be able to distinguish between solutions z and z̃ = Pz, for arbitrary rotation matrices P.

Starting point.
This paper builds on a body of work in identifiable linear unsupervised models (Arora et al., 2013; Bing et al., 2020; Donoho and Stodden, 2003). These works assume the existence of anchor features: features which depend on only one factor. The existence of anchor features helps with identifiability because they leave a detectable imprint in the covariance of the observed data. We extend these results to the nonlinear setting. The key insight is that if two anchor features have the same nonlinear dependence on a factor, then we can apply results from the linear setting. We first formally state the anchor feature assumption.

A1. (Anchor feature assumption) For every factor z_{·k}, there are at least two features x_{·j}, x_{·j'} which depend only on that factor. Moreover, the two features have the same mapping f_j from the factor z_{·k}; that is, for all i = 1, . . . , N:

E[x_ij | z_i] = f_j(z_ik),   E[x_ij' | z_i] = f_j(z_ik).   (12)

We refer to such features as anchor features.

The connection between the anchor feature assumption and the masking variable W is the following: if feature j is an anchor for factor k, then w_jk ≠ 0 and w_jk' = 0 for all k' ≠ k.

Roadmap. We prove the identifiability of the sparse DGM in two parts. First, we prove that the latent factors are identifiable up to coordinate-wise transformation when the anchor features are known (Theorem 1). Second, we prove that the anchor features can be detected from the observed covariance matrix (Theorem 2). Together, Theorem 1 and Theorem 2 give the identifiability of the sparse DGM.

Known anchor features. The first result, Theorem 1, proves that the sparse DGM factors are identifiable if we are given the anchor features. If feature j is an anchor feature for factor k, we set w_jk = 1 and w_jk' = 0 for all k' ≠ k.

Theorem 1. Suppose we have infinite data drawn from the model in Eq. 10 and A1 holds. Assume we are given the rows of W corresponding to the anchor features.
Suppose we have two solutions with equal likelihood, {θ̃, z̃} and {θ̂, ẑ}, with

p_θ̃(x | z̃, W) = p_θ̂(x | ẑ, W).   (13)

Then, the factors z̃ and ẑ are equal up to coordinate-wise transformations: for g_k : R → R and i = 1, . . . , N,

(z̃_i1, . . . , z̃_iK) = (g_1(ẑ_i1), . . . , g_K(ẑ_iK)).   (14)

The proof of Theorem 1 is in Appendix A.1. The proof idea is that if the anchor features are known, then we are given noisy transformations of the coordinates of z. Knowing these noisy transforms allows us to identify z up to coordinate-wise transform.

Unknown anchor features. The second result proves that the anchor features can be determined from an approximation of the covariance matrix of the data X. This result requires additional assumptions.

The next assumption concerns the weights of the neural network f_θ, and needs some additional notation. After applying the chain rule to the first layer, we rewrite the derivative of the mapping f_θ as follows:

∂(f_θ(w_j ⊙ z_i))_j / ∂z_ik = Σ_{d=1}^{D_1} (∂m_θ(u_ij)_j / ∂u_ijd) H^{(1)}_{dk} w_jk,   (15)

where {H^{(1)}_{dk}}_{d,k=1}^{D_1,K} are the weights of the first neural network layer, D_1 is the dimension of the first layer, u_ijd = Σ_{k=1}^K H^{(1)}_{dk} w_jk z_ik is the first layer before the activation function, and m_θ : R^{D_1} → R^G is the rest of the neural network, which takes as input the first layer u_ij = (u_ijd)_{d=1}^{D_1}. Let B_ijk = Σ_{d=1}^{D_1} (∂m_θ(u_ij)_j / ∂u_ijd) H^{(1)}_{dk}.

A2. Suppose j is an anchor feature for factor k. For another feature l, we assume that

Σ_{i=1}^N |B_ijk|^2 ≥ Σ_{i=1}^N Σ_{k'=1}^K |w_{lk'}| |B_ijk| |B_{ilk'}|,   (16)

with equality when l is also an anchor feature for k and inequality otherwise.

This assumption ensures that the covariance between two anchor features is larger than the covariance between an anchor feature and a non-anchor feature. This consequence then allows us to pinpoint the anchor features in the covariance matrix of the data. In Appendix A.3, we show this assumption holds for a neural network with ReLU activations and independent weights.
Finally, we also require an assumption regarding the covariance of the factors; this is the same as assumption (iii) of Bing et al. (2020), which studies linear models.

A3. Denote the true covariance matrix of the factors as Cov(z_i) = C. We assume min{C_kk, C_k'k'} > |C_kk'| for all k ≠ k', k, k' = 1, . . . , K.

Assumption A3 requires that the variance of each factor be greater than its covariance with any other factor. To gain some intuition, consider a dataset of news articles where the latent factors are topics, sports and politics. Assumption A3 would be violated if every article about sports also discussed politics (and vice versa). In this case, the anchor word for sports will be equally as correlated with an anchor word for politics as it is with another anchor word for sports. That is, there is no discernible difference in the data that helps us distinguish the sports and politics factors.

Theorem 2. Assume the model in Eq. 10 with A1–A3 holds. Then, the set of anchor features can be determined uniquely from (1/N) Σ_{i=1}^N Cov(x_i) as N → ∞ (given additional regularity conditions detailed in Appendix A.2).

The proof of Theorem 2 is in Appendix A.2. Proof idea: we adapt the proof technique of Bing et al. (2020). The proof idea of that paper is that for any two features x_{·j} and x_{·j'} that are anchors for the same factor, their covariance Cov(x_{·j}, x_{·j'}) will be greater than their covariance with any other non-anchor feature (under some conditions on the loadings matrix, analogous to A2). Then, the anchor features can be pinpointed from the observed covariance matrix. In the nonlinear case, we can apply a similar strategy: even if the mapping is nonlinear, if the mapping is the same for the anchor features, we can pinpoint the two anchors in the covariance matrix.

Connection to the sparse VAE algorithm. Theorem 2 proves that the existence of anchor features implies a particular structure in the covariance of the data.
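As a toy numerical illustration of this covariance structure (a linear example of our own construction, not the estimator of Bing et al. (2020)): in an anchor-structured linear model whose factor covariance satisfies A3, each anchor feature's largest off-diagonal covariance is with its partner anchor, so the anchors stand out in the empirical covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 20000, 2
C = np.array([[1.0, 0.4], [0.4, 1.0]])          # satisfies A3: 1 > 0.4
Z = rng.multivariate_normal(np.zeros(K), C, size=n)

# loadings: features 0,1 anchor factor 0; features 2,3 anchor factor 1;
# feature 4 is a non-anchor mixing both factors with weights < 1
W = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0.5, 0.5]])
X = Z @ W.T + 0.1 * rng.normal(size=(n, 5))

S = np.cov(X, rowvar=False)
np.fill_diagonal(S, -np.inf)                     # ignore variances
# each anchor's strongest covariance partner is the other anchor for its factor
print(S[0].argmax(), S[2].argmax())
```

Here Cov(x_0, x_1) ≈ Var(z_1) = 1, while Cov(x_0, x_4) ≈ 0.7 and Cov(x_0, x_2) ≈ 0.4, so the anchor pair dominates; the same holds for features 2 and 3.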
Consequently, an appropriate estimation procedure can then pinpoint the anchor features. In the sparse VAE, we do not directly consider the covariance matrix, as doing so would involve laborious hyperparameter tuning to determine which covariance entries are "close enough" to be anchor features. Instead, we estimate W with an SSL prior, motivated by the sparsity in the decoder exhibited by anchor features. That is, it is the sparse decoding of the anchor features that is needed for Theorem 2, not the SSL prior; the SSL prior is a modeling choice that will likely yield the kind of sparse mapping that satisfies the anchor assumption. The SSL prior is not the only sparsity-inducing prior that we could have used; however, we found it to work well empirically. We compare the SSL prior to the horseshoe prior (Carvalho et al., 2009) in Appendix C.4.1.

Consistency. An important implication of identifiability is consistency: if we learn the optimal parameters of the sparse VAE, then we will recover the true factors in the limit of infinite data. Specifically, if (i) the variational family q_ϕ(z|x) contains the true posterior p_θ(z|x) and (ii) the ELBO is maximized with respect to the parameters θ, ϕ and W, then in the limit of infinite data, the sparse VAE will learn the true factors up to coordinate-wise transform and permutation.

4 Experiments

We empirically study the sparse VAE using a mix of synthetic, semi-synthetic and real data.¹ We consider the following questions: 1) How well does the sparse VAE recover ground truth factors when (i) the factors are uncorrelated and (ii) the factors are correlated? 2) How does the heldout predictive performance of the sparse VAE compare to related methods? 3) How does the sparse VAE perform compared to related methods when the correlation between factors is different in the training and test data? 4) Does the sparse VAE find meaningful structure in data?
We compare the sparse VAE to non-negative matrix factorization (NMF) and algorithms for DGMs: the VAE (Kingma and Welling, 2014); β-VAE (Higgins et al., 2017); VSC (Tonolini et al., 2020); and OI-VAE (Ainsworth et al., 2018). None of the underlying DGMs have identifiability guarantees.

¹The sparse VAE implementation may be found at https://github.com/gemoran/sparse-vae-code.

Figure 3: Synthetic data, with factor correlation ρ ∈ {0, 0.2, 0.4, 0.6, 0.8}. (a) Heldout mean squared error (lower is better): the sparse VAE has better heldout predictive performance than the VAE over a range of factor correlation levels. (b) Ground truth factor recovery (higher is better): the sparse VAE recovers the true factors better than the VAE. (β-VAE performed similarly to VAE.) Scores are shown for 25 datasets per correlation setting.

We find that: 1) on synthetic data, the sparse VAE achieves the best recovery of ground truth factors both when the factors are correlated and uncorrelated; 2) the sparse VAE achieves the best predictive performance on both synthetic and real datasets; 3) the sparse VAE achieves the best predictive performance on heldout data that is generated from factors with different correlation to those in the training data; and 4) the sparse VAE finds interpretable factors in both text and movie-ratings data. On a single-cell genomics dataset, the sparse VAE learns factors which predict cell type with high accuracy.

Datasets. We consider the following datasets; further details about the data are given in Appendix C.1.

Synthetic data. We consider again the synthetic dataset generated as in Eq. 9. We vary the correlation between the true factors over ρ = 0, 0.2, 0.4, 0.6, 0.8.

PeerRead (Kang et al., 2018). Dataset of word counts for paper abstracts (N ≈ 10,000, G = 500).

MovieLens (Harper and Konstan, 2015). Dataset of binary user-movie ratings (N = 100,000, G = 300).

Semi-synthetic PeerRead.
A semi-synthetic version of the PeerRead dataset in which the correlations between data-generating factors differ across the test and training data, inducing different training and test distributions.

Zeisel (Zeisel et al., 2015). Dataset of RNA molecule counts in mouse cortex cells (N = 3005, G = 558).

Implementation details. Empirically, we found that setting the prior covariance of the factors to Σ_z = I_K worked well, even when the true generative factors were correlated (Figure 3a). Further implementation details for all methods are in Appendix C.

Recovering the ground truth factors. How well does the sparse VAE recover the factors when the ground truth factors are known? We assess this question with simulated data. We use the DCI disentanglement score to measure the fidelity between estimated factors and true factors (Eastwood and Williams, 2018). The DCI score is an average measure of how relevant each estimated factor is for the true factors. This score also penalizes estimated factors that are equally informative for multiple true factors. We create synthetic datasets with the model given in Eq. 9 and evaluate the DCI metric between estimated and true factors as the true factors are increasingly correlated. As the correlation increases, we expect a standard VAE to conflate the factors while the sparse VAE recovers the two true underlying factors. We see this phenomenon in Figure 3b; the sparse VAE has higher DCI scores than the VAE in all settings. The sparse VAE is robust to factor correlations up to ρ = 0.4, with decreasing performance as ρ is further increased. Here, VSC performs worse than both the sparse VAE and VAE (the MSE scores for VSC were too large for visualization in Figure 3a). The poor performance of VSC is likely due to the true generative factors in Eq. 9 not being sparse; only the mapping between the factors and features is sparse.
Note that we do not compare with NMF in this example; as NMF is a linear method, it cannot reconstruct the nonlinear data generation process using only two latent dimensions.

Heldout reconstruction. We compare the negative log-likelihood on heldout data achieved by the sparse VAE and related methods. All results are averaged over five splits of the data, with standard deviations in parentheses.

Table 1: On movie-ratings and text data, the sparse VAE achieves the lowest heldout negative log-likelihood (NLL). For the β-VAE, we show the result with the best performing β.

(a) MovieLens

| Method | NLL | Recall@5 | NDCG@10 |
|---|---|---|---|
| Sparse VAE | 170.9 (2.1) | 0.98 (0.002) | 0.98 (0.003) |
| VAE | 175.9 (2.4) | 0.97 (0.001) | 0.96 (0.001) |
| β-VAE (β = 2) | 178.2 (2.4) | 0.95 (0.002) | 0.93 (0.002) |
| OI-VAE | 212.1 (2.6) | 0.51 (0.004) | 0.50 (0.004) |
| VSC | 192.2 (2.3) | 0.79 (0.008) | 0.77 (0.009) |
| NMF | 198.6 (2.5) | 0.90 (0.004) | 0.85 (0.004) |

(b) PeerRead

| Method | NLL |
|---|---|
| Sparse VAE | 245.0 (2.0) |
| VAE | 252.6 (1.4) |
| β-VAE (β = 2) | 254.5 (3.0) |
| OI-VAE | 260.6 (1.2) |
| VSC | 252.9 (2.0) |
| NMF | 267.2 (1.0) |

Table 2: On the semi-synthetic PeerRead data, the sparse VAE achieves the lowest heldout negative log-likelihood. We create three settings where the difference between training and test data distributions ranges from high (hardest) to low (easiest).

| Method | High | Medium | Low |
|---|---|---|---|
| Sparse VAE | 52.4 (0.4) | 49.2 (0.3) | 48.6 (0.1) |
| VAE | 54.6 (0.5) | 52.3 (0.2) | 50.8 (0.2) |
| β-VAE (β = 2) | 54.7 (0.3) | 52.1 (0.2) | 51.0 (0.4) |
| OI-VAE | 65.7 (0.3) | 64.4 (0.3) | 64.6 (0.2) |
| VSC | 58.7 (0.6) | 56.1 (0.3) | 55.4 (0.2) |
| NMF | 60.1 (0.02) | 58.0 (0.08) | 57.3 (0.6) |

Figure 3a shows that the sparse VAE performs better than the compared methods on synthetic data. On the MovieLens and PeerRead datasets, Tables 1a and 1b show that the sparse VAE achieves the lowest heldout negative log-likelihood among the compared methods.
For the MovieLens dataset, Table 1a additionally shows that the sparse VAE has the highest heldout Recall@5 and normalized discounted cumulative gain (NDCG@10), which compare the predicted rank of heldout items to their true rank (Liang et al., 2018).

Different training and test distributions. How does the sparse VAE perform when the factors that generate data are correlated differently across the training and test distributions? This type of distribution shift affects many real-world settings. For example, we may estimate document representations from scientific papers where articles about machine learning often also discuss genomics, and then want to use the representations to analyze new collections of papers where articles about machine learning rarely involve genomics. We hypothesize that because the sparse VAE associates each latent factor with only a few features (e.g., words) even when the factors are correlated, it will reconstruct differently distributed test data better than the related methods. We assess this question using the semi-synthetic PeerRead dataset, where the train and test data were generated by factors with different correlations. Table 2 summarizes the results from three settings where the difference between the training and test data distributions ranges from high (hardest) to low (easiest). We report the average negative log-likelihood across five semi-simulated datasets. The sparse VAE performs best, highlighting its ability to estimate models that generalize to test data where the factors have a different distribution.

Interpretability. We now examine whether the sparse VAE finds interpretable structure in the data. For each factor in the MovieLens dataset, we consider the four movies with the largest selector variable values $\hat{w}_{jk}$ (Table 3a).
The sparse VAE finds clear patterns: for example, the top movies in the first factor are all science fiction movies; the second factor contains animated children's movies; the third factor contains three Star Wars movies and an Indiana Jones movie, all blockbuster science fiction movies from the 1980s.

Table 3: On movie-ratings and text data, the sparse VAE finds meaningful topics via the matrix W.

(a) MovieLens
Topic   Movies
A       The Fifth Element; Alien; Gattaca; Aliens
B       A Bug's Life; Monsters, Inc.; Toy Story 3 & 2
C       Star Wars V, IV & VI; Indiana Jones 1

(b) PeerRead
Topic   Words
A       task; policy; planning; heuristic; decision
B       information; feature; complex; sparse; probability
C       given; network; method; bayesian; neural

Table 4: On the Zeisel data, the sparse VAE has the best heldout cell-type prediction accuracy, followed closely by a VAE.

Method     Sparse VAE     VAE (β = 1)    NMF            VSC            OI-VAE
Accuracy   0.95 (0.003)   0.94 (0.016)   0.91 (0.007)   0.89 (0.062)   0.80 (0.019)

We perform the same analysis for the PeerRead dataset (Table 3b). The sparse VAE finds some themes of computer science papers: the first factor contains words related to reinforcement learning; the second factor contains words about information theory; the third factor involves words about neural networks. We note that NMF also finds interpretable topics in both MovieLens and PeerRead (Appendix C.5.1); however, NMF has worse heldout predictive performance than the sparse VAE (Tables 1a and 1b). The sparse VAE retains the interpretability of NMF while incorporating flexible function estimation. Finally, we consider the single-cell genomics dataset of Zeisel et al. (2015). We examine how well the factors learned by each method predict the cell type in a multinomial regression (note that the cell type is not used when learning the factors). The sparse VAE is the best-performing method (Table 4).
Although the VAE is competitive in this setting, the sparse VAE does better than methods such as OI-VAE and VSC that were designed to produce interpretable results.

5 Discussion

We developed the sparse DGM, a model with SSL priors that induce sparsity in the mapping between factors and features. Under an anchor feature assumption, we proved that the sparse DGM has identifiable factors. To fit the sparse DGM, we developed the sparse VAE algorithm. On real and synthetic data, we showed that the sparse VAE performs well: (i) it has good heldout predictive performance; (ii) it generalizes to out-of-distribution data; (iii) it is interpretable. There are a few limitations of this work. First, the sparse DGM is designed for tabular data, where each feature j has a consistent meaning across samples. Image data does not have this property, as a specific pixel has no consistent meaning across samples. Second, the sparse VAE algorithm is more computationally intensive than a standard VAE. This expense arises because at each gradient step, it requires a forward pass through the network for each of the G features. Finally, we did not provide theoretical estimation guarantees for the sparse VAE algorithm. As with other algorithms for fitting DGMs, estimation guarantees for the sparse VAE are difficult to obtain due to the nonconvexity of the objective function. The majority of works that consider identifiability in DGMs also do not provide estimation guarantees for their algorithms, including Locatello et al. (2020) and Khemakhem et al. (2020). We leave such theoretical study of the sparse VAE algorithm to future work.

Acknowledgments

Funding to support this research was provided by the Eric and Wendy Schmidt Center, the Canada-CIFAR AI Chairs program, National Science Foundation grants NSF-CHE-2231174 and NSF IIS 2127869, ONR N00014-17-1-2131, ONR N00014-15-1-2209, the Simons Foundation, the Sloan Foundation, and Open Philanthropy.
References

Ahuja, K., Hartford, J., and Bengio, Y. (2022). Properties from mechanisms: an equivariance perspective on identifiable representation learning. In International Conference on Learning Representations.

Ainsworth, S. K., Foti, N. J., Lee, A. K., and Fox, E. B. (2018). oi-VAE: Output interpretable VAEs for nonlinear group factor analysis. In International Conference on Machine Learning.

Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. (2013). A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning.

Barello, G., Charles, A. S., and Pillow, J. W. (2018). Sparse-coding variational auto-encoders. bioRxiv.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.

Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M. (2003). Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics, 7:733-742.

Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, pages 291-306.

Bing, X., Bunea, F., Ning, Y., and Wegkamp, M. (2020). Adaptive estimation in structured factor models with applications to overlapping clustering. Annals of Statistics, 48(4):2055-2081.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Bolstad, B. (2018). preprocessCore: A collection of pre-processing functions. R package version 1.44.0.

Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193.

Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M.
(2008). High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438-1456.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics.

Castillo, I., Schmidt-Hieber, J., and Van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986-2018.

Dittadi, A., Träuble, F., Locatello, F., Wuthrich, M., Agrawal, V., Winther, O., Bauer, S., and Schölkopf, B. (2021). On the transfer of disentangled representations in realistic settings. In International Conference on Learning Representations.

Donoho, D. L. and Stodden, V. (2003). When does non-negative matrix factorization give a correct decomposition into parts? In Neural Information Processing Systems.

Eastwood, C. and Williams, C. K. (2018). A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations.

Eriksson, J. and Koivunen, V. (2004). Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601-604.

Gershman, S. and Goodman, N. (2014). Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society.

Ghosh, P., Tang, X., Ghosh, M., and Chakrabarti, A. (2016). Asymptotic properties of Bayes risk of a general class of shrinkage priors in multiple hypothesis testing under sparsity. Bayesian Analysis, 11(3):753-796.

Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20(182):1-46.

Griffiths, T. L. and Ghahramani, Z. (2005). Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems.

Hälvä, H., Corff, S.
L., Lehéricy, L., So, J., Zhu, Y., Gassiat, E., and Hyvarinen, A. (2021). Disentangling identifiable features from noisy data with structured nonlinear ICA. In Neural Information Processing Systems.

Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.

Horan, D., Richardson, E., and Weiss, Y. (2021). When is unsupervised disentanglement possible? In Neural Information Processing Systems.

Kang, D., Ammar, W., Dalvi, B., van Zuylen, M., Kohlmeier, S., Hovy, E., and Schwartz, R. (2018). A dataset of peer reviews (PeerRead): Collection, insights and NLP applications. arXiv preprint arXiv:1804.09635.

Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. (2020). Variational autoencoders and nonlinear ICA: A unifying framework. In Artificial Intelligence and Statistics.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Knowles, D. and Ghahramani, Z. (2011). Nonparametric Bayesian sparse factor models with application to gene expression modeling. Annals of Applied Statistics, 5(2B):1534-1552.

Kügelgen, J. V., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. (2021). Self-supervised learning with data augmentations provably isolates content from style. In Neural Information Processing Systems.

Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Priol, R. L., Lacoste, A., and Lacoste-Julien, S. (2022).
Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning.

Liang, D., Krishnan, R. G., Hoffman, M. D., and Jebara, T. (2018). Variational autoencoders for collaborative filtering. In World Wide Web Conference.

Locatello, F., Abbati, G., Rainforth, T., Bauer, S., Schölkopf, B., and Bachem, O. (2019a). On the fairness of disentangled representations. In Neural Information Processing Systems.

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019b). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning.

Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. (2020). Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning.

Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., and Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053-1058.

Mita, G., Filippone, M., and Michiardi, P. (2021). An identifiable double VAE for disentangled representations. In International Conference on Machine Learning.

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.

Rhodes, T. and Lee, D. (2021). Local disentanglement in variational auto-encoders using Jacobian l1 regularization. In Neural Information Processing Systems.

Ročková, V. and George, E. I. (2016). Fast Bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516):1608-1622.

Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431-444.

Rohe, K. and Zeng, M. (2020).
Vintage factor analysis with varimax performs statistical inference. arXiv preprint arXiv:2004.05387.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288.

Tonolini, F., Jensen, B. S., and Murray-Smith, R. (2020). Variational sparse coding. In Uncertainty in Artificial Intelligence.

Träuble, F., Creager, E., Kilbertus, N., Locatello, F., Dittadi, A., Goyal, A., Schölkopf, B., and Bauer, S. (2021). On disentangled representations learned from correlated data. In International Conference on Machine Learning.

Yalcin, I. and Amemiya, Y. (2001). Nonlinear factor analysis as a statistical method. Statistical Science, pages 275-294.

Zeisel, A., Muñoz-Manchado, A. B., Codeluppi, S., Lönnerberg, P., Manno, G. L., Juréus, A., Marques, S., Munguba, H., He, L., Betsholtz, C., Rolny, C., Castelo-Branco, G., Hjerling-Leffler, J., and Linnarsson, S. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226):1138-1142.

Zhou, D. and Wei, X.-X. (2020). Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE. In Neural Information Processing Systems.

A.1 Proof of Theorem 1

We consider the case of known anchor features, where we have at least two anchor features per factor, and the case where the likelihood is Gaussian. We have two solutions, $(\tilde\theta, \tilde z, \tilde\sigma^2)$ and $(\hat\theta, \hat z, \hat\sigma^2)$, with equal likelihood. We show that $\tilde z$ must be a coordinate-wise transform of $\hat z$. Equal likelihood implies that, for each $i$ and $j$,

$$\frac{1}{\tilde\sigma_j^2}\left[x_{ij} - f_{\tilde\theta,j}(\tilde w_j \odot \tilde z_i)\right]^2 = \frac{1}{\hat\sigma_j^2}\left[x_{ij} - f_{\hat\theta,j}(\hat w_j \odot \hat z_i)\right]^2, \tag{17}$$

where $f_{\theta,j}(\cdot)$ denotes $f_\theta(\cdot)_j$, the $j$th output of $f_\theta$. We prove that, given the anchor feature for factor $k$, $z_{ik}$ is identifiable. As all factors are assumed to have anchor features, this holds for all $k \in \{1, \dots, K\}$. Suppose $j$ is an anchor feature for $k$; that is, $w_{jl} = 0$ for all $l \neq k$.
Then, $\tilde w_j \odot \tilde z_i = (0, \dots, 0, \tilde w_{jk} \tilde z_{ik}, 0, \dots, 0)$. With slight abuse of notation, we then write $f_{\tilde\theta,j}(\tilde w_j \odot \tilde z_i) = f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik})$. Now, for Eq. 17 to hold for any value of $\{x_{ij}\}_{i,j=1}^{N,G}$, we must have, for all $j = 1, \dots, G$:

$$\frac{1}{\tilde\sigma_j^2}\left[x_{ij} - f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik})\right]^2 = \frac{1}{\hat\sigma_j^2}\left[x_{ij} - f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik})\right]^2. \tag{18}$$

Expanding both sides as quadratics in $x_{ij}$ gives

$$x_{ij}^2\left(\frac{1}{\tilde\sigma_j^2} - \frac{1}{\hat\sigma_j^2}\right) - 2 x_{ij}\left(\frac{f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik})}{\tilde\sigma_j^2} - \frac{f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik})}{\hat\sigma_j^2}\right) + \left(\frac{f^2_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik})}{\tilde\sigma_j^2} - \frac{f^2_{\hat\theta,j}(\hat w_{jk} \hat z_{ik})}{\hat\sigma_j^2}\right) = 0. \tag{19}$$

For Eq. 19 to hold for all $x_{ij}$, the coefficients must equal zero:

$$\frac{1}{\tilde\sigma_j^2} - \frac{1}{\hat\sigma_j^2} = 0 \;\Longrightarrow\; \tilde\sigma_j^2 = \hat\sigma_j^2, \tag{20}$$

$$\frac{f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik})}{\tilde\sigma_j^2} - \frac{f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik})}{\hat\sigma_j^2} = 0 \;\Longrightarrow\; f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik}) = f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik}). \tag{21}$$

If $f_{\tilde\theta,j}: \mathbb{R} \to \mathbb{R}$ is invertible, then we have

$$\tilde z_{ik} = \tilde w_{jk}^{-1}\, f_{\tilde\theta,j}^{-1}\left(f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik})\right). \tag{22}$$

That is, $z_{ik}$ is identifiable up to coordinate-wise transformation. If $f_{\theta,j}$ is not invertible, then there exist bijective functions $g: \mathbb{R} \to \mathbb{R}$ and $h: \mathbb{R} \to \mathbb{R}$ such that $\hat w_{jk} \hat z_{ik} = g(\tilde w_{jk} \tilde z_{ik})$ and the equality is satisfied:

$$f_{\tilde\theta,j}(\tilde w_{jk} \tilde z_{ik}) = f_{\tilde\theta,j}\left(h(g(\tilde w_{jk} \tilde z_{ik}))\right) \quad \text{(as } f_{\theta,j} \text{ is not invertible, } \tilde w_{jk} \tilde z_{ik} \neq h(g(\tilde w_{jk} \tilde z_{ik}))\text{)} \tag{23}$$

$$= f_{\hat\theta,j}(\hat w_{jk} \hat z_{ik}). \tag{24}$$

As we have $\hat z_{ik} = \hat w_{jk}^{-1}\, g(\tilde w_{jk} \tilde z_{ik})$, $z_{ik}$ is identifiable up to coordinate-wise transformation.

A.2 Proof of Theorem 2

To prove the result, we:

1. Approximate the covariance $\frac{1}{N}\sum_{i=1}^N \mathrm{Cov}(x_i)$ using a first-order Taylor approximation.
2. Show that an anchor feature for factor $k$ always has larger covariance with another anchor feature for $k$ than with other, non-anchor features. Specifically, for any $k \in \{1, \dots, K\}$ and $j$ an anchor for $k$:
$$\sum_{i=1}^N \mathrm{Cov}(x_{ij}, x_{il}) = C_{kk}\, w_{jk}^2 \sum_{i=1}^N B_{ijk}^2 \quad \text{for all } l \text{ that are anchors for } k,$$
$$\sum_{i=1}^N \mathrm{Cov}(x_{ij}, x_{il}) < C_{kk}\, w_{jk}^2 \sum_{i=1}^N B_{ijk}^2 \quad \text{for all } l \text{ that are not anchors for } k.$$
3. Apply Theorem 1 of Bing et al. (2020) to prove that the anchor features can be identified from the covariance of the data.

Regularity condition. Let $\mu(z_i) = \mathbb{E}[x_i | z_i]$.
We assume, for all $z_i$ in a neighborhood of $\bar z_i = \mathbb{E}[z_i]$:

$$\frac{1}{N}\sum_{i=1}^N \mu(z_i) = \frac{1}{N}\sum_{i=1}^N \left[\mu(\bar z_i) + \frac{\partial \mu(z_i)}{\partial z_i}\Big|_{z_i = \bar z_i}(z_i - \bar z_i)\right] + o_P(1), \tag{25}$$

where $o_P(1)$ denotes a random variable converging to 0 in probability.

Step 1. Here is the marginal covariance of the sparse VAE:

$$\mathrm{Var}(x_i) = \mathbb{E}[\mathrm{Var}(x_i | z_i)] + \mathrm{Var}(\mathbb{E}[x_i | z_i]) \tag{26}$$
$$= \Sigma + \mathrm{Var}(\mu(z_i)). \tag{27}$$

Take the first-order Taylor expansion $\mu(z_i) \approx \mu(\bar z_i) + \frac{\partial \mu(z_i)}{\partial z_i}\big|_{z_i = \bar z_i}(z_i - \bar z_i)$, so that

$$\mathrm{Var}(\mu(z_i)) \approx \frac{\partial \mu(z_i)}{\partial z_i}\Big|_{z_i = \bar z_i}\, \mathrm{Var}(z_i)\, \left(\frac{\partial \mu(z_i)}{\partial z_i}\Big|_{z_i = \bar z_i}\right)^\top. \tag{28}$$

Now, $\mu(z_i) = \{f_{\theta,j}(w_j \odot z_i)\}_{j=1}^G$. Then, for any two features $j$ and $l$, writing $\mathrm{Var}(z_i) = C$, we have

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{k', k''} \frac{\partial f_{\theta,j}(w_j \odot z_i)}{\partial z_{ik'}}\Big|_{z_i = \bar z_i}\, \frac{\partial f_{\theta,l}(w_l \odot z_i)}{\partial z_{ik''}}\Big|_{z_i = \bar z_i}\, C_{k'k''}. \tag{29}$$

Step 2. We re-write $f_{\theta,j}(\cdot)$ to expose the first layer of the neural network (before the activation function is applied):

$$f_{\theta,j}(w_j \odot z_i) = m_{\theta,j}(u_{ij}), \tag{30}$$

where $u_{ij} = \{u_{ijd}\}_{d=1}^{D_1}$ with

$$u_{ijd} = \sum_{k=1}^K H^{(1)}_{dk} w_{jk} z_{ik}, \tag{31}$$

and where $\{H^{(1)}_{dk}\}_{d,k=1}^{D_1,K}$ are the weights of the first neural network layer, of dimension $D_1$. Then

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{k',k''=1}^K w_{jk'} w_{lk''} \left(\sum_{d=1}^{D_1} H^{(1)}_{dk'} \frac{\partial m_{\theta,j}}{\partial u_{ijd}}\right)\left(\sum_{d=1}^{D_1} H^{(1)}_{dk''} \frac{\partial m_{\theta,l}}{\partial u_{ild}}\right) C_{k'k''} \tag{32}$$
$$= \sum_{k',k''=1}^K w_{jk'} w_{lk''} B_{ijk'} B_{ilk''} C_{k'k''}, \tag{33}$$

where $B_{ijk} = \sum_{d=1}^{D_1} H^{(1)}_{dk} \frac{\partial m_{\theta,j}}{\partial u_{ijd}}$. Suppose $j$ is an anchor feature for $k$. Then the absolute covariance of feature $j$ and any other feature $l$ is:

$$|\mathrm{Cov}(x_{ij}, x_{il})| = \Big|w_{jk} B_{ijk} \sum_{k'=1}^K w_{lk'} B_{ilk'} C_{kk'}\Big| \tag{34}$$
$$\leq C_{kk}\Big|w_{jk} B_{ijk} \sum_{k'=1}^K w_{lk'} B_{ilk'}\Big| \quad \text{by A3} \tag{35}$$
$$\leq C_{kk}\, w_{jk}^2 B_{ijk}^2 \quad \text{by A2}. \tag{36}$$

Step 3. In Step 2, we proved the equivalent of Lemma 2 of Bing et al. (2020), adapted for the sparse VAE. This allows us to apply Theorem 1 of Bing et al. (2020), which proves that the anchor features can be determined uniquely from the approximate covariance matrix $\frac{1}{N}\sum_{i=1}^N \mathrm{Cov}(x_i)$ as $N \to \infty$.

A.3 Discussion of identifiability assumptions

In this section, we examine the suitability of assumption A2. We do so by showing that A2 holds for a three-layer neural network with ReLU activation functions, independently distributed Gaussian weights, and no bias terms.
Specifically:

$$\mathbb{E}[x_{ij} | w_j, z_i] = H^{(3)}_j \mathrm{ReLU}\left(H^{(2)} \mathrm{ReLU}\left(H^{(1)}(w_j \odot z_i)\right)\right), \tag{37}$$

where $H^{(1)} \in \mathbb{R}^{D \times K}$ are the weights for layer 1 with $D$ hidden units, $H^{(2)} \in \mathbb{R}^{D \times D}$ are the weights for layer 2, and $H^{(3)}_j$ denotes the $j$th row of the weights for layer 3, $H^{(3)} \in \mathbb{R}^{G \times D}$. The derivative with respect to $z_{ik}$ is

$$\frac{\partial\, \mathbb{E}[x_{ij} | w_j, z_i]}{\partial z_{ik}} = \sum_{d_1, d_2 = 1}^{D} H^{(3)}_{j,d_2} H^{(2)}_{d_2,d_1} H^{(1)}_{d_1,k} w_{jk}\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_j \odot z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{jk} z_{ik} > 0\right], \tag{38}$$

where $\mathbb{I}(\cdot)$ is the indicator function. Assume $\{w_{jk}\}_{j,k=1}^{G,K}$ are independent and symmetric around zero. Assume all weights are independent and distributed as $H^{(m)}_{d_1,d_2} \overset{\text{i.i.d.}}{\sim} N(0, \tau)$ for all layers $m = 1, 2, 3$. Taking the first-order Taylor approximation, for any two features $j$ and $l$, we have

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{k',k''} C_{k'k''} \sum_{d_1,d_2,d_1',d_2'} \left(H^{(3)}_{j,d_2} H^{(2)}_{d_2,d_1} H^{(1)}_{d_1,k'} w_{jk'}\right)\left(H^{(3)}_{l,d_2'} H^{(2)}_{d_2',d_1'} H^{(1)}_{d_1',k''} w_{lk''}\right) \tag{39}$$
$$\times\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_j \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{jk} \bar z_{ik} > 0\right] \tag{40}$$
$$\times\, \mathbb{I}\!\left[H^{(2)}_{d_2'} \mathrm{ReLU}(H^{(1)}(w_l \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1',k} w_{lk} \bar z_{ik} > 0\right]. \tag{41}$$

As the neural network weights are independent Gaussians, we keep only the terms where $d_1 = d_1'$ and $d_2 = d_2'$:

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{d_2=1}^D H^{(3)}_{j,d_2} H^{(3)}_{l,d_2} \sum_{d_1=1}^D \left(H^{(2)}_{d_2,d_1}\right)^2 \sum_{k',k''=1}^K C_{k'k''} H^{(1)}_{d_1,k'} H^{(1)}_{d_1,k''} w_{jk'} w_{lk''} \tag{42}$$
$$\times\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_j \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{jk} \bar z_{ik} > 0\right] \tag{43}$$
$$\times\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_l \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{lk} \bar z_{ik} > 0\right]. \tag{44}$$

If $j$ and $l$ are both anchor features for factor $k$, we have:

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{d_2=1}^D \left(H^{(3)}_{j,d_2}\right)^2 \sum_{d_1=1}^D \left(H^{(2)}_{d_2,d_1}\right)^2 C_{kk} \left(H^{(1)}_{d_1,k}\right)^2 w_{jk}^2 \tag{45}$$
$$\times\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_j \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{jk} \bar z_{ik} > 0\right], \tag{46}$$

as $H^{(3)}_{j,d_2} = H^{(3)}_{l,d_2}$ and $w_{jk} = w_{lk}$ by the definition of an anchor feature. Meanwhile, if $j$ and $l$ are not anchor features for the same factor, we have

$$\mathrm{Cov}(x_{ij}, x_{il}) \approx \sum_{d_2=1}^D H^{(3)}_{j,d_2} H^{(3)}_{l,d_2} A_{d_2}, \tag{47}$$

where

$$A_{d_2} = \sum_{d_1=1}^D \left(H^{(2)}_{d_2,d_1}\right)^2 \sum_{k',k''=1}^K C_{k'k''} H^{(1)}_{d_1,k'} H^{(1)}_{d_1,k''} w_{jk'} w_{lk''}\, \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_j \odot \bar z_i)) > 0\right] \tag{48}$$
$$\times\, \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{jk} \bar z_{ik} > 0\right] \mathbb{I}\!\left[H^{(2)}_{d_2} \mathrm{ReLU}(H^{(1)}(w_l \odot \bar z_i)) > 0\right] \mathbb{I}\!\left[\sum_{k=1}^K H^{(1)}_{d_1,k} w_{lk} \bar z_{ik} > 0\right]. \tag{49}$$

As the weights are independent, we have that $H^{(3)}_{j,d_2}$, $H^{(3)}_{l,d_2}$, and $A_{d_2}$ are independent.
For large $D$, we then have

$$\sum_{d_2=1}^D H^{(3)}_{j,d_2} H^{(3)}_{l,d_2} A_{d_2} \approx 0. \tag{50}$$

Hence, for two anchor features $j$ and $l$ and a non-anchor feature $m$, we have $\mathrm{Cov}(x_{ij}, x_{il}) > \mathrm{Cov}(x_{ij}, x_{im})$.

B Inference

B.1 Derivation of ELBO for identity prior covariance

In the main body of the paper, we set the prior covariance to $\Sigma_z = I$, as it performed well empirically and was computationally efficient. For this setting, the objective is derived as

$$\sum_{i=1}^N \log p_\theta(x_i, W, \eta) = \sum_{i=1}^N \log \int p_\theta(x_i | z_i, W)\, p(z_i)\, dz_i + \log \int p(W | \Gamma)\, p(\Gamma | \eta)\, p(\eta)\, d\Gamma \tag{51}$$
$$= \sum_{i=1}^N \log \mathbb{E}_{q_\phi(z_i|x_i)}\left[\frac{p_\theta(x_i | z_i, W)\, p(z_i)}{q_\phi(z_i | x_i)}\right] + \log \mathbb{E}_{\Gamma | W, \eta}\left[p(W | \Gamma)\, p(\Gamma | \eta)\, p(\eta)\right] \tag{52}$$
$$\geq \sum_{i=1}^N \left\{\mathbb{E}_{q_\phi(z_i|x_i)}\left[\log p_\theta(x_i | z_i, W)\right] - D_{KL}\!\left(q_\phi(z_i | x_i)\,\|\,p(z_i)\right)\right\} + \mathbb{E}_{\Gamma | W, \eta}\left[\log\left[p(W | \Gamma)\, p(\Gamma | \eta)\, p(\eta)\right]\right] =: \mathcal{L}(\theta, \phi, W, \eta). \tag{53}$$

The KL divergence between the variational posterior $q_\phi(z_i | x_i)$ and the prior $z_i \sim N_K(0, I)$ is:

$$D_{KL}\!\left(q_\phi(z_i | x_i)\,\|\,p(z_i)\right) = -\frac{1}{2}\sum_{k=1}^K \left[1 + \log \sigma^2_{\phi,k}(x_i) - \mu^2_{\phi,k}(x_i) - \sigma^2_{\phi,k}(x_i)\right]. \tag{54}$$

The final term of the ELBO in Eq. 53 is, up to an additive constant:

$$\mathbb{E}_{\Gamma | W^{(t)}, \eta^{(t)}}\left[\log\left[p(W | \Gamma)\, p(\Gamma | \eta)\, p(\eta)\right]\right] = -\sum_{j=1}^G \sum_{k=1}^K \lambda^*(w^{(t)}_{jk}, \eta^{(t)}_k)\, |w_{jk}| \tag{55}$$
$$+ \sum_{k=1}^K \left[\left(\sum_{j=1}^G \mathbb{E}[\gamma_{jk} | w^{(t)}_{jk}, \eta^{(t)}_k] + a - 1\right)\log \eta_k + \left(\sum_{j=1}^G \left(1 - \mathbb{E}[\gamma_{jk} | w^{(t)}_{jk}, \eta^{(t)}_k]\right) + b - 1\right)\log(1 - \eta_k)\right], \tag{56}$$

where

$$\mathbb{E}[\gamma_{jk} | w_{jk}, \eta_k] = \frac{\eta_k \psi_1(w_{jk})}{\eta_k \psi_1(w_{jk}) + (1 - \eta_k)\, \psi_0(w_{jk})}, \tag{57}$$
$$\lambda^*(w_{jk}, \eta_k) = \lambda_1\, \mathbb{E}[\gamma_{jk} | w_{jk}, \eta_k] + \lambda_0 \left(1 - \mathbb{E}[\gamma_{jk} | w_{jk}, \eta_k]\right). \tag{58}$$

As described in Algorithm 1, we alternate between updating $\mathbb{E}[\gamma_{jk} | w^{(t)}_{jk}, \eta^{(t)}_k]$ and $\eta_k$, and taking gradient steps for $\theta$, $\phi$, and $W$.

B.2 Derivation of ELBO for general prior covariance

In this section, we detail the sparse VAE ELBO for general $\Sigma_z$. We do not study this algorithm in this paper, but we include the derivation for completeness. The change in the sparse VAE ELBO for general $\Sigma_z$ is the KL divergence term:

$$D_{KL}\!\left(q_\phi(z_i | x_i)\,\|\,p(z_i)\right) = -\frac{1}{2}\left\{\sum_{k=1}^K \left[1 + \log \sigma^2_{\phi,k}(x_i)\right] - \mathrm{tr}\!\left(\Sigma_z^{-1}\,\mathrm{diag}(\sigma^2_\phi(x_i))\right) - \mu_\phi(x_i)^\top \Sigma_z^{-1} \mu_\phi(x_i) - \log|\Sigma_z|\right\}. \tag{59}$$

C Details of empirical studies

C.1 Dataset details and preprocessing

PeerRead (Kang et al., 2018).
We discard any word tokens that appear in fewer than about 0.1% of the abstracts or in more than 90% of the abstracts. From the remaining word tokens, we consider the G = 500 most-used tokens as features. The observations are counts of each feature across N ≈ 11,000 abstracts.

Semi-synthetic PeerRead. The training and testing data are distributed differently because we vary the correlations among the latent factors that generate the features across the two datasets. This data was generated as follows: (i) we applied Latent Dirichlet Allocation (LDA; Blei et al., 2003) to the PeerRead dataset using K = 20 components to obtain factors θ ∈ R^{N×K} and topics (loadings) β ∈ R^{K×G}; (ii) we created train-set factors θ^tr by dropping the last K/2 columns of θ and replacing them with columns calculated as logit θ̃_{·k} = logit θ_{·k} + N(0, σ²) for each of the first K/2 latent dimensions k. We fix the test-set factors as θ^te = θ.

MovieLens (Harper and Konstan, 2015). We consider the MovieLens 25M dataset. Following Liang et al. (2018), we code ratings of four or higher as one and the remaining ratings as zero. We retain users who have rated more than 20 movies. We keep the top G = 300 most-rated movies. Finally, we randomly subsample N = 100,000 users.

Zeisel (Zeisel et al., 2015). We first processed the data following Zeisel et al. (2015). Next, we normalized the gene counts using quantile normalization (Bolstad et al., 2003; Bolstad, 2018). Finally, following Lopez et al. (2018), we retained the top G = 558 genes with the greatest variability over cells.

C.2 Sparse VAE settings

For all experiments, the sparse VAE takes the η_k prior hyperparameters to be a = 1, b = G, where G is the number of observed features. For experiments with Gaussian loss, the prior on the error variance is:

$$\sigma_j^2 \sim \text{Inverse-Gamma}(\nu/2, \nu\xi/2). \tag{60}$$

We set ν = 3. The hyperparameter ξ is set to a data-dependent value. Specifically, we first calculate the sample variance of each feature, x_{·j}.
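Assuming the calibration rule stated in this section (ξ is chosen so that the 5% quantile of the per-feature sample variances equals the 90% quantile of the Inverse-Gamma(ν/2, νξ/2) prior), a minimal sketch is as follows; `calibrate_xi` is a hypothetical helper name, not code from the paper:

```python
import numpy as np
from scipy.stats import invgamma

def calibrate_xi(x, nu=3.0, var_quantile=0.05, prior_quantile=0.90):
    """Choose xi so that the var_quantile of the per-feature sample variances
    equals the prior_quantile of the Inverse-Gamma(nu/2, scale=nu*xi/2) prior
    on the noise variance (a sketch of the data-dependent rule in C.2)."""
    s2 = np.var(x, axis=0, ddof=1)           # sample variance of each feature
    target = np.quantile(s2, var_quantile)   # small variances ~ mostly noise
    # The Inverse-Gamma quantile scales linearly in the scale parameter,
    # so solve (nu * xi / 2) * Q = target, where Q is the unit-scale quantile.
    unit_q = invgamma.ppf(prior_quantile, a=nu / 2.0)
    return 2.0 * target / (nu * unit_q)

rng = np.random.default_rng(0)
xi = calibrate_xi(rng.normal(size=(500, 20)))
```

After calibration, the 90% quantile of the prior sits exactly at the 5% quantile of the observed sample variances, so most of the prior mass lies below the bulk of the (over-estimating) sample variances.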
Then, we set ξ such that the 5% quantile of the sample variances equals the 90% quantile of the Inverse-Gamma prior. The idea is that the sample variance is an overestimate of the true noise variance; the smaller sample variances are assumed to correspond to mostly noise and not signal, and these variances are used to calibrate the prior on the noise.

C.3 Experimental settings

For the neural networks, we use multilayer perceptrons with the same number of nodes in each layer. These settings are detailed in Table 6. For stochastic optimization, we use automatic differentiation in PyTorch, optimizing with Adam (Kingma and Ba, 2015) with default settings (beta1 = 0.9, beta2 = 0.999). For LDA, we used Python's sklearn package with default settings. For NMF, we used sklearn with alpha = 1.0. The dataset-specific experimental settings are in Table 6.

C.4 Additional experiments

C.4.1 Synthetic data

We consider additional experiments to answer the following questions:

1. How does the sparse VAE perform when the regularization parameter in the sparsity-inducing prior is decreased?
2. Does the Spike-and-Slab Lasso prior perform similarly to an alternative sparsity-inducing prior, the horseshoe prior (Carvalho et al., 2009)?

Figure 4: Synthetic data. The sparse VAE with less regularization on W (Sparse VAE-Slab) performs slightly worse than the sparse VAE with more regularization on W (Sparse VAE) in terms of (a) heldout mean squared error (lower is better); (b) DCI disentanglement score (higher is better); and (c) F-score of the estimated support of W (higher is better).

Experiment 1.
We find that the sparse VAE with less regularization (λ0 = λ1 = 0.001) performs slightly worse than the sparse VAE with λ0 = 10, λ1 = 1 in terms of MSE (Figure 4a) and DCI (Figure 4b). The sparse VAE with less regularization still performs well, however; this is because the W it learns also tends to be somewhat sparse (Figure 4c). Note that when λ0 = λ1, the Spike-and-Slab Lasso prior is equivalent to a Lasso prior (Tibshirani, 1996).

Experiment 2. We find that the Spike-and-Slab Lasso and the horseshoe perform similarly for lower values of the factor correlation ρ, but the Spike-and-Slab Lasso has lower MSE and higher DCI than the horseshoe for higher values of ρ. To implement the horseshoe prior, we follow Ghosh et al. (2019) and parameterize the half-Cauchy distribution $C^+(0, b)$ as:

$$a \sim C^+(0, b) \iff a^2 | \lambda \sim \text{Inv-Gamma}(1/2, 1/\lambda), \quad \lambda \sim \text{Inv-Gamma}(1/2, 1/b^2). \tag{61}$$

Then, we assign W a horseshoe prior:

$$\lambda_{\text{global}} \sim \text{Inv-Gamma}(1/2, 1/b^2_{\text{global}}), \tag{62}$$
$$\lambda_{\text{local}} \sim \text{Inv-Gamma}(1/2, 1/b^2_{\text{local}}), \tag{63}$$
$$\tau_{jk} \mid \lambda_{\text{local}} \sim \text{Inv-Gamma}(1/2, 1/\lambda_{\text{local}}), \tag{64}$$
$$v_k \mid \lambda_{\text{global}} \sim \text{Inv-Gamma}(1/2, 1/\lambda_{\text{global}}), \tag{65}$$
$$w_{jk} \mid \tau_{jk}, v_k \sim N(0, \tau^2_{jk} v^2_k), \tag{66}$$

where $s \sim \text{Inv-Gamma}(a, b)$ is the inverse-Gamma distribution with density $p(s) \propto s^{-a-1}\exp\{-b/s\}$ for $s > 0$. We set the hyperparameters to $b_{\text{global}} = b_{\text{local}} = 1.0$.

Figure 5: Synthetic data. The sparse VAE with the Spike-and-Slab Lasso prior performs slightly better than the sparse VAE with a horseshoe prior on W in terms of (a) heldout mean squared error (lower is better); (b) DCI disentanglement score (higher is better).
With this horseshoe prior on W, we again optimize W using coordinate ascent and then use closed-form updates for the inverse-Gamma parameters:

$$\tau^{(t)}_{jk} = \frac{1}{2}\left(\frac{(w^{(t)}_{jk})^2}{2 v^{(t-1)}_k} + \frac{1}{\lambda^{(t-1)}_{\text{local}}}\right), \tag{67}$$
$$v^{(t)}_k = \frac{2}{G+3}\left(\sum_{j=1}^G \frac{(w^{(t)}_{jk})^2}{2 \tau^{(t)2}_{jk}} + \frac{1}{\lambda^{(t-1)}_{\text{global}}}\right), \tag{68}$$
$$\lambda^{(t)}_{\text{local}} = \frac{2}{GK+3}\left(\sum_{k,j=1}^{K,G} \frac{1}{\tau^{(t)2}_{jk}} + \frac{1}{b_{\text{local}}}\right), \tag{69}$$
$$\lambda^{(t)}_{\text{global}} = \frac{2}{K+3}\left(\sum_{k=1}^K \frac{1}{v^{(t)2}_k} + \frac{1}{b_{\text{global}}}\right). \tag{70}$$

C.5 MovieLens: varying sample size and number of features

In this section, we consider the performance of the sparse VAE when we vary the sample size and the number of features (movies) in the MovieLens dataset. The settings are the same as noted in Table 6, except for the latent space dimension, which is set to K = 50.

Figure 6: MovieLens. The sparse VAE generally performs as well as or better than a VAE on (a) negative heldout log-likelihood (lower is better) and (b) heldout recall (higher is better) when varying the sample size and number of features (movies). Results are averaged over 5 splits of the data.

Generally, the sparse VAE performs as well as or better than a VAE on negative heldout log-likelihood and heldout recall (Figure 6). For the largest setting (N = 100,000, G = 1,000), the sparse VAE is about 100 times slower than a VAE.

C.5.1 MovieLens and PeerRead: comparisons with NMF

We include additional comparisons of the sparse VAE with non-negative matrix factorization (NMF) on the MovieLens and PeerRead datasets. On the MovieLens data, examples of topics found by NMF include those related to science fiction; animated children's movies; and Star Wars (although Toy Story is included in the Star Wars topic). This performance is similar to the sparse VAE.
On the PeerRead data, examples of topics found by NMF include those related to reinforcement learning; language modeling; and neural network models. In summary, NMF finds interpretable topics, similarly to the sparse VAE. However, the sparse VAE has better heldout predictive performance than NMF. This provides evidence that the sparse VAE retains the interpretability of linear methods while improving model flexibility.

C.6 Compute

GPU: NVIDIA TITAN Xp graphics card (24GB). CPU: Intel E4-2620 v4 processor (64GB).

Table 5: On MovieLens (left) and PeerRead (right), NMF finds meaningful topics.

(MovieLens)
Topic   Movies
A       Alien; Aliens; Terminator 2; Terminator; Blade Runner
B       Shrek 1 & 2; Monsters, Inc.; Finding Nemo; The Incredibles
C       Star Wars IV, V & VI; Toy Story

(PeerRead)
Topic   Words
A       agent; action; policy; state; game
B       word; representation; vector; embeddings; semantic
C       model; inference; neural; latent; structure

Table 6: Settings for each experiment.
Synthetic data:
  # hidden layers: 3; layer dimension: 50; latent space dimension: 5; learning rate: 0.01; epochs: 200; batch size: 100; loss function: Gaussian; Sparse VAE: λ1 = 1, λ0 = 10; β-VAE: β ∈ {2, 4, 6, 8, 16}; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: CPU, 2 mins.

MovieLens:
  # hidden layers: 3; layer dimension: 300; latent space dimension: 30; learning rate: 0.0001; epochs: 100; batch size: 100; loss function: softmax; Sparse VAE: λ1 = 1, λ0 = 10; β-VAE: β ∈ {2, 4, 6, 8, 16}; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: GPU, 1 hour.

PeerRead:
  # hidden layers: 3; layer dimension: 100; latent space dimension: 20; learning rate: 0.01; epochs: 40; batch size: 128; loss function: softmax; Sparse VAE: λ1 = 0.001, λ0 = 5; β-VAE: β ∈ {2, 4, 6, 8, 16}; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: GPU, 20 mins.

Semi-synthetic PeerRead:
  # hidden layers: 3; layer dimension: 50; latent space dimension: 20; learning rate: 0.01; epochs: 50; batch size: 128; loss function: softmax; Sparse VAE: λ1 = 0.01, λ0 = 5; β-VAE: β ∈ {2, 4, 6, 8, 16}; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: GPU, 30 mins.

Zeisel:
  # hidden layers: 3; layer dimension: 100; latent space dimension: 15; learning rate: 0.01; epochs: 100; batch size: 512; loss function: Gaussian; Sparse VAE: λ1 = 1, λ0 = 10; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: CPU, 15 mins.
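As a sanity check on the closed-form KL term used in the ELBO of Appendix B.1 (the KL between a diagonal Gaussian posterior and the standard normal prior, Eq. 54), the following sketch compares it against a Monte Carlo estimate; the function names are ours:

```python
import numpy as np

def kl_diag_gauss_vs_std_normal(mu, sigma2):
    # Closed-form KL( N(mu, diag(sigma2)) || N(0, I) ), as in Eq. 54.
    return -0.5 * np.sum(1.0 + np.log(sigma2) - mu**2 - sigma2)

def kl_monte_carlo(mu, sigma2, n=200_000, seed=0):
    # Monte Carlo estimate: E_q[log q(z) - log p(z)] with z ~ N(mu, diag(sigma2)).
    rng = np.random.default_rng(seed)
    z = mu + np.sqrt(sigma2) * rng.standard_normal((n, len(mu)))
    log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu)**2 / sigma2, axis=1)
    log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
    return np.mean(log_q - log_p)

mu, sigma2 = np.array([0.5, -1.0]), np.array([0.3, 2.0])
# The closed form and the Monte Carlo estimate agree to sampling error.
```

This kind of check is a cheap way to catch sign errors in hand-derived ELBO terms before training.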