Sparse Autoencoders, Again?

Yin Lu 1 Xuening Zhu 1 Tong He 2 David Wipf 2

Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In general, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable, within domains such as images and language model activation patterns. A link to the code is here.

1School of Data Science, Fudan University 2Amazon Web Services. Correspondence to: Yin Lu <yinlu23@m.fudan.edu.cn>, Xuening Zhu <xueningzhu@fudan.edu.cn>, Tong He <htong@amazon.com>, David Wipf <davidwipf@gmail.com>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Autoencoders represent a widely-applicable neural network model for obtaining useful low-dimensional representations of data, particularly when training labels are not available (Berahmand et al., 2024; Goodfellow et al., 2016). The basic architecture is simple: An encoder network first projects each observable data sample to a latent representation (usually low-dimensional), after which a decoder network attempts sample reconstruction based on this latent representation. Sparse autoencoders (SAE) (Ng, 2011; Ranzato et al., 2006; 2007) adopt an additional degree of design flexibility by accommodating a higher-dimensional latent space, but with the inclusion of an attendant regularization factor applied to the encoder output that encourages sparse representations (meaning most dimensions are pushed to zero). This capability facilitates adaptive latent representations reflecting complex underlying data structure, whereby the pattern of nonzeros can vary on a sample-by-sample basis unlike with a regular autoencoder.

Informally speaking, autoencoders (including SAEs) can be viewed as something like the horseshoe crabs of the deep learning kingdom, having existed for decades (or eons) with more or less the same hardy conceptual design (Bourlard & Kamp, 1988; Hinton & Zemel, 1993; Le Cun, 1987; Rumelhart et al., 1986). For example, the SAE models used for learning interpretable representations of present-day large language model (LLM) activation patterns (Bricken et al., 2023; Cunningham et al., 2024; Chaudhary & Geiger, 2024; Gao et al., 2024) are nearly identical in form to original architectures from long ago; likewise for a variety of other modern use cases (Lan et al., 2024; Movva et al., 2025; Pach et al., 2025; Stevens et al., 2025). One notable exception is the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), which substitutes the deterministic latent space of traditional autoencoders with an alternative stochastic representation. Although seemingly distinct from the SAE, it has recently been shown that VAEs can also address sparse autoencoding tasks in certain settings (Dai et al., 2021). Still, VAE models notwithstanding, given the long-lasting and durable usage of SAEs with minimal changes, it is natural to question the extent to which further innovation is possible on top of the original SAE design space, or if any such innovation is even needed at all.

We answer such questions in the affirmative based on the observation that both SAE and VAE designs have notable unresolved, yet distinct, shortcomings. While details will be deferred to Section 2, we summarize here that SAEs generally rely on multiple hyperparameters and potentially-nonconvex regularization factors that complicate both training and the interpretation of the latent representations produced by the encoder. In contrast, VAEs benefit from a hyperparameter-free and smoother loss surface, but are presently ill-equipped to produce adaptive sparse representations like the SAE. With this backdrop in mind, our contributions herein are encapsulated as follows:

Algorithmic: Although both SAE and VAE models can reconstruct data using sparse encoder representations, we carefully detail their disjoint strengths and weaknesses in Section 2; subsequently we propose a hybrid VAE-like approach in Section 3 that delivers the best of both worlds. We refer to our model as VAEase, for an easy-to-apply VAE with adaptive sparse estimation.

Theoretical: In Section 4 we prove that for data residing on a union of low-dimensional manifolds, globally-optimal VAEase solutions recover the underlying manifold structure, adapting to multiple distinct manifolds, unlike existing VAEs. Likewise, we demonstrate simple illustrative settings whereby VAEase maintains fewer local minima relative to an analogous deterministic SAE model.

Empirical: Finally, we present complementary experiments in Section 5 that show VAEase outperforms existing SAE and VAE architectures, as well as related diffusion models, across tasks such as estimating manifold dimensions or maximally sparse representations of large language model activations.

We conclude this section by emphasizing that, although our proposed approach explicitly relies on a variational chassis, the work itself as a whole is not directly related to generative modeling per se, the original purpose for which VAEs were designed. Instead, we are repurposing specific VAE attributes for new sparse autoencoding tasks.

2. Basics of Sparse Autoencoding

In this section we will first present a broad SAE formulation covering typical use cases. Later, we will demonstrate how VAEs can be leveraged for similar purposes, before concluding with a brief contextualization w.r.t. related diffusion models, as have recently been applied to learning manifold dimensions.

2.1. Canonical SAE Formulation

Suppose we collect $d$-dimensional data points $x \in \mathcal{X} \subseteq \mathbb{R}^d$ drawn from some probability measure $\omega$ such that $\int_{\mathcal{X}} \omega(dx) = 1$. Like traditional autoencoders (Bourlard & Kamp, 1988; Le Cun, 1987; Rumelhart et al., 1986), an SAE model is composed of an encoder that maps samples $x$ to a latent representation $z \in \mathbb{R}^\kappa$, and a decoder that subsequently attempts to reconstruct $x$ from $z$. These modules are instantiated as functions given by $\mu_z(\cdot\,; \phi) : \mathcal{X} \to \mathbb{R}^\kappa$ and $\mu_x(\cdot\,; \theta) : \mathbb{R}^\kappa \to \mathbb{R}^d$ respectively, with the goal of learning parameters $\{\theta, \phi\}$ such that $x \approx \mu_x\big(\mu_z(x; \phi); \theta\big)$ for all $x \in \mathcal{X}$.¹

What differentiates SAEs from regular autoencoders (AEs) is that, in addition to seeking an encoder-decoder pairing capable of accurately reconstructing data samples, we also favor latent representations $z = \mu_z(x; \phi)$ that are sparse (Goodfellow et al., 2016; Ng, 2011; Ranzato et al., 2006; 2007). More formally, we ideally seek $\|z\|_0 \ll \kappa$ (or at least approximately so), where $\|\cdot\|_0$ denotes the $\ell_0$-norm that counts the number of nonzero elements in a vector. To operationalize this preference, SAE models are generally trained with a loss in the form

$$\mathcal{L}_{\mathrm{SAE}}(\phi, \theta) := \int_{\mathcal{X}} \Big\{ \big\| x - \mu_x(z; \theta) \big\|_2^2 + \lambda_1 h(z) \Big\}\, \omega(dx) + \lambda_2 \|\theta\|_2^2 \tag{1}$$

or a variant thereof, subject to $z = \mu_z(x; \phi)$. In this expression, $h : \mathbb{R}^\kappa \to \mathbb{R}$ serves as a penalty designed to approximate the $\ell_0$ norm while maintaining differentiability almost everywhere for training purposes. Candidates for $h$ include the $\ell_1$ norm $h(z) = \sum_j |z_j|$ (the tightest convex relaxation to the $\ell_0$ norm) or $h(z) = \sum_j \log(|z_j| + \epsilon)$ where $\epsilon > 0$ is a small constant (Candes et al., 2008; Fazel et al., 2003); there exist many other approximation possibilities as well (Chen et al., 2017; Fan & Li, 2001; Palmer et al., 2006). However, a consequence of all these approximation candidates is the inevitable introduction of a scaling ambiguity, whereby elements of $z$ can be pushed arbitrarily close to zero, while a targeted subset of $\theta$ can be proportionately increased towards infinity to compensate, such that the overall decoder reconstruction is unchanged. The penalty term $\|\theta\|_2^2$ in (1) is included to prevent this sort of degeneracy and ensure that $h$ induces non-trivial sparsity; see Appendix A.1 for further details. A similar effect can be accomplished by adding explicit constraints on the norm of $\theta$ during training (Cunningham et al., 2024) or allowing only the top $k$ values of $z$ to be nonzero (Gao et al., 2024; Makhzani & Frey, 2013), although these approaches are effectively substituting one form of hyperparameter for another. And finally, we note that $\lambda_1$ and $\lambda_2$ in (1) are non-negative trade-off parameters balancing regularization effects and reconstruction error.
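As a rough illustration of objective (1) and of the scaling ambiguity it guards against, the following minimal NumPy sketch evaluates a mini-batch estimate of the loss. The one-layer ReLU encoder, the linear decoder, the function name `sae_loss`, and all numerical values are assumptions made purely for illustration; they are not taken from the paper.

```python
import numpy as np

def sae_loss(x, W_enc, W_dec, lam1=0.1, lam2=1e-3, penalty="l1", eps=1e-2):
    """Mini-batch estimate of objective (1) for a toy one-layer SAE."""
    z = np.maximum(x @ W_enc, 0.0)            # z = mu_z(x; phi): ReLU encoder (assumed)
    x_hat = z @ W_dec                         # mu_x(z; theta): linear decoder (assumed)
    recon = np.sum((x - x_hat) ** 2, axis=1)  # squared reconstruction error per sample
    if penalty == "l1":
        h = np.sum(np.abs(z), axis=1)
    elif penalty == "log":
        h = np.sum(np.log(np.abs(z) + eps), axis=1)
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    # The batch mean stands in for the integral over omega; theta is W_dec here.
    return np.mean(recon + lam1 * h) + lam2 * np.sum(W_dec ** 2)

rng = np.random.default_rng(0)
d, kappa, c = 8, 16, 10.0
x = rng.normal(size=(32, d))
W_enc = rng.normal(size=(d, kappa)) / np.sqrt(d)
W_dec = rng.normal(size=(kappa, d)) / np.sqrt(kappa)

# Scaling ambiguity: shrink z by a factor c and grow the decoder by the same factor.
recon_base = sae_loss(x, W_enc, W_dec, lam1=0.0, lam2=0.0)
recon_scaled = sae_loss(x, W_enc / c, W_dec * c, lam1=0.0, lam2=0.0)
loss_base = sae_loss(x, W_enc, W_dec, lam2=0.0)
loss_scaled = sae_loss(x, W_enc / c, W_dec * c, lam2=0.0)
```

With $\lambda_2 = 0$ the rescaled model reconstructs equally well yet pays a smaller sparsity penalty, so nothing stops the rescaling from continuing indefinitely; a positive $\lambda_2$ charges the enlarged decoder weights and removes this degenerate direction.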

¹We remark that the range of $\mu_x$ need not be $\mathcal{X}$ exactly, and perfect reconstructions will not necessarily be feasible at all times.

Figure 1: Fixed vs. adaptive sparsity (legend: dark colors = adaptive sparse representation; all colors = fixed sparse representation). Depicted are two manifolds $\{\mathcal{M}_1, \mathcal{M}_2\}$ with $\dim[\mathcal{M}_1] = \dim[\mathcal{M}_2] = 2$, and data points $x_1 \in \mathcal{M}_1$ and $x_2 \in \mathcal{M}_2$. Adaptive sparse representations require only two informative/active latent dimensions for either $x_1$ or $x_2$ (the dark colored elements of $\{z_1, z_2\}$ which vary across samples). Meanwhile, fixed sparse representations involve shared active dimensions across all samples (the union of dark and light elements).

SAE Advantages: After suitable training, the SAE formulation from (1) is quite powerful, capable of producing sparse representations of input samples drawn from ω. In particular, a trained SAE can exhibit what is known as adaptive or dynamic sparsity (Ponti & Martins, 2024), whereby the support pattern (meaning location of nonzero elements) of a given latent code z can vary for different input samples x. See Figure 1 for an illustration. This flexibility is critical because, as we will further detail in Section 4, it allows the SAE encoder to differentiate samples that lie on distinct manifold structures (as frequently encountered in practice (Brown et al., 2022)), or expand and contract the number of nonzero latent dimensions in accordance with sample-specific manifold complexity. As a representative example, images of the MNIST handwritten digit "1" may generally require fewer degrees of freedom to reconstruct than images of the observably more complex digit "8".

SAE Disadvantages: On the downside, minimizers of (1) will typically be sensitive to the selection of $\lambda_1$ and $\lambda_2$, as well as the specific curvature of $h$. And for the latter, tighter approximations to the $\ell_0$ norm are relatively more difficult to optimize, with potentially numerous suboptimal local minimizers introduced in tandem with highly nonconvex penalties. Meanwhile, looser approximations such as the $\ell_1$ norm, although easier to minimize, may not generate optimally sparse latent representations. Hence for a given reconstruction error, $\|z\|_0$ may be significantly larger, compromising interpretability. Appendix A.3 provides an illustrative example of this contrast.
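The trade-off between tight and loose $\ell_0$ approximations can be seen in a scalar toy problem. The sketch below is not the example from Appendix A.3; it assumes a single latent $z$, an identity decoder, and hypothetical values $x = 2$, $\lambda_1 = 1$, $\epsilon = 10^{-2}$, and simply counts local minima of the penalized reconstruction loss on a grid.

```python
import numpy as np

def count_local_minima(f_vals):
    """Count strict interior local minima of a function sampled on a 1D grid."""
    inner = f_vals[1:-1]
    return int(np.sum((inner < f_vals[:-2]) & (inner < f_vals[2:])))

x, lam, eps = 2.0, 1.0, 1e-2          # hypothetical data point and penalty settings
z = np.linspace(-1.0, 4.0, 5001)      # grid with step 0.001 that contains z = 0

# Scalar slices of (1) with an identity decoder and no weight penalty.
loss_l1 = (x - z) ** 2 + lam * np.abs(z)
loss_log = (x - z) ** 2 + lam * np.log(np.abs(z) + eps)

n_l1 = count_local_minima(loss_l1)    # convex slice
n_log = count_local_minima(loss_log)  # nonconvex slice
```

On this grid the convex $\ell_1$ slice has a single minimum, whereas the log-penalized slice has two: one at $z = 0$ and one near $z \approx 1.7$. Gradient-based training started on the wrong side of the barrier between them would settle at whichever minimum is closer.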

2.2. Alternative Variational Routes to SAE

Although originally proposed as a form of deep generative model for drawing new synthetic samples that approximately follow the ground-truth measure ω, it has recently been shown that VAEs (Kingma & Welling, 2014; Rezende et al., 2014) can actually be repurposed to handle sparse autoencoding tasks (Dai et al., 2021). Broadly speaking, VAE models can be viewed as stochastic autoencoders, whereby the corresponding encoder and decoder modules now define distributions over z and sample reconstructions respectively. For models of continuous data (our focus), these distributions are commonly expressed as

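$$q_\phi(z|x) = \mathcal{N}\big(z \,\big|\, \mu_z[x; \phi],\ \mathrm{diag}[\sigma_z(x; \phi)]^2\big), \qquad p_\theta(x|z) = \mathcal{N}\big(x \,\big|\, \mu_x[z; \theta],\ \gamma I\big), \tag{2}$$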

where the functions $\mu_z$ and $\mu_x$ now serve as encoder-decoder mean networks, while $\sigma_z$ (upon squaring) defines a corresponding variance network. In accordance with typical use cases, the decoder variance is defined by a single parameter $\gamma > 0$, which may be trained along with other parameters. The corresponding VAE objective is given by

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) := \int_{\mathcal{X}} \Big\{ -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + \mathrm{KL}\big[q_\phi(z|x)\,\|\,p(z)\big] \Big\}\, \omega(dx) \tag{3}$$

with $p(z) = \mathcal{N}(z|0, I)$; the first term above represents a reconstruction factor analogous to the first term in (1), while the latter KL term enacts VAE-specific regularization. Notably, for any encoder distribution $q_\phi(z|x)$, it follows that $\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) \geq -\int_{\mathcal{X}} \log p_\theta(x)\, \omega(dx)$, with equality iff $q_\phi(z|x) = p_\theta(z|x)$ almost surely. Using a simple reparameterization trick, it is possible to directly minimize (3) using SGD (Kingma & Welling, 2014; Rezende et al., 2014).
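For concreteness, objective (3) with the Gaussian choices in (2) can be estimated on a mini-batch as sketched below. The function names, the single-sample Monte Carlo estimate of the expectation, and the toy linear decoder are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)], one value per sample."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma), axis=1)

def vae_loss(x, mu_z, sigma_z, decoder, gamma, rng):
    """Single-sample Monte Carlo estimate of objective (3), averaged over a batch."""
    noise = rng.standard_normal(mu_z.shape)
    z = mu_z + sigma_z * noise                      # reparameterization trick
    d = x.shape[1]
    sq_err = np.sum((x - decoder(z)) ** 2, axis=1)
    # -log N(x | mu_x[z], gamma * I)
    neg_log_lik = 0.5 * sq_err / gamma + 0.5 * d * np.log(2.0 * np.pi * gamma)
    return np.mean(neg_log_lik + kl_to_standard_normal(mu_z, sigma_z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
W = rng.normal(size=(3, 5))                          # hypothetical linear decoder weights
loss = vae_loss(x, np.zeros((4, 3)), np.ones((4, 3)), lambda z: z @ W, gamma=1.0, rng=rng)
```

The KL term vanishes exactly when the encoder outputs the prior (zero mean, unit variance), which is the state of an uninformative latent dimension.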

The stochastic VAE design obscures its ability to behave as a sparse autoencoder, and indeed, the mechanism for pruning unneeded dimensions of the latent space is distinct from canonical SAEs. While we defer details to prior work, the crux of the idea is that superfluous dimensions of $z$ are flooded with uninformative white noise that is subsequently filtered by the decoder to avoid compromising accurate reconstructions. Mathematically, we have that $q_\phi(z_j|x) \approx \mathcal{N}(z_j|0, 1)$ for unneeded dimension $j$, which in the sequel we will refer to as an inactive dimension. This is in contrast with a so-called active dimension $j$ used to support reconstruction, which is characterized by $q_\phi(z_j|x) \approx \mathcal{N}\big(z_j \,\big|\, \mu_z(x; \phi)_j, O(\gamma)\big)$. Because under relevant circumstances the optimal decoder variance will satisfy $\gamma \to 0$ during training, there exists a nearly deterministic pathway for producing arbitrarily accurate VAE reconstructions using these active dimensions (Dai et al., 2021; Zheng et al., 2022).

[Figure 2 panels, left to right: SAE (red and green cells are easy to distinguish by decoder); VAE (red and green cells are hard to distinguish by decoder); VAEase (red and green cells are easy to distinguish by decoder).]

Figure 2: Comparison among methods. On the left, SAE active (green) and inactive (red) dimensions are easily identifiable by the decoder, so accurate reconstructions are not disrupted when the locations of the nonzero elements adaptively shift from input sample to sample. Meanwhile, in the middle figure, VAE active and inactive dimensions look similar for any given sample, which favors the VAE simply learning a fixed sparsity pattern such that the decoder knows which dimensions are active. And finally, on the right, the VAEase-specific reparameterized representation $\tilde{z}$ mimics the original SAE behavior (as $\gamma \to 0$ near global optima), restoring the ability to adapt the sparsity pattern for each input sample. In this way, VAEase is more flexible in steering away from the fixed-sparsity representations favored by the vanilla VAE.

VAE Advantages: Since γ can be learned along with other parameters as part of the overall probabilistic model, in principle sensitive hyperparameter tuning is not necessary. Additionally, if fixed sparsity is sufficient, then VAE models are potentially superior to SAEs because they provably have fewer suboptimal local minima in certain settings (Wipf, 2023). This is a desirable artifact of the stochastic VAE encoder, which is uniquely capable of selectively smoothing away bad minima while simultaneously preserving sharp global optima with maximal sparsity.

VAE Disadvantages: The very stochastic smoothing that contributes to VAE advantages is also its Achilles heel. Specifically, whenever a given latent dimension j is inactive for one or more data samples, this same dimension is more likely to be inactive for all data samples. This is because, unlike with SAE models, the first VAE decoder layer is generally forced to assign a zero-valued weight to incoming inactive dimensions, which are characterized by white noise that would otherwise contribute to high reconstruction errors at the decoder output. Please see Figure 2 for an illustrative example. And once a zero-valued weight is assigned during training, this dimension will remain permanently blocked across all input samples. Hence a de facto fixed sparsity paradigm is implicitly enforced (see Figure 1), in contradistinction to the more flexible adaptive/dynamic sparsity enjoyed by SAE models as discussed previously. This limitation is especially problematic in real-world data comprised of complex manifold structure.

2.3. Diffusion Models

We conclude Section 2 by briefly mentioning less direct connections between SAEs and diffusion models (DMs) (Ho et al., 2020; Sohl-Dickstein et al., 2015). As powerful generative architectures, DMs can be viewed as a hierarchical VAE, but with a parameter-free encoder that assumes κ = d (Kingma et al., 2021; Luo). Because of this, DMs are not capable of directly producing sparse latent representations as SAEs and VAEs intrinsically do; they are therefore not a direct surrogate for sparse autoencoding tasks in general. However, it has nonetheless been recently demonstrated that using clever post-processing strategies, DMs are in fact capable of quantifying the dimensionality of data lying on low-dimensional manifolds (Horvat & Pfister, 2024; Kamkari et al., 2024; Stanczuk et al., 2024; Tempczyk et al., 2022). So in this sense their ancillary capabilities overlap with SAEs/VAEs as we will briefly explore in Section 5.3.

3. Sparse Autoencoding with (VA)Ease

In the previous section we highlighted contrasting strengths of SAE and VAE models: the former enjoys dynamic, input-adaptive sparsity but at the expense of additional hyperparameter and penalty tuning; meanwhile the latter requires no hyperparameters but can only reliably produce fixed sparsity patterns in practice. We now take steps towards achieving the best of both worlds via a novel, yet deceptively simple SAE/VAE hybrid loss formulation.

Our motivation comes from the fact that certain forms of conditional VAE (CVAE) models have in fact been shown to produce adaptive sparsity patterns (Zheng et al., 2022). CVAE models augment the decoder (and possibly the encoder and prior as well) to condition on some additional observable $y$, leading to the revised distribution $p_\theta(x|z, y)$. From here, specially-designed decoder self-attention mechanisms can exploit $y$ to selectively turn on and off latent dimensions in an input-specific manner, at least to the extent that $y$ varies in accordance with $x$.²

²While it may be possible to achieve something similar without conditioning variables, the learning process is considerably more challenging. Please see further discussion in Appendix B.1.

But how to proceed more generally in normal circumstances without access to helpful conditioning variables? Our insight here is two-fold:

1. In VAE models, the σz network contains information about which dimensions are active, but it does not directly modulate reconstructions themselves as µz does.

2. But we cannot just arbitrarily introduce σz to the decoder to achieve the adaptive sparsity we are after; this would directly lead to overfitting (see Appendix B.3). Instead, we would like to specifically leverage σz only as a gating mechanism to clamp the noise introduced by inactive dimensions. In this way the decoder weights themselves need not permanently close any particular inactive dimension across all inputs.

With these considerations in mind, we propose the following easily-applicable modification to a traditional VAE: We replace pθ(x|z) as defined in (2) with

$$p_\theta\big(x \,\big|\, z, \sigma_z[x; \phi]\big) = \mathcal{N}\big(x \,\big|\, \mu_x[\tilde{z}; \theta],\ \gamma I\big), \quad \text{where } \tilde{z} := \big(1 - \sigma_z[x; \phi]\big) \odot z \tag{4}$$

and $\odot$ denotes the Hadamard product. While seemingly a minor change, the consequences are profound in terms of how inactive dimensions are now handled. In particular, if $q_\phi(z_j|x) \approx \mathcal{N}(z_j|0, 1)$ for inactive dimension $j$, then $\tilde{z}_j \approx 0$, and so the decoder now receives an uninformative but still deterministic input that need not disrupt accurate reconstructions even for a decoder input layer with nonzero weights (analogous to the canonical SAE model). Meanwhile, for an active dimension $j$ with $q_\phi(z_j|x) \approx \mathcal{N}\big(z_j \,\big|\, \mu_z(x; \phi)_j, O(\gamma)\big)$, we have $\tilde{z}_j \approx \mu_z(x; \phi)_j$ such that transmissible information for reconstructing $x$ is not compromised. Figure 2 and Appendix B.3 convey the high-level picture of this phenomenon.
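The gating in (4) is simple enough to check numerically. The sketch below assumes a hypothetical encoder output for a single sample with two active and two inactive dimensions; the function name `vaease_decoder_input` and all numerical values are illustrative only.

```python
import numpy as np

def vaease_decoder_input(mu_z, sigma_z, rng):
    """Sample z ~ q(z|x) and apply the gate from (4): z_tilde = (1 - sigma_z) * z."""
    z = mu_z + sigma_z * rng.standard_normal(mu_z.shape)
    return (1.0 - sigma_z) * z, z

rng = np.random.default_rng(0)
gamma = 1e-6
# Hypothetical encoder output: dims 0 and 2 active (variance of order gamma),
# dims 1 and 3 inactive (posterior equal to the N(0, 1) prior).
mu_z = np.array([[1.3, 0.0, -0.7, 0.0]])
sigma_z = np.array([[np.sqrt(gamma), 1.0, np.sqrt(gamma), 1.0]])

z_tilde, z = vaease_decoder_input(mu_z, sigma_z, rng)
```

Here the raw sample `z` carries unit-variance noise in dimensions 1 and 3, while `z_tilde` is exactly zero there and stays within roughly $\sqrt{\gamma}$ of the encoder mean in dimensions 0 and 2, so a decoder with nonzero input weights on every dimension is not disturbed.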

We refer to the proposed VAE modification from above as VAEase, and summarize the newly-enabled capabilities in Table 1. Critically, VAEase preserves the original properties that make VAE models attractive for sparse autoencoding (i.e., no sensitive hyperparameters required, potential for local minima smoothing via a stochastic latent space), while removing the inflexible VAE favoritism towards fixed-sparsity solutions. Admittedly though, the motivations and claims thus far have been primarily supported by intuition at the expense of rigor; however, in the next section we place VAEase on a sound theoretical footing, with a much closer examination of properties relevant to sparse autoencoding.

4. Comparative Analysis

Table 1: Comparison of notable attributes.

| Model | Adaptive sparsity | Local min smoothing | Hyperparameter-free loss |
|---|---|---|---|
| SAE | ✓ | ✗ | ✗ |
| VAE | ✗ | ✓ | ✓ |
| VAEase | ✓ | ✓ | ✓ |

The notion of adaptive sparsity is particularly relevant when the underlying data distribution is spread across a union of low dimensional manifolds. Hence we choose this setting as the starting point of our analysis of VAEase. After formalizing concrete assumptions, we move on to examine the degree to which VAEase global minimizers align with ground truth manifold structure. We then conclude by describing a specific scenario whereby the VAEase objective has provably fewer local minima than its SAE counterpart.

4.1. Preliminaries

We now briefly lay out precursory definitions pertaining to our later data and modeling assumptions.

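Definition 4.1 (Data on a union of manifolds). Let $\mathcal{M} := \{\mathcal{M}_i\}_{i=1}^n$ denote a set of $n$ manifolds $\mathcal{M}_i \subset \mathbb{R}^d$, where $\dim[\mathcal{M}_i] = r_i \geq 0$ and there exists a diffeomorphic mapping $\psi_i : \mathcal{M}_i \to \mathbb{R}^{r_i}$. There is at most one manifold with $r_i = 0$. We then assume data $x \in \mathcal{X} \subseteq \mathcal{M}$ with probability measure $\omega$ such that $\alpha_i := \int_{\mathcal{M}_i} \omega(dx) > 0$ and $\sum_{i=1}^n \alpha_i = 1$. Finally, we stipulate that $\int_{S_i} \omega(dx) = 0$ for any sub-manifold $S_i \subset \mathcal{M}_i$ satisfying $\dim[S_i] < r_i$.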

Brown et al. (2022) motivate the practical relevance of data following the spirit of Definition 4.1, along with the additional assumption that there is no overlap among manifolds. In this sense our definition is actually more flexible and the analyses which follow later in Section 4.2 hold with or without overlapping manifolds.³ We next turn to a precise definition of the VAEase model class we seek to analyze.

Definition 4.2 (Lipschitz VAEase objective). Assume that $\mu_z$ and $\mu_x$ are Lipschitz continuous functions of $x \in \mathcal{X}$ and $z \in \mathbb{R}^\kappa$ respectively. We then denote $\mathcal{L}_{\mathrm{VAEase}}(\phi, \theta)$ as the resulting objective from (3), but with $p_\theta(x|z)$ replaced by $p_\theta\big(x \,\big|\, z, \sigma_z[x; \phi]\big)$ as defined in (4).

Lipschitz continuity assumptions for neural network models are quite common (Neyshabur et al., 2017; Bartlett et al., 2017) and are naturally achievable in practice by deep ReLU networks with bounded model weights. For our purposes, this assumption is merely included for mild technical reasons related to the adopted proof technique.

³We are able to accommodate this greater flexibility because our results do not ultimately depend on the dimensionality of the tangent spaces at the boundaries of the support sets. Instead we rely on the optimization of integrals over the entire extent of each manifold such that individual manifold dimensions remain identifiable.

And finally, before proceeding to our main results we introduce two quantities (with origins stemming back to (Zheng et al., 2022)) that will collectively serve as a useful lens for differentiating VAE and VAEase capabilities.

Definition 4.3 (Active dimension). For any fixed $\gamma$, let $\{\theta_\gamma, \phi_\gamma\}$ denote a minimizer of loss $\mathcal{L}_{\mathrm{VAEase}}(\phi, \theta)$ given by Definition 4.2. We then specify that a latent dimension $j \in \{1, \ldots, \kappa\}$ is active at sample $x \in \mathcal{X}$ if $\sigma_z^2(x; \phi_\gamma)_j = O(\gamma)$ as $\gamma \to 0$. Additionally, we adopt $A(x)$ to denote the set of active dimensions associated with $x$.

The remaining so-called inactive dimensions will satisfy $\sigma_z^2(x; \phi_\gamma)_j = 1 - O(\gamma)$; as shown in Appendix E.1, other possibilities will not occur almost surely.

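Definition 4.4 (Reconstruction error). Borrowing notation and definitions from above, we define the VAEase reconstruction error as

$$R := \int_{\mathcal{X}} \mathbb{E}_{q_{\phi_\gamma}(z|x)}\Big[ \big\| x - \mu_x(\tilde{z}; \theta_\gamma) \big\|_2^2 \Big]\, \omega(dx). \tag{5}$$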

We remark that equivalent active dimension and reconstruction error quantities can be defined for regular VAE models as well, the primary difference being LVAE(ϕ, θ) replacing LVAEase(ϕ, θ) as the optimization objective used for obtaining {θγ, ϕγ}.

4.2. Properties of Global Solutions

We now analyze properties of VAEase global minimizers that are particularly relevant to sparse autoencoding.

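Theorem 4.5. Assume data adhering to Definition 4.1 with $\sum_i r_i \leq \kappa$. Then as $\gamma \to 0$, global solutions to $\mathcal{L}_{\mathrm{VAEase}}(\phi, \theta)$ achieve $R = o(1)$ and

$$\int_{\mathcal{M}_i} \mathbb{I}\big[\, |A(x)| = r_i \,\big]\, \omega(dx) = \alpha_i + o(1) \quad \forall i. \tag{6}$$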

Note that because $\alpha_i = \int_{\mathcal{M}_i} \omega(dx)$ and $\sum_i \alpha_i = 1$, (6) indicates that within each data manifold $\mathcal{M}_i$, the VAEase is almost surely using a number of active dimensions that matches $\dim[\mathcal{M}_i]$. Overall then, this result establishes fundamental capabilities of the VAEase model: It can achieve arbitrarily good reconstruction error, all while relying on ideal active dimension counts that perfectly align with each constituent data manifold. Meanwhile, a corresponding VAE cannot achieve the equivalent of Theorem 4.5.
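An empirical analog of the quantity in (6) can be computed from encoder variances on held-out samples. In the sketch below, the fixed variance threshold of 0.5 (Definition 4.3 is asymptotic and prescribes no finite cutoff), the function names, and the toy variance values are all assumptions for illustration; the returned fractions are normalized by $\alpha_i$, so a value of 1 corresponds to the right-hand side of (6).

```python
import numpy as np

def active_counts(sigma_sq, threshold=0.5):
    """|A(x)| per sample: number of latent dims whose encoder variance is below threshold."""
    return np.sum(sigma_sq < threshold, axis=1)

def fraction_matching(sigma_sq, labels, dims, threshold=0.5):
    """Share of samples on each manifold i whose active count equals r_i."""
    counts = active_counts(sigma_sq, threshold)
    return {i: float(np.mean(counts[labels == i] == r)) for i, r in dims.items()}

# Toy encoder variances for three samples and kappa = 4 latent dimensions.
sigma_sq = np.array([[1e-4, 1e-4, 1.0, 1.0],    # manifold 0, two active dims
                     [1e-4, 1.0, 1.0, 1.0],     # manifold 1, one active dim
                     [1.0, 1e-4, 1e-4, 1.0]])   # manifold 1, two active dims (mismatch)
labels = np.array([0, 1, 1])
result = fraction_matching(sigma_sq, labels, dims={0: 2, 1: 1})
```

In this toy input every sample on manifold 0 uses the ground-truth number of active dimensions, while only half of the samples on manifold 1 do.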

Corollary 4.6. Consider the VAE loss $\mathcal{L}_{\mathrm{VAE}}(\phi, \theta)$ defined as in (3) using arbitrary Lipschitz continuous encoder/decoder networks (i.e., equivalent to those adopted by the VAEase). Then there exist datasets adhering to Definition 4.1 such that global minimizers of $\mathcal{L}_{\mathrm{VAE}}(\phi, \theta)$ in the limit $\gamma \to 0$ satisfy

$$\int_{\mathcal{M}_i} \mathbb{I}\big[\, |A(x)| \neq r_i \,\big]\, \omega(dx) = \Theta(1) \tag{7}$$

for one or more constituent manifolds $\mathcal{M}_i$.

Critically, this result indicates that VAE models are not always capable of adapting the sparsity of their latent representations to ground-truth structure involving multiple manifolds. Instead, for reasons detailed in Appendix B.1, VAE models tend to favor solutions whereby

$$\int_{\mathcal{M}} \mathbb{I}\big[\, |A(x)| = r \,\big]\, \omega(dx) \approx 1, \tag{8}$$

assuming $r := \sum_i r_i \leq \min(\kappa, d)$ (we prove this claim for linear special cases in Appendix E.5). As such the VAE can readily estimate the aggregated manifold dimension, generally with a fixed set of active dimensions, but not necessarily that of the finer-grained individual manifolds themselves as our proposed VAEase model can. Complementary experiments in Section 5 will confirm this performance disparity.

4.3. Local Minima Smoothing

Unlike the VAE, with an appropriate selection of $h$ and $\{\lambda_1, \lambda_2\}$, the SAE model from (1) can in principle mimic the behavior of the VAEase global optima explored in Section 4.2. As a deterministic counterpart, this involves producing adaptive sparse solutions with negligible reconstruction error and a minimal number of nonzero latent dimensions aligned with each manifold. For example, we can accomplish this goal with $h(z) = \|z\|_0$ and suitable choices for $\{\lambda_1, \lambda_2\}$ (see Appendix A.2). But SAE adaptive sparsity comes at a cost: In addition to the need for tuning $\{\lambda_1, \lambda_2\}$ and choosing some workable/smooth approximation to the $\ell_0$ norm that allows for SGD training, the SAE loss potentially has a much larger constellation of sub-optimal local minimizers. While difficult to explicitly quantify under broad conditions, in this section we introduce a simplified regime whereby an explicit, exponential gap in local minima counts can be established.

Theorem 4.7. Assume $\kappa = d$ and decoder $\mu_x(\tilde{z}; \theta) = U\, \mathrm{diag}[w]\, \tilde{z}$, where $U \in \mathbb{R}^{d \times d}$ is a fixed/known orthonormal matrix and $w \in \mathbb{R}^d$ are learnable parameters. Moreover, assume arbitrary encoder functions $\mu_z$ and $\sigma_z$. Then the VAEase loss defined w.r.t. an arbitrary input data point $x$ will always have a unique minimum (local or global).⁴

⁴Technically speaking, there are actually multiple equivalent global minima because the overall loss is invariant to randomly flipping the signs of $z$ and $w$. But this inconsequential ambiguity is shared by all SAE and VAE models alike.

We now contrast with an analogous SAE model, assuming $h \in \mathcal{H}$, where $\mathcal{H}$ denotes the set of functions that are symmetric about zero, concave and non-decreasing in $|z|$. Commonly-used sparsity-promoting penalties typically fall into this category (Chen et al., 2017), which includes the $\ell_1$ and $\ell_0$ norms, log-based penalties, and many other choices.

Figure 3: Local minima smoothing example. On the left, a 1D slice of the SAE loss (details in Appendix D.1) exhibits two distinct minima (red dots). Meanwhile, on the right, VAEase under identical conditions has a single (global) minimum because of the latent smoothing afforded by $\sigma_z$.

Corollary 4.8. Under the same conditions as Theorem 4.7, for any $\{\lambda_1, \lambda_2\} \in \mathbb{R}^2_+$, there exist data samples $x$ such that the SAE loss from (1) has $2^d$ distinct local minimizers.⁵

Please see Figure 3 for a 1D visualization of the contrast between Theorem 4.7 and Corollary 4.8. Collectively, these results suggest that the SAE loss is likely more prone to sub-optimal local minima than VAEase. The latter uniquely benefits from a selective smoothing effect introduced by the stochastic latent space, whereby desirable sparse global solutions can be preserved (Theorem 4.5) while at least some distracting local minimizers are smoothed away (Theorem 4.7), ultimately because of the expectation over $q_\phi(z|x)$. This phenomenon has been previously explored within the narrow context of fixed-sparsity VAE models (Wipf, 2023), but never generalized to the much more challenging adaptive sparsity scenario as we have done.

5. Empirical Validation

In this section we empirically compare VAEase against salient baselines on both synthetic and real-world datasets. Full experimental details, including data generation processes and model settings, are deferred to Appendix D.

5.1. Experiments with Known Ground-Truth

Dataset generation. Synthetic datasets are ideal in the sense that we have access to ground-truth manifold parameters for explicitly quantifying model representations. To this end, we extend setups inspired by Zheng et al. (2022) and Stanczuk et al. (2024) to generate two datasets composed of multiple manifolds. The first involves 500,000 points randomly assigned to $n = 3$ linear subspaces of 4 dimensions each, embedded in an ambient space with $d = 40$. The second, more challenging case involves $n = 4$ nonlinear manifolds with dimensions {5, 5, 10, 10} embedded in $d = 100$ dimensional space; 200,000 samples are generated per manifold by passing random noise through different MLP networks.

⁵If an arbitrary loss $L(u)$ is non-increasing in $u \in \mathbb{R}$ for all $u > u_0$ for some constant $u_0$, we also treat $u = +\infty$ as a local minimum, as there is no trajectory leading back to another local minimum without increasing the loss. But this inconsequential caveat only applies when $h$ is bounded from above.
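A dataset in the spirit of the first (linear) case could be generated along the following lines. The exact generation process is deferred to Appendix D, so the random orthonormal bases, the standard Gaussian coefficients, the uniform manifold assignment, and the function name `sample_union_of_subspaces` are assumptions made here for illustration; the example call also uses far fewer points than the 500,000 reported above.

```python
import numpy as np

def sample_union_of_subspaces(n_points, n_subspaces=3, r=4, d=40, seed=0):
    """Draw points from a union of random r-dimensional linear subspaces of R^d."""
    rng = np.random.default_rng(seed)
    # One random orthonormal basis (d x r) per subspace via reduced QR.
    bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(n_subspaces)]
    labels = rng.integers(n_subspaces, size=n_points)    # random manifold assignment
    coeffs = rng.standard_normal((n_points, r))          # intrinsic coordinates
    x = np.empty((n_points, d))
    for i, basis in enumerate(bases):
        mask = labels == i
        x[mask] = coeffs[mask] @ basis.T                 # embed into the ambient space
    return x, labels, bases

x, labels, bases = sample_union_of_subspaces(1000)
rank_single = np.linalg.matrix_rank(x[labels == 0])      # dimension of one subspace
rank_union = np.linalg.matrix_rank(x)                    # aggregated dimension
```

The two ranks make the distinction drawn in Section 4.2 concrete: each individual subspace has dimension 4, whereas the aggregated span that a fixed-sparsity model would tend to recover has dimension 12.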

Baseline models. To compare against VAEase, we train an equivalent VAE baseline as well as three SAE variants differentiated by the penalty function adopted for h in (1). Following prior work, we select SAE-ℓ1 for the standard ℓ1 norm as used by Cunningham et al. (2024), SAE-log whereby h(z) = log(|z| + ϵ) (where ϵ > 0 is a small constant) as motivated in Candes et al. (2008), and SAE-Tk for the strategy of simply choosing the top/largest k elements of z (per data sample) as advocated by Makhzani & Frey (2013) and Gao et al. (2024). For the linear subspace dataset we adopt a linear decoder for all models (as is also used in recent applications to modeling the intermediate activation layers of LLMs (Chaudhary & Geiger, 2024)); for the more complex nonlinear dataset we instead stack multiple MLP decoder layers. Note that for both datasets we set k, as required by SAE-Tk, to the size of the largest individual manifold.
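The three SAE sparsity mechanisms listed above can be sketched as follows. This is a minimal illustration only: the function names, the default ϵ, summing the penalty over latent dimensions, and selecting the top k entries by magnitude are assumptions rather than details confirmed by the text.

```python
import numpy as np

def l1_penalty(z):
    """SAE-l1: sum of absolute latent values per sample."""
    return np.abs(z).sum(axis=-1)

def log_penalty(z, eps=1e-3):
    """SAE-log: sum of log(|z| + eps) per sample (eps is an assumed default)."""
    return np.log(np.abs(z) + eps).sum(axis=-1)

def top_k(z, k):
    """SAE-Tk: keep the k largest-magnitude entries of each row, zero the rest."""
    idx = np.argsort(-np.abs(z), axis=-1)[..., :k]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

z = np.array([[0.5, -2.0, 0.1, 3.0]])
z_top2 = top_k(z, 2)        # only the entries -2.0 and 3.0 survive
l1_value = l1_penalty(z)    # 0.5 + 2.0 + 0.1 + 3.0
```

Note that `top_k` returns exactly k nonzero entries for every sample, whereas the two penalties leave the number of near-zero entries free to vary with the input.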

Evaluations. After training, we evaluate models as follows. For each manifold, we sample new test points not seen during training and feed them into model encoders. We then use the average σz (for VAE models) or average absolute µz (for SAE models), with the averages taken within each manifold, to estimate the active dimensions required. Ideally these averages should correspond with the ground-truth manifold dimension. For all experiments throughout this paper, we separate active from inactive dimensions using a simple, intuitive variance criterion: a partitioning threshold is chosen such that the average of the variances above the threshold (active dims) and below it (inactive dims) is minimized. This concept, adapted from Xia et al. (2015), avoids dependency on subjective thresholds that vary from experiment to experiment. Results are displayed in Table 2, where VAEase clearly outperforms the other baselines. Note that all models produced negligible reconstruction errors ($10^{-3}$ or less). We also remark that SAE-Tk, even when k = 4 on the linear dataset, does not correctly estimate individual manifold dimensions; this is because the model fails to find a consistent set of active dimensions within each manifold.
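One possible reading of the variance criterion above is sketched below. The function name, the unweighted average of the two group variances, and the midpoint threshold are assumptions; the exact procedure adapted from Xia et al. (2015) may differ.

```python
import numpy as np

def split_active_inactive(scores, larger_is_active=True):
    """Pick the split of sorted per-dimension scores that minimizes the
    average of the two within-group variances; return (n_active, threshold)."""
    s = np.sort(np.asarray(scores, dtype=float))
    best_cost, best_split = np.inf, 1
    for i in range(1, len(s)):                    # groups are s[:i] and s[i:]
        cost = 0.5 * (s[:i].var() + s[i:].var())
        if cost < best_cost:
            best_cost, best_split = cost, i
    threshold = 0.5 * (s[best_split - 1] + s[best_split])
    n_high = len(s) - best_split
    return (n_high if larger_is_active else best_split), threshold

# VAE-style scores: average sigma_z per latent dimension (small means active).
avg_sigma = [0.01, 0.02, 0.015, 0.98, 1.0, 0.99, 1.01, 0.97]
n_active, threshold = split_active_inactive(avg_sigma, larger_is_active=False)
```

For SAE models the same routine would be applied to average absolute µz values with `larger_is_active=True`, optionally after a log transform.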

Table 2: Synthetic dataset results. AD# denotes active dims. for M#. Closest result to ground-truth (GT) in bold.

| Model | Linear AD1 | Linear AD2 | Linear AD3 | Nonlinear AD1 | Nonlinear AD2 | Nonlinear AD3 | Nonlinear AD4 |
|---|---|---|---|---|---|---|---|
| GT | 4 | 4 | 4 | 5 | 5 | 10 | 10 |
| SAE-ℓ1 | 10 | 11 | 8 | 20 | 26 | 39 | 29 |
| SAE-log | 8 | 8 | 8 | 15 | 28 | 39 | 31 |
| SAE-Tk | 10 | 10 | 7 | 10 | 12 | 18 | 19 |
| VAE | 13 | 13 | 13 | 17 | 21 | 20 | 19 |
| VAEase | **5** | **4** | **4** | **5** | **5** | **11** | **11** |

5.2. Real-World Data

Image data. As we turn to real-world datasets, it is no longer feasible to expect near-zero reconstruction errors or known ground-truth manifold structure. In this underdetermined regime, achieving a target reconstruction error using the fewest active dimensions is the core objective. To this end, we tune SAE hyperparameters such that the reconstruction error roughly matches VAE counterparts and then examine the required number of active dimensions. We apply this approach to MNIST (Deng, 2012) and Fashion MNIST (Xiao et al., 2017) image datasets, both of which are likely to have rich underlying manifold structure. To better equip models for handling image data, we use ReLU convolutional layers for all encoder and decoder networks. Results are shown in Table 3, where we now report the number of active dimensions averaged across all test samples (AD) as well as the reconstruction error (RE) for reference. VAEase achieves strong performance on average as the only method to rank within the top two for all categories. The closest competitor is SAE-Tk; however, this approach relies explicitly on access to a data-dependent estimate of the manifold dimension applied through manual selection of k. In contrast, VAEase operates effectively without any such knowledge, a capability that is particularly valuable when obtaining such estimates is costly, or when there is wide variance across data samples.

To further explore VAEase attributes, Figure 4 provides a more fine-grained visualization of Fashion MNIST results. From the displayed curves we observe that VAEase maintains a stable reconstruction error as more and more latent dimensions are sequentially masked out. Results on other datasets (not shown) are similar. As an additional side point, Appendix C demonstrates how VAEase active dimensions correlate with MNIST/Fashion image labels.

Language model activation data. We further explore orthogonal comparisons using a popular large language model (LLM) related use case. Specifically, SAE models have recently been applied to learning sparse reconstructions of the activation patterns produced within intermediate LLM layers for interpretability reasons (Bricken et al., 2023; Cunningham et al., 2024; Chaudhary & Geiger, 2024). In these scenarios there is no explicit ground-truth; however, as before the objective is to minimize the active dimensions subject to a nominal reconstruction error, the idea being that fewer dimensions are inherently more explainable.6 To this end, using code from Cunningham et al. (2024) we pass the Pile-10k dataset (Gao et al., 2020) through a Transformer model from the Pythia Scaling Suite (Biderman et al., 2023) and extract 2,098,176 samples from an intermediate activation layer, each with a dimensionality of d = 512. Since the dataset is drawn from highly diverse domains including literature, academic papers, and web text, it is reasonable to assume complex underlying data structure. We then trained VAEase and baseline models on these data and recorded results in Table 3, where we observe that VAEase simultaneously achieves the lowest RE and AD values. We also note that the VAE performance is particularly poor w.r.t. AD. Moreover, in Appendix B.2 we show that even adding an arbitrarily complex self-attention layer to the otherwise linear decoder cannot fix the core VAE difficulty of identifying high-sparsity codes for LLM activations.

Text embedding data. As a final complementary application domain for evaluating VAEase, we consider text embedding data as formed at the output layer of certain LLM architectures. We choose such data because recent work has shown that sparse representations thereof (as obtainable from SAEs) encode rich semantic and causal relationships that are readily applicable to various forms of hypothesis generation (Movva et al., 2025) as well as interpretability studies (O'Neill et al., 2024). Following Movva et al. (2025), we generate text embeddings for the Yelp dataset using the ModernBERT Embed model (Nussbaum et al., 2024), resulting in 200,000 samples with dimensionality d = 768. We then compare the sparse representations learned from VAEase training against baseline models (SAE-based and VAE) with results shown in Table 3. Similar to the LLM intermediate activation data from before, VAEase achieves significantly lower values for both AD and RE as desired. Meanwhile the VAE exhibits the poorest performance, likely attributable to uniformity in active dimensions across samples. We also remark that, for all LLM/text-related experiments, SAE-Tk did not yield a smaller RE than VAEase even when afforded a higher AD. We hypothesize that fixing the number of active dimensions (k = 30) inherently constrains the model's capacity to adequately adapt across samples of varying complexity, hindering RE on average.

6 Pursuing LLM explanations is well outside the scope of our work. Instead, we are merely adopting this use case to showcase differences between VAEase and existing approaches on complex, high-dimensional data. Hence we focus on comparing active dimension counts, not downstream interpretations thereof.

Table 3: Real-world dataset results. RE = reconstruction error; AD = average # active dimensions in test set. Note that SAE-Tk is given a dataset-dependent estimate of the manifold dimension through k, while VAEase requires no such access to prior knowledge and yet still maintains the best overall performance. Top result for each case in bold; second best is underlined.

| Model | MNIST RE | MNIST AD | Fashion RE | Fashion AD | LLM-Act RE | LLM-Act AD | Text-Emb RE | Text-Emb AD |
|---|---|---|---|---|---|---|---|---|
| SAE-ℓ1 | 10.4 | 19.1 | 8.69 | 19.0 | 46.2 | 38.2 | 0.224 | 61.9 |
| SAE-log | 9.84 | 18.5 | 8.95 | 17.8 | 45.1 | 58.2 | 0.230 | 56.0 |
| SAE-Tk | 10.1 | **16.0** | **8.11** | 15.0 | 45.1 | 30.0 | 0.217 | 30.0 |
| VAE | 9.81 | 18.0 | 8.42 | 18.1 | 48.0 | 87.9 | 0.277 | 90.0 |
| VAEase | **9.71** | 16.2 | 8.39 | **12.4** | **39.5** | **22.5** | **0.187** | **16.7** |

Figure 4: Reconstruction error (RE) curve as latent dimensions are sequentially masked out on Fashion MNIST data. VAEase maintains a stable/flat RE with the fewest (nonmasked) active dimensions, followed by a sharper rise once key informative dimensions start being removed. Note that SAE-Tk is not included as its active dimensions are effectively fixed by the selection of k.

5.3. Comparison with Diffusion

As mentioned in Section 2.3, through suitable post-processing DMs are in principle capable of estimating manifold dimensions (although not a corresponding mapping onto low-dimensional representations as SAE/VAE-based models do). Ideally, we would like to compare VAEase with diffusion in a real-world context, but we cannot rely on joint RE/AD evaluations as before, since DMs produce no analogous encoder-decoder bottleneck reconstruction. Moreover, with real-world data we do not generally have access to ground-truth manifold dimensions that would otherwise facilitate head-to-head comparisons. In fact, even on MNIST there is considerable ambiguity, with wide-ranging estimates from 10 to over 100 (Pope et al., 2021; Tempczyk et al., 2022; Horvat & Pfister, 2024; Stanczuk et al., 2024). So instead we develop the following setup that is novel in the context of manifold dimension estimation.

First, we train a GAN (Goodfellow et al., 2014) with 16-dimensional input noise on MNIST data. Once trained, we then generate a pseudo-MNIST dataset by sampling this model. By construction, these new images will necessarily lie on a manifold with no more than 16 dimensions, i.e., we now have access to a ground-truth upper bound. From here, we train VAEase and a DM on 500,000 samples and compare their respective estimates. For the DM we used two recently-proposed post-processing techniques to estimate the manifold dimension: NB (Stanczuk et al., 2024) and FLIPD (Kamkari et al., 2024). Results are shown in Table 4, where VAEase provides a far more accurate estimate, being much closer to any possible value in the feasible range between 1 and 16 than either diffusion approach. We also remark that, because the pseudo-MNIST samples are visually quite similar to the originals (see Appendix D.4), it would appear that r = 16 (or possibly somewhat smaller) is actually a reasonable estimate for the true MNIST manifold dimension. This lends additional credence to the VAEase estimate of 16.2 from Table 3 (the same is not true for SAE-Tk since the model was explicitly provided with k = 16).

Table 4: Pseudo-MNIST comparison with diffusion models.

| GT | DM (NB) | DM (FLIPD) | VAEase (AD) |
|---|---|---|---|
| 16 | 105.94 | 169.81 | 14.93 |
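The upper bound used in the pseudo-MNIST construction can be illustrated with a toy generator. The sketch below uses an untrained two-layer tanh network as a hypothetical stand-in for the GAN (the layer sizes, weights, and names are assumptions): because the generator maps 16 input coordinates to image space, its Jacobian has at most 16 columns, so the local dimension of the generated set cannot exceed 16.

```python
import numpy as np

rng = np.random.default_rng(0)
r, hidden, d = 16, 64, 784            # noise dim, hidden width (assumed), 28x28 image size
W1 = rng.standard_normal((hidden, r)) / np.sqrt(r)
W2 = rng.standard_normal((d, hidden)) / np.sqrt(hidden)

def generator(z):
    """Toy smooth generator standing in for a trained GAN."""
    return W2 @ np.tanh(W1 @ z)

def jacobian(z):
    """Analytic Jacobian: W2 diag(1 - tanh^2(W1 z)) W1, of shape (d, r)."""
    return W2 @ ((1.0 - np.tanh(W1 @ z) ** 2)[:, None] * W1)

z = rng.standard_normal(r)
sample = generator(z)
local_dim = int(np.linalg.matrix_rank(jacobian(z)))   # can never exceed r = 16
```

The rank may fall below 16 at particular inputs, which is why 16 serves only as an upper bound on the ground-truth dimension.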

6. Conclusions

Despite a lengthy and stable existence, we have argued that there is indeed still justification for revisiting the original design space of sparse autoencoders, again. In doing so we proposed a simple alternative architecture called VAEase, based on a novel retooling of the VAE encoder variance as an adaptive sparsity selector. This refinement allows an otherwise rigid, fixed-sparsity VAE latent space to vary support patterns across input samples like a canonical SAE, even while retaining a hyperparameter-free energy along with stochastic smoothing of locally-minimizing solutions. Moreover, no additional model parameters, beyond those of a vanilla VAE, are required for implementing VAEase. Given the contemporary resurgence of creative SAE use cases, particularly as related to understanding properties of vision and language models (Lan et al., 2024; Pach et al., 2025; Stevens et al., 2025) and beyond (Movva et al., 2025), we anticipate that VAEase may have widespread applicability moving forward.

Impact Statement

This paper presents work designed to better understand and enhance the field of sparse autoencoders. While in principle machine learning models of this sort could be used for nefarious purposes, there is nothing specific to our work that stands out as a notable risk factor. And generally speaking, we would argue that by increasing transparency on model behaviors, we at least reduce the risk of inadvertent misuse or misapplication.

References

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Berahmand, K., Daneshfar, F., Salehi, E. S., Li, Y., and Xu, Y. Autoencoders and their applications in machine learning: a survey. Artificial Intelligence Review, 57(2):28, 2024.

Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397-2430. PMLR, 2023.

Bourlard, H. and Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291-294, 1988.

Brehmer, J. and Cranmer, K. Flows for simultaneous manifold learning and density estimation. Advances in Neural Information Processing Systems, 33:442-453, 2020.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

Brown, B. C., Caterini, A. L., Ross, B. L., Cresswell, J. C., and Loaiza-Ganem, G. Verifying the union of manifolds hypothesis for image data. arXiv preprint arXiv:2207.02862, 2022.

Candes, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877-905, 2008.

Caterini, A. L., Loaiza-Ganem, G., Pleiss, G., and Cunningham, J. P. Rectangular flows for manifold learning. Advances in Neural Information Processing Systems, 34:30228-30241, 2021.

Chaudhary, M. and Geiger, A. Evaluating open-source sparse autoencoders on disentangling factual knowledge in gpt-2 small. arXiv preprint arXiv:2409.04478, 2024.

Chen, Y., Ge, D., Wang, M., Wang, Z., Ye, Y., and Yin, H. Strong NP-hardness for sparse optimization with concave penalty functions. In International Conference on Machine Learning, 2017.

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. International Conference on Learning Representations, 2024.

Dai, B., Wang, Y., Aston, J., Hua, G., and Wipf, D. Connections with robust PCA and the role of emergent sparsity in variational autoencoder models. Journal of Machine Learning Research, 19(41):1-42, 2018.

Dai, B., Wenliang, L., and Wipf, D. On the value of infinite gradients in variational autoencoder models. Advances in Neural Information Processing Systems, 34:7180-7192, 2021.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.

Fazel, M., Hindi, H., and Boyd, S. P. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proceedings of the 2003 American Control Conference, volume 3, pp. 2156-2162. IEEE, 2003.

Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983-1049, 2016.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, Chapter 14. MIT Press, 2016.


Hinton, G. and Zemel, R. Autoencoders, minimum description length and helmholtz free energy. Advances in Neural Information Processing Systems, 6, 1993.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Horvat, C. and Pfister, J.-P. On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models. arXiv preprint arXiv:2402.03845, 2024.

Kamkari, H., Ross, B. L., Hosseinzadeh, R., Cresswell, J. C., and Loaiza-Ganem, G. A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models. arXiv preprint arXiv:2406.03537, 2024.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696-21707, 2021.

Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., and Barez, F. Sparse autoencoders reveal universal feature spaces across large language models. arXiv preprint arXiv:2410.06981, 2024.

LeCun, Y. Modèles connexionnistes de l'apprentissage. PhD thesis, Université de Paris VI, 1987.

Luo, C. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.

Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.

Mathieu, E. and Nickel, M. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503-2515, 2020.

Movva, R., Peng, K., Garg, N., Kleinberg, J., and Pierson, E. Sparse autoencoders for hypothesis generation. arXiv preprint arXiv:2502.04382, 2025.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30, 2017.

Ng, A. Sparse autoencoder. CS294A Lecture notes, 72(2011):1-19, 2011.

Nussbaum, Z., Morris, J. X., Duderstadt, B., and Mulyar, A. Nomic embed: Training a reproducible long context text embedder, 2024.

O'Neill, C., Ye, C., Iyer, K., and Wu, J. F. Disentangling dense embeddings with sparse autoencoders. arXiv preprint arXiv:2408.00657, 2024.

Pach, M., Karthik, S., Bouniot, Q., Belongie, S., and Akata, Z. Sparse autoencoders learn monosemantic features in vision-language models. arXiv preprint arXiv:2504.02821, 2025.

Palmer, J., Wipf, D., Kreutz-Delgado, K., and Rao, B. Variational EM algorithms for non-Gaussian latent variable models. Advances in Neural Information Processing Systems, pp. 1059-1066, 2006.

Ponti, E. M. and Martins, A. Dynamic sparsity in machine learning. NeurIPS Tutorial, 2024.

Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. Advances in Neural Information Processing Systems, 19, 2006.

Ranzato, M., Boureau, Y.-L., LeCun, Y., et al. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20, 2007.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Rumelhart, D., Hinton, G., and Williams, R. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructures of Cognition, MIT Press, Vol. I:318-362, 1986.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265, 2015.

Stanczuk, J. P., Batzolis, G., Deveney, T., and Schönlieb, C.-B. Diffusion models encode the intrinsic dimension of data manifolds. In Forty-first International Conference on Machine Learning, 2024.

Stevens, S., Chao, W.-L., Berger-Wolf, T., and Su, Y. Sparse autoencoders for scientifically rigorous interpretation of vision models. arXiv preprint arXiv:2502.06755, 2025.

Tempczyk, P., Michaluk, R., Garncarek, L., Spurek, P., Tabor, J., and Golinski, A. LIDL: Local intrinsic dimension estimation using approximate likelihood. In International Conference on Machine Learning, pp. 21205-21231. PMLR, 2022.


Wipf, D. Marginalization is not marginal: No bad VAE local minima when learning optimal sparse representations. In International Conference on Machine Learning, pp. 37108-37132, 2023.

Xia, Y., Cao, X., Wen, F., Hua, G., and Sun, J. Learning discriminative reconstructions for unsupervised outlier removal. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511-1519, 2015.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Zheng, Y., He, T., Qiu, Y., and Wipf, D. P. Learning manifold dimensions with conditional variational autoencoders. Advances in Neural Information Processing Systems, 35:34709-34721, 2022.


A. Further Analysis of Sparse Autoencoding (SAE) Models

In this section we further elaborate relevant SAE modeling details that were omitted from the main text for space considerations. Later in Appendix B we provide complementary information regarding VAE models.

A.1. Degeneracies in Deterministic SAE Models

We will now introduce a simplified scenario whereby the SAE loss from (1) can have degenerate minima that undermine the central purpose of sparse autoencoding. Consider the case where the SAE decoder network is given by the linear function $\mu_x(z; \theta) = Wz$, with $\theta = W \in \mathbb{R}^{d \times \kappa}$. Furthermore, for some input sample $x$, define

$$z^* := \arg\min_z \|x - Wz\|_2^2. \tag{9}$$

In general $z^*$ will not have many or most values equal to or nearly zero; however, we can always fabricate such a solution by a trivial transformation. Specifically, we can make all elements of $z^*$ arbitrarily small by the rescaling $z^* \to \alpha z^* \approx 0$ for small $\alpha > 0$, while maintaining an equivalent reconstruction error of $x$ via the revised weights $W \to \alpha^{-1} W$. In this way we can reduce any smooth penalty factor $h(z)$, with minimum at zero, to an arbitrarily small value just by rescaling. But of course this defeats the whole purpose of including a sparse penalty in the first place. It is precisely because of this possibility that we must introduce $\|\theta\|_2^2$ into (1); by including this factor we limit the compensatory growth of $\theta = W$ such that $h$ can produce non-trivial sparse solutions as desired.

While this particular scenario with a linear decoder is chosen for transparency, it nonetheless illustrates a broader complication with this type of model. Even if we were to deepen the decoder network by stacking additional nonlinear layers, the same scaling ambiguity will generally exist unless some form of restriction on θ is included.
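The scaling degeneracy described above is easy to reproduce numerically. The sketch below assumes the SAE loss in (1) takes the form of a squared reconstruction error plus λ1 h(z) plus λ2 ‖θ‖², here with an ℓ1 penalty for h; the dimensions, seed, and coefficient values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa = 8, 12
W = rng.standard_normal((d, kappa))
z = rng.standard_normal(kappa)
x = W @ z                                  # zero reconstruction error by construction

def sae_loss(W, z, x, lam1, lam2):
    """Assumed form of (1): reconstruction + lam1 * ||z||_1 + lam2 * ||W||_F^2."""
    recon = np.sum((x - W @ z) ** 2)
    return recon + lam1 * np.sum(np.abs(z)) + lam2 * np.sum(W ** 2)

alpha = 1e-3
z_small, W_big = alpha * z, W / alpha      # rescaling leaves W z unchanged

# Without a weight penalty the rescaled pair looks far better...
no_wd = (sae_loss(W, z, x, 1.0, 0.0), sae_loss(W_big, z_small, x, 1.0, 0.0))
# ...but with one, the compensatory growth of W is heavily penalized.
with_wd = (sae_loss(W, z, x, 1.0, 1e-2), sae_loss(W_big, z_small, x, 1.0, 1e-2))
```

In the first pair the loss shrinks by roughly the factor α even though the code is no sparser in any meaningful sense; in the second pair the rescaled solution is heavily disfavored.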

One exception to this rule is when h is replaced by a top-k constraint (Gao et al., 2024; Makhzani & Frey, 2013) as mentioned in Section 2.1. Although this substitution mitigates scaling degeneracies, it introduces new complications. First, the top-k constraint produces discontinuities in the loss surface that may interfere with timely convergence of the learning process. But secondly, and perhaps more importantly, it forces the latent representation of every input x to have exactly the same number of degrees-of-freedom (i.e., k), regardless of varying levels of input complexity.

A.2. Maximal Sparsity Using an Infeasible SAE Model

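As a point of reference, it is worthwhile to consider an SAE model capable of idealized adaptive sparsity akin to VAEase. To this end, consider the SAE loss from (1) actualized with $h(z) = \|z\|_0$ and $\lambda_2 = 0$. In this case we have

$$\mathcal{L}_{SAE}(\phi, \theta) = \int_{\mathcal{X}} \Big\{ \|x - \mu_x(z; \theta)\|_2^2 + \lambda_1 \|z\|_0 \Big\} \, \omega(dx) \;\equiv\; \int_{\mathcal{X}} \Big\{ \tfrac{1}{\lambda_1} \|x - \mu_x(z; \theta)\|_2^2 + \|z\|_0 \Big\} \, \omega(dx) \tag{10}$$

subject to $z = \mu_z(x; \phi)$, where the second form rescales the objective by $1/\lambda_1$, which leaves the minimizers unchanged. In the limit as $\lambda_1$ becomes small, minimizing $\mathcal{L}_{SAE}(\phi, \theta)$ in this form is asymptotically equivalent to solving the constrained problem

$$\min_{\phi, \theta} \int_{\mathcal{X}} \|z\|_0 \, \omega(dx) \quad \text{s.t.} \quad x = \mu_x(z; \theta), \;\; z = \mu_z(x; \phi) \;\; \forall x \in \mathcal{X}. \tag{11}$$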

In this way, the SAE model can be designed to enforce perfect reconstructions using the minimal number of nonzero latent dimensions. Such representations will generally correspond with the ground-truth manifold dimensions for data drawn according to Definition 4.1 along with Lipschitz continuous encoder and decoder networks µz and µx. While conceptually notable, the objective in (11) is of course neither differentiable nor feasible to optimize, unlike our VAEase approach. We visualize SAE approximate solutions in Appendix A.3 below.


A.3. Approximate Sparsity Using SAE-ℓ1 and SAE-log Models

Figure 5: Because the ideal ℓ0 norm cannot be optimized via SGD, some form of smooth approximation is required by SAE models in practice. Here we show histograms of |z| values (on a log scale) for SAE-ℓ1 (left) and SAE-log (right) on MNIST data. Because the weaker convex ℓ1 norm only loosely approximates the ℓ0 norm, there is no clear-cut partitioning between active (large values) and inactive (very small) dimensions. In contrast, the non-convex log-based penalty creates a clear separation. This separation is readily detectable by the variance criterion described in Section 5 as adopted in our experiments (applied in log-space).

B. Further Analysis of VAE-based Models

This section provides more intuitive explanations for why VAEs struggle with adaptive sparsity, while VAEase does not.

B.1. Why Fixed Sparsity in Existing VAE models

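Linear Decoder Case: We illustrate the VAE tendency towards fixed sparsity using a linear decoder mean network given by $\mu_x(z; \theta) = Wz$, with $\theta = W \in \mathbb{R}^{d \times \kappa}$, analogous to the setup from Appendix A.1 although now applied to a new context. Granting this assumption and an arbitrary input sample $x$, the VAE reconstruction loss will satisfy

$$-\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \;\equiv\; \mathbb{E}_{q_\phi(z|x)}\Big[\tfrac{1}{\gamma} \|x - Wz\|_2^2\Big] = \tfrac{1}{\gamma} \|x - W\mu_z(x; \phi)\|_2^2 + \sum_{j=1}^{\kappa} \frac{\sigma_z(x; \phi)_j^2}{\gamma} \|w_{:j}\|_2^2, \tag{12}$$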

where $w_{:j}$ denotes the j-th column of W and the first equivalence ignores constant factors independent of W. As mentioned in Section 2.2 and proven in prior work, for inactive dimensions we have $q_\phi(z_j|x) \approx \mathcal{N}(z_j|0, 1)$, which implies $\sigma_z(x; \phi)_j \approx 1$. Hence the only way to avoid a divergent reconstruction loss as $\gamma \to 0$ is to have $w_{:j} \to 0$ for inactive dimensions. In contrast, for an active dimension $j'$, recall that $q_\phi(z_{j'}|x) \approx \mathcal{N}\big(z_{j'} \,|\, \mu_z(x; \phi)_{j'}, O(\gamma)\big)$. And so from (12) the factor

$$\frac{\sigma_z(x; \phi)_{j'}^2}{\gamma} \|w_{:j'}\|_2^2 = \frac{O(\gamma)}{\gamma} \|w_{:j'}\|_2^2 = O(1) \, \|w_{:j'}\|_2^2 \tag{13}$$

will remain well-behaved even for $w_{:j'} \neq 0$. And then of course since W is shared across all $x \in \mathcal{X}$, zero- and nonzero-valued channels (i.e., columns of W) will remain permanently so for all inputs, leading to a fixed-sparsity solution.
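The closed-form expectation in (12) can be checked numerically for a Gaussian posterior with diagonal covariance. The following sketch uses arbitrary dimensions, a fixed seed, and hand-picked σ values (two near one, two near zero) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, kappa, gamma = 6, 4, 0.1
W = rng.standard_normal((d, kappa))
x = rng.standard_normal(d)
mu = rng.standard_normal(kappa)               # stands in for mu_z(x; phi)
sigma = np.array([1.0, 1.0, 0.05, 0.05])      # stands in for sigma_z(x; phi)

# Right-hand side of (12): mean term plus per-column variance terms.
col_norms_sq = np.sum(W ** 2, axis=0)         # ||w_{:j}||_2^2 for each column j
closed_form = (np.sum((x - W @ mu) ** 2) + np.sum(sigma ** 2 * col_norms_sq)) / gamma

# Monte Carlo estimate of E_q[(1/gamma) ||x - W z||^2] with z ~ N(mu, diag(sigma^2)).
z = mu + sigma * rng.standard_normal((200_000, kappa))
monte_carlo = np.mean(np.sum((x - z @ W.T) ** 2, axis=1)) / gamma

rel_err = abs(monte_carlo - closed_form) / closed_form
```

Dimensions with σ near one contribute ‖w:j‖²/γ to the loss, which grows without bound as γ shrinks unless the corresponding column of W is driven to zero.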

Nonlinear Decoder Case: When we generalize to nonlinear decoders, the problem raised above largely persists, although the situation becomes more nuanced. The added wrinkle in the nonlinear case is that a sufficiently complex decoder could in principle learn a form of partial adaptive sparsity pattern unlike in the linear setting, although with notable caveats that undermine practical realizations. Why is there such a gap between principle and practice here? To answer this question, it helps to contrast the core challenge facing an SAE decoder versus a VAE decoder.

Given a well-trained SAE model with suitable latent-space regularization, we may assume that $z = \mu_z(x; \phi)$ is a sparse vector in $\mathbb{R}^\kappa$. This implies that inactive dimensions can be trivially identified by simply checking which entries are equal to zero (or at least nearly so vis-à-vis a small threshold). Now consider a corresponding VAE model producing a latent representation with the exact same $\mu_z(x; \phi)$, now serving as a posterior mean network, along with the posterior standard deviation $\sigma_z(x; \phi)$ satisfying

$$\sigma_z(x; \phi)_j \approx \begin{cases} 0, & j = \text{active dim} \\ 1, & j = \text{inactive dim} \end{cases}. \tag{14}$$


In this revised situation, what were once zero-valued inactive dimensions of z are now filled with samples from N(0, 1), and are thus much less distinguishable from the active dimensions. Figure 2 in the main text provides a toy illustration of this contrast. Because of this ambiguity, with limited capacity in the linear decoder setting from above, the only sure-fire way to differentiate active and inactive dimensions is to permanently encode a shared partition by zeroing out columns of W. But in the nonlinear case there is another subtle option: One or more dimensions of z can be permanently designated as specialized active dimensions (just as in the linear case) but with an expanded, two-fold role that accommodates adaptive sparsity across the remaining decoder dimensions.

Specifically, to instantiate adaptive sparsity within a standard VAE architecture, there must exist a nonlinear decoder with ρ ≥ 1 permanently active dimensions; the remaining κ − ρ dimensions can then reflect adaptive sparsity assuming suitable decoder capacity. Of these ρ permanently active dimensions, their role is two-fold:

1. Assist with reconstructions, i.e., the typical role of any ordinary active dimension, and

2. Designate which of the remaining κ − ρ dimensions are active.

The latter is critical for instantiating adaptive sparsity, since otherwise it is not possible for the decoder to know with certainty which dimensions are active on an input-by-input basis. But of course actually implementing a decoder capable of inducing specialized dimensions that simultaneously fulfill the dual roles from above is quite challenging, and requires significant additional complexity, even if the data manifold(s) upon which x ∈ X rest are relatively simple, e.g., a union of linear subspaces. And once such complexity is introduced, the risk of the VAE overfitting to finite-sample training data increases significantly. This is because, when granted sufficient decoder capacity, a VAE model applied to a finite-sample training set can achieve perfect reconstructions using only a single active dimension, regardless of the true manifold structure (Dai et al., 2018, Theorem 5). Or stated differently, if VAE decoder capacity is expanded to instantiate adaptive sparsity, then this same capacity can be hijacked to trivially overfit to finite-sample datasets even when the underlying manifold structure is relatively simple. We discuss how the VAEase architecture avoids this conundrum in Appendix B.3.

B.2. Can Self-Attention Boost Linear Cases?

In Appendix B.1 we demonstrated why VAEs with the linear decoder $\mu_x(z;\theta) = Wz$ will necessarily lead to fixed-sparsity representations. This is the same decoder assumed by recent models of LLM activation patterns adopted for interpretability purposes (Cunningham et al., 2024). Therefore, as things presently stand, a VAE is not suitable for handling LLM activation data where adaptive sparse representations are required (indeed, this explains the poor VAE performance on the LLM data from Table 3). However, it is still reasonable to ask: what if we were to modify the VAE decoder to

$$\mu_x(z;\theta) = W\tau(z), \tag{15}$$

where $\tau : \mathbb{R}^\kappa \to \mathbb{R}^\kappa$ is an arbitrary parameterized function, e.g., a self-attention layer capable of learning adaptive sparse combinations of the columns of $W$. In this way we could presumably retain the interpretability provided by the original linear decoder basis vectors, while avoiding rigid, fixed-sparsity solutions. Unfortunately, this approach will be ineffectual because there can exist an infinite space of VAE global minimizers whereby $\tau(z)$ is not sparse. The reason is that VAE sparsity is induced by KL regularization as applied to the posterior mean and variance of $z$, but $\tau$ operates solely within the decoder, where no regularization effect of any kind exists. Hence even if the informative dimensions of $z$ are sparse, $\tau(z)$ can map them to an arbitrary non-sparse representation that nonetheless minimizes the reconstruction error and globally minimizes the overall VAE loss. And provided that $W$ is overcomplete, which it is by design for LLM interpretability applications, it will necessarily have a non-trivial null space such that infinitely many solutions with zero reconstruction error are possible.
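The null-space argument can be made concrete with a toy example. The sketch below is purely illustrative: the 2×3 matrix `W`, the codes, and the helper names are hypothetical choices (not taken from the experiments), picked so that one sparse code and a family of dense codes all reconstruct the same input exactly.

```python
# Toy overcomplete dictionary (d = 2, kappa = 3); all values are illustrative.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]

def decode(code):
    """Linear decoder: returns W @ code as a plain Python list."""
    return [sum(w_ij * c_j for w_ij, c_j in zip(row, code)) for row in W]

def num_nonzero(code, tol=1e-12):
    """Count the coordinates of a code that are not (numerically) zero."""
    return sum(abs(c) > tol for c in code)

def shifted(code, direction, t):
    """Move a code by t along a given direction."""
    return [c + t * n for c, n in zip(code, direction)]

# (1, 1, -1) lies in the null space of W, so adding any multiple of it
# to a code leaves the reconstruction unchanged.
null_vec = [1.0, 1.0, -1.0]

sparse_code = [0.0, 0.0, 1.0]                    # a single active column
dense_code = shifted(sparse_code, null_vec, 0.5)  # (0.5, 0.5, 0.5)

x_sparse = decode(sparse_code)
x_dense = decode(dense_code)
```

Both codes decode to the same point, yet only the first is sparse. Since no loss term acts on the output of τ, nothing in the VAE objective distinguishes between them.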

B.3. How VAEase Circumvents Fixed Sparsity

To provide more intuition into how VAEase avoids producing fixed sparsity patterns, we extend our examination of the linear decoder setting from Appendix B.1. Under these circumstances, the VAEase version of (12) is modified to

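$$-\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] = \frac{1}{\gamma}\left\|x - W\mu_z(x;\phi)\right\|_2^2 + \sum_j \frac{\sigma_z(x;\phi)_j^2\left(1-\sigma_z(x;\phi)_j^2\right)}{\gamma}\left\|w_{:j}\right\|_2^2. \tag{16}$$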


From this expression we observe that the $\sigma_z$-dependent regularization factor can now be minimized when either $\sigma_z(x;\phi)_j^2 \to 1$ for an inactive dimension, or $\sigma_z(x;\phi)_j^2 \to 0$ for an active dimension. This allows all columns of $W$ to potentially remain nonzero, which then accommodates input-adaptive sparse solutions. This same strategy transitions naturally to nonlinear decoders as well, since inactive dimensions will automatically compress noise to zero regardless of the decoder structure.
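To illustrate how inactive dimensions compress noise to zero, the sketch below applies the elementwise gate $(1-\sigma_z)$ to a reparameterized sample $z = \mu_z + \sigma_z\varepsilon$, which is the decoder input $(I - \Sigma_z^{1/2})z$ appearing in the proofs of Appendix E. The numeric values and the function name `gated_sample` are illustrative assumptions, not settings from the experiments.

```python
import random

def gated_sample(mu, sigma, rng):
    """Draw z = mu + sigma * eps, then gate each coordinate by (1 - sigma)."""
    out = []
    for m, s in zip(mu, sigma):
        eps = rng.gauss(0.0, 1.0)
        z = m + s * eps
        out.append((1.0 - s) * z)
    return out

rng = random.Random(0)
mu = [1.5, -0.7, 0.0, 0.0]
sigma = [0.01, 0.02, 1.0, 1.0]   # the last two dimensions are inactive (sigma = 1)

samples = [gated_sample(mu, sigma, rng) for _ in range(1000)]

# Inactive coordinates are exactly zero for every draw; active ones stay near mu.
inactive_all_zero = all(s[2] == 0.0 and s[3] == 0.0 for s in samples)
max_active_dev = max(abs(s[0] - mu[0]) for s in samples)
```

Because the gate multiplies the full sample, the unit-variance noise on an inactive dimension never reaches the decoder, regardless of whether the decoder is linear or nonlinear.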

Before concluding this section, it is worth considering a possible alternative way of achieving adaptive sparsity. Specifically, suppose that instead of the VAEase technique of swapping $\tilde{z}$ for $z$, we simply concatenate $\sigma_z$ with $z$ to form the decoder function $\mu_x([z;\sigma_z(x;\phi)];\theta)$. As it turns out, this strategy is not viable, since the decoder can then learn to completely ignore $z$ and use $\sigma_z$ directly for input reconstruction. This incurs only an $O(1)$ penalty from the KL loss term, while allowing for negligible reconstruction error as measured by (5), provided $\kappa$ is sufficiently large. The associated $\log\gamma$ normalization factor (as shared by VAE models) will then tend towards minus infinity, pushing the overall loss towards minus infinity. Moreover, the KL loss applied in this way does not favor sparse solutions, and so the whole enterprise will not be successful in aligning with manifold structure.

C. Exploring Data Label Alignment with VAEase Active Dimensions

In following a manifold hypothesis for high-dimensional data (Bengio et al., 2013), numerous studies have focused on estimating manifold dimensions (Ansuini et al., 2019; Brehmer & Cranmer, 2020; Mathieu & Nickel, 2020; Caterini et al., 2021). Along these lines it has been pointed out that the intrinsic dimension of a dataset (akin to the manifold dimension under the manifold assumption), rather than the ambient dimension, plays a more crucial role in model fitting (Fefferman et al., 2016; Pope et al., 2021). Subsequently, a union of manifolds hypothesis has been introduced to extend the original manifold hypothesis (Brown et al., 2022); this extension aligns more closely with intuitions regarding real-world labeled datasets. Moreover, companion studies indicate that local intrinsic dimension of data points may vary across datasets, naturally so for points belonging to different classes.

Conceptually, data points with the same label likely exhibit inherent consistency in associated features, although assuming that their sample set forms a single manifold is unlikely to hold in general. Thus, following (Brown et al., 2022), we use the VAEase models trained in Section 5.2 to estimate the active dimensions of the MNIST and Fashion MNIST datasets, and group the data by labels. We then compute the average differences in active dimensions within and between groups, with the results presented in Table 5. The difference is computed as $|A \cup B| - |A \cap B|$ on two sets $A$ and $B$.

Table 5: Average differences in VAEase active dimensions within and between groups defined by class labels.

| Dataset | Intra-class | Inter-class |
|---|---|---|
| MNIST | 0.3091 | 0.4861 |
| Fashion MNIST | 0.6761 | 2.1143 |

The results in Table 5 suggest that the labels in MNIST bear some significance w.r.t. the partitioning of the manifolds, and even more so for Fashion MNIST. This is despite the fact that model training is completely independent of the class labels themselves.
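The intra-class and inter-class comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the active sets and labels are hypothetical toy values, and the averaging over all unordered pairs of samples is our guess at the procedure, which the text does not spell out.

```python
from itertools import combinations

def set_difference_size(a, b):
    """|A ∪ B| - |A ∩ B|: number of dimensions active in exactly one of the two sets."""
    return len(a | b) - len(a & b)

def average_differences(active_sets, labels):
    """Average pairwise difference within classes and between classes."""
    intra, inter = [], []
    for (a, label_a), (b, label_b) in combinations(zip(active_sets, labels), 2):
        bucket = intra if label_a == label_b else inter
        bucket.append(set_difference_size(a, b))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical active-dimension index sets for five samples from two classes.
active_sets = [{0, 1, 2}, {0, 1, 2}, {0, 1, 3}, {4, 5}, {4, 5, 6}]
labels = [0, 0, 0, 1, 1]

intra_avg, inter_avg = average_differences(active_sets, labels)
```

In this toy input the intra-class average is smaller than the inter-class average, which is the qualitative pattern reported in Table 5.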

D. Experimental Settings

D.1. 1D Local Minima Plots from Figure 3

For SAE, we instantiate the energy from (1) using λ1 = λ2 = 0.1 and h(z) = log(|z| + ϵ) where ϵ = 1e-4. For VAEase we fix γ = 0.1. We treat µz as a free parameter for both models; likewise for σz as required by VAEase. For the decoders we adopt µx = wz, where w is a learnable parameter that has been optimized out of both SAE and VAEase models prior to plotting the figures.

D.2. Synthetic Datasets

Data generation. For the MLP dataset, the data in each manifold are mapped by an MLP containing 2 linear layers with leaky ReLU activation. Specifically, the input dimensions of the MLPs are ri, matching the respective manifold dimensions stated in the main text. The hidden-layer dimensions are likewise ri. The output dimension is d, the same as the ambient dimension. The slope of the leaky ReLU is 0.2.
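A minimal sketch of this generation process for a single manifold is given below. Several details are assumptions on our part, since the text does not specify them: Gaussian weights and latent inputs, no bias terms, and the leaky ReLU placed between the two linear layers. All function names are illustrative.

```python
import random

def leaky_relu(v, slope=0.2):
    """Elementwise leaky ReLU with the stated slope of 0.2."""
    return [a if a >= 0 else slope * a for a in v]

def linear(weights, v):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def make_manifold_sampler(r, d, rng):
    """Two linear layers (r -> r -> d) with a leaky ReLU in between."""
    w1 = random_matrix(r, r, rng)   # input dim r, hidden dim r
    w2 = random_matrix(d, r, rng)   # output dim d (ambient dimension)

    def sample():
        latent = [rng.gauss(0.0, 1.0) for _ in range(r)]
        return linear(w2, leaky_relu(linear(w1, latent)))

    return sample

rng = random.Random(0)
sampler = make_manifold_sampler(r=3, d=10, rng=rng)
points = [sampler() for _ in range(5)]   # five points on one r = 3 manifold in R^10
```

A union of manifolds would be obtained by building one sampler per manifold, each with its own r and weights, and pooling the resulting points.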

Model architectures and training. For the linear dataset, the encoders contain 4 linear layers connected by the Swish activation function, and the decoders are a single linear layer. The hidden dimension in the encoders is set to 2κ. For the MLP dataset, the encoders contain 3 residual blocks, each consisting of 3 linear layers, and the decoders are 2-layer MLPs with leaky ReLU activation of slope 0.2. The hidden dimension of the residual blocks is 8κ. Models were trained for 150 epochs on the linear dataset and 310 epochs on the MLP dataset. The batch size was set to 1024. We choose κ = 20 for the linear dataset and κ = 60 for the MLP dataset. The learning rate for VAE models was 0.01 on the linear dataset and 0.005 on the MLP dataset, while the learning rate for SAE models was 0.002 on the linear dataset and 0.005 on the MLP dataset. The optimizer is Adam and the learning rate scheduler is Cosine Annealing Warm Restarts with T0 = 10. The penalty weights for SAE models in (1) are {λ1 = 1e-3, λ2 = 1e-4} on the linear dataset. On the MLP dataset, the weights are {λ1 = 5e-4, λ2 = 1e-5} and {λ1 = 5e-6, λ2 = 1e-5} for SAE-ℓ1 and SAE-log models, respectively. These hyperparameters are chosen to produce sufficiently small reconstruction errors (≤ 1e-3) while still enforcing sparsity as needed to facilitate head-to-head comparison with VAEs (which also produce negligible reconstruction error without any loss tuning parameters, i.e., γ is learned during training).
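For readers unfamiliar with the scheduler, the sketch below evaluates the cosine-annealing-with-warm-restarts rule at integer epochs. It assumes a fixed restart period (no period multiplier) and a minimum learning rate of zero, neither of which is stated in the text; the function name is illustrative and this is not the training code itself.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr, t0, eta_min=0.0):
    """Learning rate at an integer epoch, with a restart every t0 epochs."""
    t_cur = epoch % t0   # epochs elapsed since the last restart
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t0))

# VAE setting on the linear dataset: base learning rate 0.01, T0 = 10.
schedule = [cosine_warm_restart_lr(e, base_lr=0.01, t0=10) for e in range(21)]
```

Under these assumptions the rate decays from 0.01 towards zero over each 10-epoch cycle and jumps back to 0.01 at epochs 10 and 20.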

D.3. Real-World Datasets

MNIST and Fashion MNIST. Model encoders for the MNIST dataset contain 1 input convolution layer and 2 residual blocks, where each block has 2 convolution layers. κ is 32. The decoders consist of 1 dense NN, 1 upsampling layer, 2 residual blocks, and a convolution layer. The residual blocks here also have 2 convolution layers each. Models were trained for 300 epochs on the MNIST dataset using batch size 2048. The learning rate is 0.05. The optimizer is Adam and the learning rate scheduler is Cosine Annealing Warm Restarts with T0 = 20. The penalty weights are {λ1 = 5e-1, λ2 = 1e-1} and {λ1 = 1e-2, λ2 = 5e-2} for SAE-ℓ1 and SAE-log models on the MNIST dataset, and {λ1 = 2e-2, λ2 = 2e-2} and {λ1 = 1.3e-2, λ2 = 2e-2} on the Fashion MNIST dataset, correspondingly. Additionally, the estimated |z| values follow a heavy-tailed distribution as shown in Figure 5, so it is more reasonable to apply the variance criterion w.r.t. log |z| for dividing active and inactive dimensions.

LLM intermediate-layer activations. As for data, online code is available for gathering the intermediate-layer activations.⁷ For model architectures, the encoders for the LLM dataset consist of only one linear layer and a ReLU layer, and the decoders are simply linear transforms, consistent with popular use cases (Chaudhary & Geiger, 2024). For VAE models, there is also a linear layer as needed to output σz. Models were trained for 35 epochs with batch size 2048. The learning rate is 0.005 and κ is 300. The learning rate scheduler is Cosine Annealing Warm Restarts with T0 = 5. The penalty weights on this dataset are {λ1 = 3.2e-1, λ2 = 2e-1} and {λ1 = 1e-2, λ2 = 2e-1} for SAE-ℓ1 and SAE-log models, respectively, tuned for similar reconstruction error.

Yelp text embedding dataset. The Yelp dataset was obtained from Hugging Face.⁸ The model architectures used for these experiments are the same as those described above for the LLM intermediate activations. Models were trained for 150 epochs with a batch size of 512. The learning rate is 0.001 and κ is 512. The learning rate scheduler is Cosine Annealing Warm Restarts with T0 = 10. The penalty weights are {λ1 = 3e-3, λ2 = 1e-2} and {λ1 = 2e-4, λ2 = 5e-4} for SAE-ℓ1 and SAE-log models, respectively, tuned for similar reconstruction error as in previous experiments.

Pseudo-MNIST. We trained a GAN on MNIST for 500 epochs using publicly available code.⁹ Subsequently we generated 500,000 new samples to form the pseudo-MNIST dataset. We adopted the same encoder/decoder architecture for VAEase as used for the original MNIST data. We then trained the VAEase model for 35 epochs on this dataset using batch size 512. κ is 100 and the learning rate is 0.005. The optimizer is Adam and the learning rate scheduler is Cosine Annealing Warm Restarts with T0 = 50. For the diffusion model, we adopted the code¹⁰ associated with Stanczuk et al. (2024) and trained for 70 epochs. The NB post-processing manifold dimension estimator is also implemented using the same repo. Meanwhile, the FLIPD estimator originates from a separate repo.¹¹

⁷ https://github.com/HoagyC/sparse_coding

⁸ https://huggingface.co/datasets/rmovva/HypotheSAEs

⁹ https://github.com/eriklindernoren/PyTorch-GAN

¹⁰ https://github.com/GBATZOLIS/ID-diff

¹¹ https://github.com/layer6ai-labs/diffusion_memorization/blob/main/flipd_utils.py


D.4. Representative Samples from Pseudo-MNIST

Figure 6: Samples from pseudo-MNIST. On the left we show samples from the original MNIST, while on the right are samples from the proposed pseudo-MNIST; the latter are generated by a GAN trained on MNIST with a known upper bound on the manifold dimension. Because the two sets of samples are visually quite similar, this increases our confidence that the pseudo-MNIST data manifold dimension is a reasonable approximation to that of the original MNIST.

E. Technical Proofs

E.1. Proof of Theorem 4.5

We first construct a feasible solution and compute its loss rate, which serves as an upper bound on the rate of the optimal loss. We then prove that no solution can achieve a lower rate, and derive necessary conditions for rate-optimal solutions by analyzing the order of each term in the loss function.

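A feasible solution and upper bound. First, we construct a solution that gives an upper bound on the optimal loss rate. For any manifold $\mathcal{M}_i \subset \mathcal{M}$, there exists an $L$-Lipschitz invertible function $\psi_i(\cdot)$ that maps $\mathcal{M}_i$ to $\mathbb{R}^{r_i}$ with $\psi_i^{-1}(0) = 0$. Then we can construct $\mu_z$, $\Sigma_z \overset{\text{def}}{=} \mathrm{diag}[\sigma_z^2]$ and $\mu_x$ as follows:

$$\mu_z = \left(\frac{\mathbb{I}(x\in\mathcal{M}_1)\,\psi_1(x)^\top}{1-\gamma^{1/2}}, \ldots, \frac{\mathbb{I}(x\in\mathcal{M}_n)\,\psi_n(x)^\top}{1-\gamma^{1/2}}, \mathbf{0}^\top\right)^\top,$$

$$\Sigma_z = \mathrm{diag}\left\{\left(e_{r_1}^\top - \mathbb{I}(x\in\mathcal{M}_1)(1-\gamma)e_{r_1}^\top, \ldots, e_{r_n}^\top - \mathbb{I}(x\in\mathcal{M}_n)(1-\gamma)e_{r_n}^\top, e_{\kappa-\sum_i r_i}^\top\right)^\top\right\},$$

$$\mu_x = \sum_i \psi_i^{-1}\left(\left\{(I-\Sigma_z^{1/2})z\right\}_{\sum_{j=1}^{i-1} r_j \,:\, \sum_{j=1}^{i-1} r_j + r_i}\right),$$

where $\mathbf{0}$ is an all-zero vector of dimension $\kappa - \sum_i r_i$ and $e_r$ denotes an all-ones vector of dimension $r$. As the sum of multiple $L$-Lipschitz functions, $\mu_x$ is naturally also $L$-Lipschitz on the set $\mathcal{M}$, satisfying Definition 4.2. We now compute the corresponding loss to check whether it reaches the lower bound we discussed.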

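$$\begin{aligned}
\int_{\mathcal{M}} \mathbb{E}_{q_\phi(z|x)}\left[\|x-\mu_x\|^2\right]\omega(dx)
&= \sum_i \int_{\mathcal{M}_i} \mathbb{E}_{q_\phi(z|x)}\left[\left\|\psi_i^{-1}(\psi_i(x)) - \psi_i^{-1}\left(\psi_i(x) + (1-\gamma^{1/2})\gamma^{1/2}\varepsilon_i\right)\right\|^2\right]\omega(dx) \\
&\le \sum_i \int_{\mathcal{M}_i} L^2\, \mathbb{E}_{\varepsilon_i}\left[\left\|\psi_i(x) - \left(\psi_i(x) + (1-\gamma^{1/2})\gamma^{1/2}\varepsilon_i\right)\right\|^2\right]\omega(dx) \\
&= \sum_i \int_{\mathcal{M}_i} L^2 (1-\gamma^{1/2})^2\gamma r_i\,\omega(dx) = L^2\sum_i (1-\gamma^{1/2})^2\gamma r_i\alpha_i.
\end{aligned}$$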
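The construction can be sanity-checked numerically. The sketch below builds the posterior mean and standard deviation for a point on one manifold and confirms that the gated latent $(I-\Sigma_z^{1/2})z$ equals $\psi_i(x) + (1-\gamma^{1/2})\gamma^{1/2}\varepsilon_i$ on the active block and zero elsewhere. The block sizes, coordinates, and function names are hypothetical choices made for illustration only.

```python
import math
import random

def construct_posterior(psi, block_sizes, member, gamma, kappa):
    """Posterior mean and standard deviation for a point on manifold `member`.

    `psi` holds the coordinates psi_member(x); its length must equal
    block_sizes[member].
    """
    root = math.sqrt(gamma)
    mu, sigma = [], []
    for i, r in enumerate(block_sizes):
        if i == member:
            mu += [p / (1.0 - root) for p in psi]   # rescaled coordinates
            sigma += [root] * r                      # variance gamma (active)
        else:
            mu += [0.0] * r
            sigma += [1.0] * r                       # variance 1 (inactive)
    pad = kappa - sum(block_sizes)                   # unused latent dimensions
    return mu + [0.0] * pad, sigma + [1.0] * pad

def gated_latent(mu, sigma, eps):
    """Decoder input (I - Sigma_z^{1/2}) (mu_z + Sigma_z^{1/2} eps)."""
    return [(1.0 - s) * (m + s * e) for m, s, e in zip(mu, sigma, eps)]

gamma = 1e-2
block_sizes = [1, 2]        # r_1 = 1, r_2 = 2
psi = [0.3, -1.2]           # coordinates of a point on the second manifold
mu, sigma = construct_posterior(psi, block_sizes, member=1, gamma=gamma, kappa=5)

rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(5)]
z_tilde = gated_latent(mu, sigma, eps)

# Expected active block: psi + (1 - sqrt(gamma)) * sqrt(gamma) * eps.
scale = (1.0 - math.sqrt(gamma)) * math.sqrt(gamma)
expected_active = [p + scale * e for p, e in zip(psi, eps[1:3])]
```

With an identity-like $\psi_i$ (so that $L = 1$), the expected squared deviation of the active block from $\psi_i(x)$ is $(1-\gamma^{1/2})^2\gamma r_i$, matching the integrand in the bound above.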


Then the KL term can be computed as

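$$\begin{aligned}
2\int_{\mathcal{M}} \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)\omega(dx)
&= \int_{\mathcal{M}} \left(\mathrm{tr}(\Sigma_z) - \log|\Sigma_z| + \mu_z^\top\mu_z - \kappa\right)\omega(dx) \\
&= \sum_i\int_{\mathcal{M}_i} \left(\gamma r_i + (\kappa - r_i) - r_i\log\gamma + \mu_z^\top\mu_z - \kappa\right)\omega(dx) \\
&\le \sum_i\int_{\mathcal{M}_i} \left((\gamma-1)r_i - r_i\log\gamma + L^2 x^\top x\right)\omega(dx) \\
&= -\sum_i\alpha_i r_i\log\gamma + L^2\int_{\mathcal{M}} x^\top x\,\omega(dx) + O(1).
\end{aligned}$$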

Hence the upper bound on the optimal loss rate is $(d - \sum_i \alpha_i r_i)\log\gamma/2 + O(1)$, obtained by the perfect reconstruction functions $\mu_x$, $\mu_z$ and $r_i$ instances of $\sigma_{zj} = O(\gamma)$ for any $x$.

Note that we construct this solution without explicitly considering the intersection of the manifolds, for the following reasons. First, if the region of intersection lies in a lower-dimensional manifold, then according to Definition 4.1 it has zero probability measure and therefore contributes nothing to the overall loss. Second, if the intersection dimension matches the dimensions of the corresponding intersecting manifolds, then these manifolds can be merged into a single manifold. Either way, we are able to integrate the loss on well-defined manifolds one by one.

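Optimal rate and associated necessary conditions. We now show that the above loss rate is also optimal and that the number of active dimensions equals the local manifold dimension almost surely. Consider the loss function of the VAEase model,

$$\begin{aligned}
\mathcal{L}_{\text{VAEase}}(x;\phi,\theta,\gamma) &= -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \\
&= \frac{d}{2}\log(2\pi\gamma) + \mathbb{E}_{q_\phi(z|x)}\left[\frac{\|x-\mu_x\|_2^2}{2\gamma}\right] + \frac{1}{2}\left\{\mathrm{tr}(\Sigma_z) - \log|\Sigma_z| + \mu_z^\top\mu_z - \kappa\right\}.
\end{aligned}$$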

Assume there exists a solution that can achieve a rate smaller than $(d - \sum_i \alpha_i r_i)\log\gamma/2 + O(1)$. Since the terms $\mathrm{tr}(\Sigma_z) + \mu_z^\top\mu_z - \kappa$ and the reconstruction error are $O(1)$ in the solution shown above, the only way to reduce the order of the loss is to reduce the term $-\log|\Sigma_z|$.

Optimal rate for active dimensions. Suppose there is a set $\mathcal{X} \subset \mathcal{M}_i$ such that, for any $x \in \mathcal{X}$, $-\log|\Sigma_z| \ne r_i\,\Omega(-\log\gamma)$ and $\int_{\mathcal{X}}\omega(dx) > 0$. Then the number of active dimensions for $x$ must be less than $r_i$. Suppose that there are $r_i - 1$ active dimensions for which $\sigma_z^2(x;\phi_\gamma)_j = O(\gamma)$, and there is one dimension for which $\sigma_z^2(x;\phi_\gamma)_j = O(g(\gamma)) > O(\gamma) \to 0$ as $\gamma \to 0$ (if there is no other dimension $j$ satisfying $\lim_{\gamma\to 0}\sigma_z^2(x;\phi_\gamma)_j = 0$, the reconstruction error would be $\Theta(1)$, resulting in infinite loss). We will show that the reconstruction error cannot remain at $O(\gamma)$ in this case.

Since $\mathcal{X}$ is a subset of the manifold $\mathcal{M}_i$ and its measure is not zero, at least $r_i$ dimensions of information are needed to reconstruct $\mathcal{X}$ with Lipschitz functions. Without loss of generality, we set the channels used for reconstruction to the first $r_i$ dimensions, i.e., $\mu(x) = \mu(x_{1:r_i})$. To attain the lower reconstruction loss, the function $\mu$ must map $\mathbb{R}^{r_i}$ to $\mathcal{X}$. In addition, there exist a compact set $\mathcal{Z} \subset \mathbb{R}^{r_i}$ and a positive constant $l$ satisfying $\|\mu(z_1) - \mu(z_2)\| \ge l\|z_1 - z_2\|$ for any $z_1, z_2 \in \mathcal{Z}$.

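Then we have

$$\begin{aligned}
&\int_{\mathcal{X}}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{X}}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\,\Big|\,\|\varepsilon\|\le 1\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{X}}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\,\Big|\,\{\|\varepsilon\|\le 1\}\cap\{B_1\big((I-\Sigma_z^{1/2})\mu_z,(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\big)\subset\mathcal{Z}\}\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{X}}\int_{\mathbb{R}^\kappa}\Big(l^2\,\mathrm{tr}\big((I-\Sigma_z^{1/2})^2\Sigma_z\varepsilon^2\big)\Big)\,\mathbb{I}\Big(\{\|\varepsilon\|\le 1\}\cap\{B_1\big((I-\Sigma_z^{1/2})\mu_z,(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\big)\subset\mathcal{Z}\}\Big)\,\omega_N(d\varepsilon)\,\omega(dx)\\
&\quad\ge c_1\int_{\mathcal{X}}\Big(\sum_{j\le r_i}\sigma_z^2(x;\phi_\gamma)_j\big(1-\sigma_z(x;\phi_\gamma)_j\big)^2\Big)\,\mathbb{I}\Big(B_1\big((I-\Sigma_z^{1/2})\mu_z(x),(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\big)\subset\mathcal{Z}\Big)\,\omega(dx)\\
&\quad\ge c_1\,O(g(\gamma))\int_{\mathcal{X}}\mathbb{I}\Big(B_1\big((I-\Sigma_z^{1/2})\mu_z(x),(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\big)\subset\mathcal{Z}\Big)\,\omega(dx),
\end{aligned}\tag{17}$$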

where $B_1\big((I-\Sigma_z^{1/2})\mu_z(x), (I-\Sigma_z^{1/2})\Sigma_z^{1/2}\big)$ is an $\ell_1$-norm ball and $\int_{\mathcal{M}_i}\mathbb{I}\big(B_1((I-\Sigma_z^{1/2})\mu_z(x), (I-\Sigma_z^{1/2})\Sigma_z^{1/2})\subset\mathcal{Z}\big)\omega(dx)$ could go to $\alpha_i$ when $l$ is sufficiently small. The total loss on the set $\mathcal{X}$ should be of order $\{O(g(\gamma))/\gamma + d\log\gamma - (r_i - 1)\log\gamma - \log g(\gamma) + O(1)\}\int_{\mathcal{X}}\omega(dx)/2$. Since $O(g(\gamma))/\gamma - \log g(\gamma) \ge O(1) - \log\gamma$, there is no lower order for the feasible loss for any $g(\gamma) > O(\gamma)$. However, the expression also shows that when $\int_{\mathcal{X}}\omega(dx) = \gamma/O(g(\gamma)) \to 0$, the loss difference could be absorbed by $O(1)$. In other words, the number of active dimensions for $x$ will match the manifold dimension with probability approaching 1 as $\gamma \to 0$.

Optimal rate for inactive dimensions. First, from the above analysis, the variances of active dimensions converge to 0. In contrast, the rate for inactive dimensions should be $\Theta(1)$. However, this alone is not sufficient to ensure the effectiveness of our revamping mechanism, which requires $1 - \sigma_j \to 0$ for inactive dimensions. Formula (17) implies the rate requirement of $O(\gamma)$ reconstruction loss on inactive dimensions.

Suppose that the inactive dimension $j'$ makes no contribution to the function $\mu(\cdot)$ on the set $\mathcal{Z}'$, which means $\mu(x) = \mu(x')$ for all pairs $(x, x')$ with $x, x' \in \mathcal{Z}'$ that differ only at dimension $j'$. Then the reconstruction loss is independent of $\sigma_{j'}^2$, and the remaining relevant terms in the loss function are $\sigma_{j'}^2 - \log\sigma_{j'}^2$, which is optimized at $\sigma_{j'} = 1$.

Otherwise, with some constant $C$, consider that there exists a subset $\mathcal{Z}_i' \subset \mathcal{Z}_i = \{z \mid z = (I-\Sigma_z^{1/2})z' + (I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon,\ z' = \mu_z(x'),\ x'\in\mathcal{M}_i,\ \|\varepsilon\|\le C\}$ on which an inactive dimension $j'$ contributes to the function $\mu_x(\cdot)$, i.e., $\partial\mu_x(z)/\partial z_{j'} \ne 0$. Combined with the continuity assumption, we can obtain a subset $\mathcal{Z}_i'(l') \subset \mathcal{Z}_i'$ such that $\|\mu_x(z) - \mu_x(z')\| \ge l'|z_{j'} - z'_{j'}|$ for $z, z' \in \mathcal{Z}_i'(l')$. Let $1 - \sigma_{j'} = g'(\gamma) \to 0$; we then have an analysis similar to formula (17):

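$$\begin{aligned}
&\int_{\mathcal{M}_i}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{M}_i}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\,\Big|\,\|\varepsilon\|\le C\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{M}_i}\mathbb{E}_{\varepsilon\sim N(0,I)}\Big[\big\|x-\mu_x\big((I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\big)\big\|^2\,\Big|\,\{\|\varepsilon\|\le C\}\cap\{(I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\in\mathcal{Z}_i'(l')\}\Big]\omega(dx)\\
&\quad\ge \int_{\mathcal{M}_i}\int_{\|\varepsilon\|\le C} l'^2(1-\sigma_{j'})^2\sigma_{j'}^2\varepsilon_{j'}^2\,\mathbb{I}\Big(\{(I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\in\mathcal{Z}_i'(l')\}\Big)\,\omega_N(d\varepsilon)\,\omega(dx)\\
&\quad\ge c_2\,l'^2 g'(\gamma)^2\int_{\mathcal{M}_i}\int_{\|\varepsilon\|\le C}\mathbb{I}\Big(\{(I-\Sigma_z^{1/2})\mu_z+(I-\Sigma_z^{1/2})\Sigma_z^{1/2}\varepsilon\in\mathcal{Z}_i'(l')\}\Big)\,\omega_N(d\varepsilon)\,\omega(dx).
\end{aligned}$$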

Denote the last integral term by $A_i(C, l')$; then it holds that $A_i(C, l') \to \int_{\mathcal{M}_i}\mathbb{I}(z\in\mathcal{Z}_i')\omega(dx)$ when $l'$ is sufficiently small. To make sure the reconstruction loss is $O(\gamma)$, we obtain that if $A_i(C, l') \to c > 0$ as $\gamma \to 0$, then $g'(\gamma) = O(\gamma^{1/2})$ must hold; if $A_i(C, l') \to 0$, then $g'(\gamma) = O(\gamma^{1/2})$ could be violated only with probability approaching 0. Since $\int_{\mathcal{M}_i}\mathbb{I}(z\in\mathcal{Z}_i)\omega(dx) \to \int_{\mathcal{M}_i}\omega(dx)$ with sufficiently large $C$, and $\mathcal{Z}_i\setminus\mathcal{Z}_i' \subset \mathcal{Z}'$ is the set where dimension $j'$ contributes nothing to $\mu_x(\cdot)$, we finalize the proof that $\sigma_{j'}^2 = 1 - O(\gamma)$ holds with probability approaching 1.

E.2. Proof of Corollary 4.6

The proof is related to that of Theorem 4.5, but the upper bound of the loss differs. Without the additional $(1-\sigma_z)$ factor to set inactive dimensions to zero, the feasible solution used for the VAEase model is invalid. Here we consider a simple dataset composed of a point dataset and a line dataset, with $\kappa = 1$ the dimension of the latent variables. Assume there exists a solution satisfying

$$\int_{\mathcal{M}_i}\mathbb{I}(|A(x)| = r_i)\,\omega(dx) = \alpha_i + o(1).$$

To prevent the loss function from diverging to $+\infty$, a necessary condition is $R = O(\gamma)$. For the point manifold, there should be no active dimensions. Since $\mathcal{M}_0$ contains only one point $x_0$, requiring the reconstruction error to be $O(\gamma)$ leads to $P(\|\mu_x(z) - x_0\| \ge 1) \le R = O(\gamma)$ by Chebyshev's inequality. For the line manifold, it should hold that $x = \mu_x(\mu_z(x)) + O(\sqrt{\gamma})$.

Assume the point manifold is on the line and consider the data points around $x_0$. For the compact set $\mathcal{X} := \{x \mid \|x - x_0\| \ge 1\}$, we know the preimage of $\mathcal{X}$ under the Lipschitz mapping $\mu_x$ has probability no more than $O(\gamma)$ under the distribution $N(\mu_z(x_0), 1)$. This means that the preimage $\mathcal{Z}$, as a compact set on $\mathbb{R}$, must satisfy

$$\int_{\mathcal{Z}}\exp\left(-(z-\mu_z(x_0))^2/2\right)dz = O(\gamma). \tag{18}$$

Due to the Lipschitz continuity of $\mu_z$, consider $x_0$ and $x'$ satisfying $\|x' - x_0\| = 1$; then $|\mu_z(x') - \mu_z(x_0)|$ should be bounded by some constant. The compact set $\mathcal{Z}$ should therefore be smaller than $O(\gamma)$ under the Lebesgue measure for (18) to hold. Then there exists an $x'' \in \mathcal{X}$ satisfying $\|x' - x''\| > c$ for some positive constant $c$. So we finally get $\|\mu_x(\mu_z(x')) - \mu_x(\mu_z(x''))\| \le L\|\mu_z(x') - \mu_z(x'')\|$ because the decoder mean function is Lipschitz. As a result, we obtain $L \ge (c - O(\sqrt{\gamma}))/O(\gamma)$ with probability almost 1, which means the decoder mean function is almost surely not Lipschitz. From this contradiction, we conclude that there is no solution that satisfies adaptive sparsity for the VAE on this dataset.

E.3. Proof of Theorem 4.7

Since $U$ is known, the optimization problem can be conveniently decoupled into $d$ independent one-dimensional optimization problems (this stems from the fact that the Frobenius norm is invariant to orthonormal transformations). So we just need to verify that there is a unique minimum for each one-dimensional problem. The loss function for a single point in the VAEase model is

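$$\mathcal{L}_{\text{VAEase}}(x; w,\mu_z,\sigma_z^2) = \left\{d\log(2\pi\gamma) + \|x - w(1-\sigma_z)\mu_z\|_2^2/\gamma + w^2\sigma_z^2(1-\sigma_z)^2/\gamma + \sigma_z^2 - \log\sigma_z^2 + \mu_z^2 - 1\right\}/2$$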

and the partial derivatives are

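$$\begin{aligned}
2\frac{\partial\mathcal{L}_{\text{VAEase}}(x; w,\mu_z,\sigma_z^2)}{\partial w} &= 2(1-\sigma_z)\mu_z\big(w(1-\sigma_z)\mu_z - x\big)/\gamma + 2w\sigma_z^2(1-\sigma_z)^2/\gamma, \\
2\frac{\partial\mathcal{L}_{\text{VAEase}}(x; w,\mu_z,\sigma_z^2)}{\partial\sigma_z} &= -2w\mu_z\big(w(1-\sigma_z)\mu_z - x\big)/\gamma + 2w^2\sigma_z(1-\sigma_z)^2/\gamma + 2w^2\sigma_z^2(\sigma_z-1)/\gamma + 2\sigma_z - 2/\sigma_z, \\
2\frac{\partial\mathcal{L}_{\text{VAEase}}(x; w,\mu_z,\sigma_z^2)}{\partial\mu_z} &= 2w(1-\sigma_z)\big(w(1-\sigma_z)\mu_z - x\big)/\gamma + 2\mu_z.
\end{aligned}$$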

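From the KKT conditions, we can obtain

$$\begin{aligned}
w &= \frac{\mu_z x}{(1-\sigma_z)(\sigma_z^2+\mu_z^2)}, \\
\frac{w^2(1-\sigma_z)}{\gamma}\left(\sigma_z(1-2\sigma_z) - \mu_z^2\right) + \frac{w\mu_z x}{\gamma} &= \sigma_z^{-1} - \sigma_z, \\
\mu_z &= \frac{w(1-\sigma_z)x}{w^2(1-\sigma_z)^2 + \gamma}.
\end{aligned}$$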

Substituting the first equation into the last equation, we obtain

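$$\begin{aligned}
\mu_z &= \frac{\dfrac{\mu_z x}{(1-\sigma_z)(\sigma_z^2+\mu_z^2)}(1-\sigma_z)x}{\dfrac{\mu_z^2x^2}{(1-\sigma_z)^2(\sigma_z^2+\mu_z^2)^2}(1-\sigma_z)^2+\gamma}, \\
\frac{\mu_z^2x^2}{(\sigma_z^2+\mu_z^2)^2} + \gamma &= \frac{x^2}{\sigma_z^2+\mu_z^2}, \\
\mu_z^2x^2 + \gamma(\sigma_z^2+\mu_z^2)^2 &= x^2(\sigma_z^2+\mu_z^2), \\
\gamma(\sigma_z^2+\mu_z^2)^2 &= x^2\sigma_z^2.
\end{aligned}$$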

Finally, we compute the second equation as

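$$\begin{aligned}
\frac{\mu_z^2x^2}{(1-\sigma_z)^2(\sigma_z^2+\mu_z^2)^2}(1-\sigma_z)\left(\sigma_z(1-2\sigma_z)-\mu_z^2\right) + \frac{\mu_zx}{(1-\sigma_z)(\sigma_z^2+\mu_z^2)}\mu_zx &= \gamma(\sigma_z^{-1}-\sigma_z), \\
\frac{\mu_z^2x^2}{(\sigma_z^2+\mu_z^2)^2}\left(\sigma_z(1-2\sigma_z)-\mu_z^2\right) + \frac{\mu_z^2x^2}{\sigma_z^2+\mu_z^2} &= \gamma(\sigma_z^{-1}-\sigma_z)(1-\sigma_z), \\
\frac{\mu_z^2x^2}{(\sigma_z^2+\mu_z^2)^2}\left(\sigma_z(1-2\sigma_z)-\mu_z^2+\sigma_z^2+\mu_z^2\right) &= \gamma(\sigma_z^{-1}-\sigma_z)(1-\sigma_z), \\
\frac{\mu_z^2x^2}{(\sigma_z^2+\mu_z^2)^2} &= \gamma(\sigma_z^{-2}-1).
\end{aligned}$$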

Using $(\sigma_z^2+\mu_z^2)^2 = x^2\sigma_z^2/\gamma$, we have

$$\frac{\mu_z^2x^2}{x^2\sigma_z^2/\gamma} = \gamma(\sigma_z^{-2}-1), \qquad \mu_z^2 = 1 - \sigma_z^2.$$

Substituting this expression back into the above result, we get $\gamma = x^2\sigma_z^2$ and $\sigma_z^2 = \gamma/x^2$. At this point, we have calculated the unique minimum point $(\mu_z^*, \sigma_z^*) = \left(\sqrt{1-\gamma/x^2}, \sqrt{\gamma/x^2}\right)$ for the one-dimensional problem. The multi-dimensional conclusion then naturally follows.
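As a numerical sanity check of the algebra, the sketch below evaluates the one-dimensional VAEase loss at the closed-form point and confirms by central finite differences that all three partial derivatives vanish there. The values $x = 2$ and $\gamma = 0.1$ are arbitrary illustrative choices (with $x^2 > \gamma$ assumed), and the function names are ours.

```python
import math

def vaease_loss_1d(x, w, mu, sigma, gamma, d=1):
    """One-dimensional VAEase loss for a single point, as written above."""
    recon = (x - w * (1 - sigma) * mu) ** 2 / gamma
    noise = w ** 2 * sigma ** 2 * (1 - sigma) ** 2 / gamma
    kl = sigma ** 2 - math.log(sigma ** 2) + mu ** 2 - 1
    return 0.5 * (d * math.log(2 * math.pi * gamma) + recon + noise + kl)

def closed_form_point(x, gamma):
    """Stationary point (w, mu_z, sigma_z) from the derivation; needs x^2 > gamma."""
    sigma = math.sqrt(gamma / x ** 2)
    mu = math.sqrt(1 - sigma ** 2)
    w = mu * x / ((1 - sigma) * (sigma ** 2 + mu ** 2))
    return w, mu, sigma

def numerical_gradient(x, gamma, point, eps=1e-6):
    """Central finite differences with respect to (w, mu_z, sigma_z)."""
    grads = []
    for k in range(3):
        hi, lo = list(point), list(point)
        hi[k] += eps
        lo[k] -= eps
        diff = vaease_loss_1d(x, *hi, gamma) - vaease_loss_1d(x, *lo, gamma)
        grads.append(diff / (2 * eps))
    return grads

x, gamma = 2.0, 0.1
point = closed_form_point(x, gamma)
grads = numerical_gradient(x, gamma, point)
```

This only verifies stationarity at the stated point; it does not by itself establish uniqueness, which rests on the argument above.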

E.4. Proof of Corollary 4.8

Following the same approach as in the previous proof, we only need to demonstrate that a one-dimensional problem has two local minima. The loss function for a single point in the SAE model is

$$\mathcal{L}_{\text{SAE}}(x; w, z) = (x - wz)^2 + \lambda_1 h(z) + \lambda_2 w^2,$$

and the partial derivative with respect to $w$ is

$$\frac{\partial\mathcal{L}_{\text{SAE}}(x; w, z)}{\partial w} = 2z(wz - x) + 2\lambda_2 w. \tag{19}$$

From the KKT condition, we can solve $w = zx/(\lambda_2 + z^2)$. Then the loss function without $w$ is

$$\mathcal{L}_{\text{SAE}}(x; z) = \frac{\lambda_2x^2}{\lambda_2 + z^2} + \lambda_1 h(z).$$

Then for $z > 0$, we have

$$\frac{d\mathcal{L}_{\text{SAE}}(x; z)}{dz} = -\frac{2\lambda_2x^2z}{(\lambda_2 + z^2)^2} + \lambda_1 h'(z).$$

The function $h(z)$ is concave and non-decreasing when $z \ge 0$, so $h'(z)$ is non-increasing and non-negative.

Consider the function $m(z) = z/(\lambda_2 + z^2)^2$. We have $m(0) = 0$, $m(\sqrt{\lambda_2/3}) = 3\sqrt{3\lambda_2}/(16\lambda_2^2)$ is a maximum, and $m(z) \to 0$ as $z \to \infty$. To obtain two local minima, we need $h'(z)$ to have two intersection points with $2\lambda_2x^2z/\{\lambda_1(\lambda_2 + z^2)^2\}$.

If $h'(z) = 0$ when $z > 0$, then there is a local minimum at $z = \infty$ since $m(z)$ is decreasing when $z > \sqrt{\lambda_2/3}$. Because $h(|z|)$ is concave, we know that $h(0) \le h(1)$, leading to another local minimum at $z = 0$, while $z = 0$ is also the local minimum of $m(z)$.

If $\lim_{z\to 0} h'(z) > 0$, then $z = 0$ is a local minimum for any $x$. Then we just need $3\sqrt{3\lambda_2}\,x^2/(8\lambda_2) > \lambda_1 h'(\sqrt{\lambda_2/3})$, leading to an interval $[\sqrt{\lambda_2/3}, \sqrt{\lambda_2/3} + c]$ where $\mathcal{L}_{\text{SAE}}(x; z)$ is decreasing for some positive $c$. In other words, there must be a local minimum satisfying $z > \sqrt{\lambda_2/3}$.
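The two-minima behavior can be observed numerically. The sketch below scans the profiled loss on a grid for $h(z) = \log(|z| + \epsilon)$ with $\lambda_1 = \lambda_2 = 0.1$ and $\epsilon = 10^{-4}$, the settings of Appendix D.1. The input value $x = 2$, the grid range, and the function names are illustrative assumptions.

```python
import math

def profiled_sae_loss(z, x, lam1, lam2, eps=1e-4):
    """L_SAE(x; z) with w optimized out and h(z) = log(|z| + eps)."""
    return lam2 * x ** 2 / (lam2 + z ** 2) + lam1 * math.log(abs(z) + eps)

def grid_local_minima(x, lam1, lam2, z_max=10.0, step=1e-3):
    """Local minima of the profiled loss on a uniform grid over [0, z_max]."""
    n = int(round(z_max / step))
    zs = [k * step for k in range(n + 1)]
    vals = [profiled_sae_loss(z, x, lam1, lam2) for z in zs]
    minima = []
    if vals[0] < vals[1]:
        minima.append(zs[0])   # minimum at z = 0 (the loss is even in z)
    for k in range(1, n):
        if vals[k] < vals[k - 1] and vals[k] < vals[k + 1]:
            minima.append(zs[k])
    return minima

lam1 = lam2 = 0.1
minima = grid_local_minima(x=2.0, lam1=lam1, lam2=lam2)
threshold = math.sqrt(lam2 / 3.0)
```

For this input the scan finds one minimum at $z = 0$ and a second one beyond $\sqrt{\lambda_2/3}$, consistent with the argument above. A grid scan is only a rough diagnostic and can miss minima narrower than the step size.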

E.5. Linear VAE on Composed Linear Dataset

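For simplicity, we omit the subindex in $\mathcal{L}_{\text{LVAE}}(\cdot)$ and write its loss function as

$$\mathcal{L}(\phi,\theta,\gamma) = \int_{\mathcal{X}}\left\{-\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)\right\}\omega(dx), \tag{20}$$

where $q_\phi(z|x) = N(z|\mu_z, \Sigma_z)$ with $\mu_z \in \mathbb{R}^\kappa$, $p_\theta(x|z) = N(x|\mu_x, \gamma I)$ with $\mu_x \in \mathbb{R}^d$, and $p(z) = N(0, I)$. Here $\Sigma_z \overset{\text{def}}{=} \mathrm{diag}\{\sigma_z^2(x;\phi)\}$ is a diagonal matrix.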

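Recall that the decoder here is linear, i.e., $\hat{x} = \mu_x = Wz$. Expanding the integrand gives

$$\begin{aligned}
\mathcal{L}(x;\phi,\theta,\gamma) &\triangleq -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right) \\
&= \frac{d}{2}\log(2\pi\gamma) + \mathbb{E}_{q_\phi(z|x)}\left[\frac{\|x-\mu_x\|_2^2}{2\gamma}\right] + \frac{1}{2}\left\{\mathrm{tr}(\Sigma_z) - \log|\Sigma_z| + \mu_z^\top\mu_z - \kappa\right\} \\
&= \frac{d}{2}\log(2\pi\gamma) + \mathbb{E}_{q_\phi(z|x)}\left[\frac{(x-Wz)^\top(x-Wz)}{2\gamma}\right] + \frac{1}{2}\left\{\mathrm{tr}(\Sigma_z) - \log|\Sigma_z| + \mu_z^\top\mu_z - \kappa\right\} \\
&= \frac{d}{2}\log(2\pi\gamma) + \frac{1}{2\gamma}\left\{\mathrm{tr}(W\Sigma_zW^\top) + (W\mu_z - x)^\top(W\mu_z - x)\right\} + \frac{1}{2}\left\{\mathrm{tr}(\Sigma_z) - \log|\Sigma_z| + \mu_z^\top\mu_z - \kappa\right\}.
\end{aligned}$$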

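The gradient of $\mathcal{L}(x)$ takes the form

$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\mu_z} &= \frac{1}{\gamma}W^\top(W\mu_z - x) + \mu_z, \\
\frac{\partial\mathcal{L}}{\partial\Sigma_z} &= \frac{1}{2\gamma}W^\top W + \frac{1}{2}\left(I - \Sigma_z^{-1}\right), \\
\frac{\partial\mathcal{L}}{\partial W^\top} &= \frac{1}{\gamma}\left\{\Sigma_zW^\top + \mu_z\mu_z^\top W^\top - \mu_zx^\top\right\}.
\end{aligned}$$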

Note that $\mu_z$ and $\Sigma_z^{1/2}$ are varying functions of $x$; the KKT conditions then provide the necessary conditions for an optimal solution as

$$(W^\top W + \gamma I)\mu_z = W^\top x, \qquad \Sigma_z^{-1} = W^\top W/\gamma + I,$$

$$\int_{\mathcal{X}}(\Sigma_z + \mu_z\mu_z^\top)W^\top\omega(dx) = \int_{\mathcal{X}}\mu_zx^\top\omega(dx). \tag{21}$$

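Here, the first two equations hold for any $x$, and the last equation holds in integral form since $W$ does not vary with $x$. For convenience, let $(\gamma I + W^\top W)^{-1} = A$; then $\Sigma_z = \gamma A$ and $\mu_z = AW^\top x$. Then (21) can be transformed as

$$\begin{aligned}
\int_{\mathcal{X}}(\Sigma_z + \mu_z\mu_z^\top)W^\top\omega(dx) &= \int_{\mathcal{X}}\mu_zx^\top\omega(dx), \\
\int_{\mathcal{X}}(\Sigma_z + \mu_z\mu_z^\top)W^\top WA\,\omega(dx) &= \int_{\mathcal{X}}\mu_zx^\top WA\,\omega(dx), \\
\int_{\mathcal{X}}(\Sigma_z + \mu_z\mu_z^\top)W^\top W\omega(dx) &= \int_{\mathcal{X}}\mu_z\mu_z^\top\omega(dx)\,(\gamma I + W^\top W), \\
\Sigma_zW^\top W &= \gamma\int_{\mathcal{X}}\mu_z\mu_z^\top\omega(dx), \\
\gamma AW^\top W &= \gamma\int_{\mathcal{X}}AW^\top xx^\top WA\,\omega(dx), \\
W^\top W(\gamma I + W^\top W) &= W^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,W, \\
W^\top(\gamma I + WW^\top)W &= W^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,W.
\end{aligned}\tag{22}$$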

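Denote the singular value decomposition of $W$ as $W = USV^\top$ with $U \in \mathbb{R}^{d\times\kappa}$, $S \in \mathbb{R}^{\kappa\times\kappa}$ a diagonal matrix, and $V \in \mathbb{R}^{\kappa\times\kappa}$. Then (22) can be simplified as

$$VSU^\top(\gamma I + US^2U^\top)USV^\top = VSU^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,USV^\top.$$

Note that $V^\top V = I_\kappa$ and $U^\top U = I_d$; then we have

$$\gamma S^2 + S^4 = SU^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,US. \tag{23}$$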

Notice that the left side is a diagonal matrix; thus $U^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,U \triangleq \Sigma_x^2$ should also be diagonal. Since $x$ lies in a composed linear space, the rank of $\int xx^\top\omega(dx)$ must be the dimension of the smallest linear space covering all samples.

Denote this dimension by $r$, and assume the first $r$ diagonal elements of $\Sigma_x^2$ are nonzero. Finally, we obtain that $S_{ii} = \sqrt{\sigma_{x_i}^2 - \gamma}$ when $i \le r$ and $S_{ii} = 0$ when $i > r$, where $S_{ii}$ is the $i$-th diagonal element of $S$ and $\sigma_{x_i}^2$ is shorthand for $\Sigma_{x,ii}^2$. Now the interim matrix $A$ is $V\mathrm{diag}\{\sigma_{x_1}^{-2}, \ldots, \sigma_{x_r}^{-2}, \gamma^{-1}, \ldots, \gamma^{-1}\}V^\top$, and $\Sigma_z = V\mathrm{diag}\{\gamma\sigma_{x_1}^{-2}, \ldots, \gamma\sigma_{x_r}^{-2}, 1, \ldots, 1\}V^\top$. So the number of active dimensions is $r$.

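To compute the loss function, we also need

$$\begin{aligned}
\int_{\mathcal{X}}\|W\mu_z - x\|_2^2\,\omega(dx) &= \int_{\mathcal{X}}\|(WAW^\top - I)x\|_2^2\,\omega(dx) \\
&= \int_{\mathcal{X}}\left\|U\mathrm{diag}\{-\gamma\sigma_{x_1}^{-2}, \ldots, -\gamma\sigma_{x_r}^{-2}, -1, \ldots, -1\}U^\top x\right\|_2^2\omega(dx) \\
&= \mathrm{tr}\left(\mathrm{diag}\{\gamma^2\sigma_{x_1}^{-4}, \ldots, \gamma^2\sigma_{x_r}^{-4}, 1, \ldots, 1\}\,U^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,U\right) \\
&= \gamma^2\sum_{i\le r}\sigma_{x_i}^{-2},
\end{aligned}$$

$$\begin{aligned}
\int_{\mathcal{X}}\|\mu_z\|_2^2\,\omega(dx) &= \int_{\mathcal{X}}\|AW^\top x\|_2^2\,\omega(dx) = \int_{\mathcal{X}}x^\top WA^2W^\top x\,\omega(dx) \\
&= \int_{\mathcal{X}}x^\top U\mathrm{diag}\{(\sigma_{x_1}^2-\gamma)/\sigma_{x_1}^4, \ldots, (\sigma_{x_r}^2-\gamma)/\sigma_{x_r}^4, 0, \ldots, 0\}U^\top x\,\omega(dx) \\
&= \mathrm{tr}\left(\mathrm{diag}\{(\sigma_{x_1}^2-\gamma)/\sigma_{x_1}^4, \ldots, (\sigma_{x_r}^2-\gamma)/\sigma_{x_r}^4, 0, \ldots, 0\}\,U^\top\int_{\mathcal{X}}xx^\top\omega(dx)\,U\right) \\
&= \sum_{i\le r}(\sigma_{x_i}^2-\gamma)/\sigma_{x_i}^2.
\end{aligned}$$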

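With these results, the minimum of the energy is

$$\begin{aligned}
\mathcal{L}(\phi_\gamma^*, \theta_\gamma^*, \gamma) &= \int_{\mathcal{X}}\Bigg[\frac{d}{2}\log(2\pi\gamma) + \frac{1}{2\gamma}\Big\{\gamma\sum_{i\le r}(\sigma_{x_i}^2-\gamma)/\sigma_{x_i}^2 + \|W\mu_z - x\|_2^2\Big\} \\
&\qquad\quad + \frac{1}{2}\Big\{\sum_{i\le r}\gamma/\sigma_{x_i}^2 + \kappa - r - \sum_{i\le r}(\log\gamma - \log\sigma_{x_i}^2) + \|\mu_z\|_2^2 - \kappa\Big\}\Bigg]\omega(dx) \\
&= \frac{d}{2}\log(2\pi\gamma) + \frac{1}{2}\sum_{i\le r}(\sigma_{x_i}^2-\gamma)/\sigma_{x_i}^2 + \frac{\gamma}{2}\sum_{i\le r}\sigma_{x_i}^{-2} \\
&\qquad + \frac{1}{2}\Big\{\gamma\sum_{i\le r}\sigma_{x_i}^{-2} - r - r\log\gamma + \sum_{i\le r}\log\sigma_{x_i}^2 + r - \gamma\sum_{i\le r}\sigma_{x_i}^{-2}\Big\} \\
&= \frac{d}{2}\log(2\pi\gamma) - \frac{r}{2}\log\gamma + \frac{1}{2}\sum_{i\le r}\log\sigma_{x_i}^2 + \frac{1}{2}r.
\end{aligned}$$

As $\gamma \to 0$, the rate of the minimum energy is $(d - r)/2\,\log\gamma + O(1)$.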
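The closed-form solution can be checked in the simplest case $d = \kappa = r = 1$. The sketch below plugs the optimal $\mu_z = AW^\top x$ and $\Sigma_z = \gamma A$ into the expanded loss for a scalar decoder weight $w$, then minimizes over $w$ by grid search. The second moment `var_x` $= \int x^2\omega(dx) = 4$, $\gamma = 0.1$, the grid settings, and the function names are illustrative assumptions.

```python
import math

def linear_vae_energy_1d(w, var_x, gamma):
    """Expected energy for d = kappa = 1, with mu_z and Sigma_z optimal given w."""
    a = 1.0 / (gamma + w * w)                 # A = (gamma + w^2)^{-1}
    sigma2 = gamma * a                        # Sigma_z = gamma * A
    recon = (w * a * w - 1.0) ** 2 * var_x    # integral of ||W mu_z - x||^2
    mu_sq = a * a * w * w * var_x             # integral of ||mu_z||^2
    data_term = 0.5 * math.log(2 * math.pi * gamma) + (w * w * sigma2 + recon) / (2 * gamma)
    kl_term = 0.5 * (sigma2 - math.log(sigma2) + mu_sq - 1.0)
    return data_term + kl_term

def grid_argmin(var_x, gamma, w_max=5.0, step=1e-4):
    """Brute-force minimizer over w in [0, w_max]."""
    best_w, best_val = 0.0, float("inf")
    for k in range(int(round(w_max / step)) + 1):
        w = k * step
        val = linear_vae_energy_1d(w, var_x, gamma)
        if val < best_val:
            best_w, best_val = w, val
    return best_w, best_val

var_x, gamma = 4.0, 0.1
w_hat, energy_hat = grid_argmin(var_x, gamma)

# Closed-form predictions: S = sqrt(sigma_x^2 - gamma), and the minimum energy
# d/2 log(2 pi gamma) - r/2 log(gamma) + 1/2 log(sigma_x^2) + r/2 with d = r = 1.
w_closed = math.sqrt(var_x - gamma)
energy_closed = (0.5 * math.log(2 * math.pi * gamma) - 0.5 * math.log(gamma)
                 + 0.5 * math.log(var_x) + 0.5)
```

In this scalar case $d = r$, so the $\log\gamma$ terms cancel and the minimum energy stays $O(1)$ as $\gamma \to 0$, in agreement with the stated rate $(d - r)/2\,\log\gamma + O(1)$.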