# Principal Bit Analysis: Autoencoding with Schur-Concave Loss

Sourbh Bhadane¹, Aaron B. Wagner¹, Jayadev Acharya¹

¹Cornell University. Correspondence to: Sourbh Bhadane. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We consider a linear autoencoder in which the latent variables are quantized, or corrupted by noise, and the constraint is Schur-concave in the set of latent variances. Although finding the optimal encoder/decoder pair for this setup is a nonconvex optimization problem, we show that decomposing the source into its principal components is optimal. If the constraint is strictly Schur-concave and the empirical covariance matrix has only simple eigenvalues, then any optimal encoder/decoder must decompose the source in this way. As one application, we consider a strictly Schur-concave constraint that estimates the number of bits needed to represent the latent variables under fixed-rate encoding, a setup that we call Principal Bit Analysis (PBA). This yields a practical, general-purpose, fixed-rate compressor that outperforms existing algorithms. As a second application, we show that a prototypical autoencoder-based variable-rate compressor is guaranteed to decompose the source into its principal components.

1. Introduction

Autoencoders are an effective method for representation learning and dimensionality reduction. Given a centered dataset $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ (i.e., $\sum_i x_i = 0$), an autoencoder (with latent dimension $k \le d$) consists of an encoder $f : \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $g : \mathbb{R}^k \to \mathbb{R}^d$. The goal is to select $f$ and $g$ from prespecified classes $\mathcal{C}_f$ and $\mathcal{C}_g$ such that if a random point $x$ is picked from the dataset, then $g(f(x))$ is close to $x$ in some sense, for example in mean squared error. If $\mathcal{C}_f$ and $\mathcal{C}_g$ consist of linear mappings, then the autoencoder is called a linear autoencoder. Autoencoders have achieved striking successes when $f$ and $g$ are selected through training from the class of functions realized by multilayer perceptrons of a given architecture (Hinton & Salakhutdinov, 2006).

Yet, the canonical autoencoder formulation described above has a notable failing, namely that for linear autoencoders, optimal choices of $f$ and $g$ do not necessarily identify the principal components of the dataset; they merely identify the principal subspace (Bourlard & Kamp, 1988; Baldi & Hornik, 1989). That is, the components of $f(x)$ are not necessarily proportional to projections of $x$ against the eigenvectors of the covariance matrix

$$ K = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top, \qquad (1) $$

which we assume without loss of generality is full rank. Thus, linear autoencoders do not recover Principal Component Analysis (PCA). The reason for this is that both the objective (the distortion) and the constraint (the dimensionality of the latents) are invariant to an invertible transformation applied after the encoder with its inverse applied before the decoder. It is desirable for linear autoencoders to recover PCA for two reasons. First, from a representation-learning standpoint, it guarantees that the autoencoder recovers uncorrelated features. Second, since a conventional linear autoencoder has a large number of globally optimal solutions corresponding to different bases of the principal subspace, it is preferable to eliminate this indeterminism.
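To make this invariance concrete, here is a small numerical illustration (a sketch with synthetic data; the dimensions, scales, and random seed are arbitrary choices, not taken from the paper): an encoder built from the top-$k$ eigenvectors $U_k$ of $K$ and one built from a rotated basis $U_k V$ of the same principal subspace achieve exactly the same reconstruction error, so the usual objective cannot distinguish them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 5, 2

# Centered synthetic data and its empirical covariance K = (1/n) sum_i x_i x_i^T.
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)
K = X.T @ X / n

# Top-k principal directions (columns of Uk) and a rotated basis Uk @ V of the same subspace.
eigvals, U = np.linalg.eigh(K)
Uk = U[:, ::-1][:, :k]
V, _ = np.linalg.qr(rng.standard_normal((k, k)))   # arbitrary k-by-k orthogonal matrix

def recon_mse(B):
    # Encoder x -> B^T x, decoder y -> B y; mean squared reconstruction error over the dataset.
    R = X @ B @ B.T
    return np.mean(np.sum((X - R) ** 2, axis=1))

print(recon_mse(Uk), recon_mse(Uk @ V))   # identical: only the principal subspace is identified
```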
Autoencoders are sometimes described as compressing the data (Bishop, 2006; Bourlard & Kamp, 1988; Liao et al., 2021; do Espírito Santo, 2012), even though $f$ can be invertible even when $k < d$. We show that by embracing this compression view, one can obtain autoencoders that are able to recover PCA. Specifically, we consider linear autoencoders with quantized (or, equivalently, noisy) latent variables and a constraint on the estimated number of bits required to transmit the quantized latents under fixed-rate coding. We call this problem Principal Bit Analysis (PBA). The constraint turns out to be a strictly Schur-concave function of the set of variances of the latent variables (see the supplementary for a review of Schur-concavity). Although finding the optimal $f$ and $g$ for this loss function is a nonconvex optimization problem, we show that for any strictly Schur-concave loss function, an optimal $f$ must send projections of the data along the principal components, assuming that the empirical covariance matrix of the data has only simple eigenvalues. That is, imposing a strictly Schur-concave loss in place of a simple dimensionality constraint suffices to ensure recovery of PCA. The idea is that the strict concavity of the loss function eliminates the rotational invariance described above. As we show, even a slight amount of curvature in the constraint forces the autoencoder to spread the variances of the latents out as much as possible, resulting in recovery of PCA. If the loss function is merely Schur-concave, then projecting along the principal components is optimal, but not necessarily uniquely so.

Using this theorem, we can efficiently solve PBA. We validate the solution experimentally by using it to construct a fixed-rate compression algorithm for arbitrary vector-valued data sources. We find that the PBA-derived compressor beats existing linear, fixed-rate compressors both in terms of mean squared error, for which it is optimized, and in terms of the structural similarity index measure (SSIM) and downstream classification accuracy, for which it is not.

A number of variable-rate multimedia compressors have recently been proposed that are either related to, or directly inspired by, autoencoders (Tschannen et al., 2018; Toderici et al., 2017; Ballé et al., 2016; Toderici et al., 2016; Theis et al., 2017; Rippel & Bourdev, 2017; Habibian et al., 2019; Agustsson et al., 2017; Ballé et al., 2018; Zhou et al., 2018; Agustsson et al., 2019; Ballé et al., 2021). As a second application of our result, we show that for Gaussian sources, a linear form of such a compressor is guaranteed to recover PCA. Thus we show that ideas from compression can be fruitfully fed back into the original autoencoder problem.

The contributions of the paper are:

- We propose a novel linear autoencoder formulation in which the constraint is Schur-concave. We show that this generalizes conventional linear autoencoding.
- If the constraint is strictly Schur-concave and the covariance matrix of the data has only simple eigenvalues, then we show that the autoencoder provably recovers PCA, providing a new remedy for a known limitation of linear autoencoders.
- We use the new linear autoencoder formulation to efficiently solve a fixed-rate compression problem that we call Principal Bit Analysis (PBA). We demonstrate experimentally that PBA outperforms existing fixed-rate compressors on a variety of data sets and metrics.
- We show that a linear, variable-rate compressor that is representative of many autoencoder-based compressors in the literature effectively has a strictly Schur-concave loss, and therefore it recovers PCA.

Related Work. Several recent works have examined how linear autoencoders can be modified to guarantee recovery of PCA. Most solutions involve eliminating the invariant globally optimal solutions by introducing regularization of some kind. Oftadeh et al. (2020) propose a loss function which adds $k$ penalties to recover the $k$ principal directions, each corresponding to recovering up to the first $i \le k$ principal directions. Kunin et al. (2019) show that $\ell_2$ regularization helps reduce the symmetry group to the orthogonal group. Bao et al. (2020) further break the symmetry by considering non-uniform $\ell_2$ regularization and deterministic dropout. Ladjal et al. (2019) consider a nonlinear autoencoder with a covariance loss term to encourage finding orthogonal directions. Recovering PCA is an important problem even in the stochastic counterpart of autoencoders. Lucas et al. (2019) analyze linear variational autoencoders (VAEs) and show that the global optimum of their objective is identical to the global optimum of the log marginal likelihood of probabilistic PCA (pPCA). Rolínek et al. (2019) analyze an approximation to the VAE loss function and show that the linear approximation to the decoder is orthogonal.

Our result on variable-rate compressors is connected to the sizable recent literature on compression using autoencoder-like architectures. Representative contributions to the literature were noted above. Those works focus mostly on the empirical performance of deep, nonlinear networks, with a particular emphasis on finding a differentiable proxy for quantization so as to train with stochastic gradient descent. In contrast, this work considers provable properties of the compressors when trained perfectly. Learned, neural fixed-rate compressors have been considered in (Li et al., 2018; Toderici et al., 2016). However, we don't compare against these since ours is a linear scheme.

Notation. We denote matrices by bold capital letters, e.g., $\mathbf{M}$, and vectors by bold lowercase letters, e.g., $\mathbf{v}$. The $j$th column of a matrix $M$ is denoted by $m_j$ and the $j$th entry of a vector $v$ by $[v]_j$. We denote the set $\{1, 2, \ldots, d\}$ by $[d]$. A sequence $a_1, a_2, \ldots, a_n$ is denoted by $\{a_i\}_{i=1}^{n}$. We denote the zero column by $\mathbf{0}$. Logarithms without specified bases denote natural logarithms.

Organization. The balance of the paper is organized as follows. We describe our constrained linear autoencoder framework in Section 2. This results in an optimization problem that we solve for any Schur-concave constraint in Section 2.1. In Section 3, we recover linear autoencoders and PBA under our framework. We apply the PBA solution to a problem in variable-rate compression of Gaussian sources in Section 4. Section 5 contains experiments comparing the performance of the PBA-based fixed-rate compressor against existing fixed-rate linear compressors on image and audio datasets. Complete proofs of all theorems can be found in the supplementary.

2. Linear Autoencoding with a Schur-Concave Constraint

Throughout this paper we consider $\mathcal{C}_f$ and $\mathcal{C}_g$ to be the class of linear functions. The functions $f$ and $g$ can then be represented by $d$-by-$d$ matrices, which we denote by $W$ and $T$, respectively. Thus, we have

$$ f(x) = W^\top x, \qquad g(x) = T x. \qquad (2) $$
We wish to design $W$ and $T$ to minimize the mean squared error when the latent variables $W^\top x$ are quantized, subject to a constraint on the number of bits needed to represent the quantized latents. We accomplish this via two modifications to the canonical autoencoder. First, we perturb the $d$ latent variables with zero-mean additive noise with covariance matrix $\sigma^2 I$, which we denote by $\varepsilon$. Thus, the input to the decoder is

$$ W^\top x + \varepsilon \qquad (3) $$

and our objective is to minimize the mean squared error

$$ \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\varepsilon\!\left[ \big\| x_i - T\big(W^\top x_i + \varepsilon\big) \big\|^2 \right]. \qquad (4) $$

This is equivalent to quantizing the latents, in the following sense. Let $Q(\cdot)$ be the function that maps any real number to its nearest integer and $\varepsilon$ be a random variable uniformly distributed over $[-1/2, 1/2]$. Then for $X$ independent of $\varepsilon$, the quantities $Q(X + \varepsilon) - \varepsilon$ and $X + \varepsilon$ have the same distribution (Zamir & Feder, 1992). Thus (4) is exactly the mean squared error if the latents are quantized to the nearest integer and $\sigma^2 = 1/12$, assuming that the quantization is dithered. The overall system is depicted in Fig. 1.

Second, we wish to constrain the number of bits needed to describe the latent variables. We assume that the $j$th quantized latent is clipped to the interval

$$ \left[ -\tfrac{1}{2}\sqrt{(2a)^2 w_j^\top K w_j + 1},\ \ \tfrac{1}{2}\sqrt{(2a)^2 w_j^\top K w_j + 1} \right], $$

where $a > 0$ is a hyperparameter and the covariance matrix $K$ is as defined in (1). The idea is that for sufficiently large $a$, the interval $\big[-a\sqrt{w_j^\top K w_j},\ a\sqrt{w_j^\top K w_j}\big]$ contains the latent with high probability, and adding 1 accounts for the expansion due to the dither. The number of bits needed for the $j$th latent is then

$$ \log\sqrt{4a^2 w_j^\top K w_j + 1} = \tfrac{1}{2}\log\!\big(4a^2 w_j^\top K w_j + 1\big). $$

We arrive at our optimization problem:

$$ \inf_{W, T}\ \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\varepsilon\!\left[ \big\| x_i - T\big(W^\top x_i + \varepsilon\big) \big\|^2 \right] \quad \text{subject to} \quad R \ge \sum_{i=1}^{d} \tfrac{1}{2}\log\!\big(4a^2 w_i^\top K w_i + 1\big). \qquad (5) $$

Note that the function $\{w_j^\top K w_j\}_{j=1}^{d} \mapsto \sum_{i=1}^{d} \tfrac{1}{2}\log\!\big(4a^2 w_i^\top K w_i + 1\big)$ is strictly Schur-concave (see the supplementary for a brief review of Schur-concavity). Our first result only requires that the constraint is Schur-concave in the set of latent variances, so we will consider the more general problem

$$ \inf_{W, T}\ \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\varepsilon\!\left[ \big\| x_i - T\big(W^\top x_i + \varepsilon\big) \big\|^2 \right] \quad \text{subject to} \quad R \ge \rho\big(\{w_j^\top K w_j\}_{j=1}^{d}\big), \qquad (6) $$

where $\rho(\cdot)$ is any Schur-concave function. Since $T$ does not appear in the rate constraint, the optimal $T$ can be viewed as the Linear Least Squares Estimate (LLSE) of a random $x$ given $W^\top x + \varepsilon$. Therefore, the optimal decoder $T^*$ for a given encoder $W$ is (e.g., (Kay, 1998)):

$$ T^* = K W \big( W^\top K W + \sigma^2 I \big)^{-1}. \qquad (7) $$

Substituting for $T^*$ in (6) yields an optimization problem over only $W$. Using standard linear algebra, we rewrite the objective in terms of $K$ and $W$ as

$$ \inf_{W}\ \operatorname{tr}(K) - \operatorname{tr}\!\Big( K W \big( W^\top K W + \sigma^2 I \big)^{-1} W^\top K \Big) \quad \text{subject to} \quad R \ge \rho\big(\{w_j^\top K w_j\}_{j=1}^{d}\big). \qquad (8) $$

This problem is nonconvex in general. In the following subsection, we prove a structural result about the problem for a Schur-concave $\rho$. Namely, we show that the nonzero columns of $W$ must be eigenvectors of $K$. In Section 3, we solve the problem for the specific choice of $\rho$ in (5). We also show how this generalizes conventional autoencoders.
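Returning to the additive-noise model above: the equivalence between dithered integer quantization and additive uniform noise is easy to check numerically. The following is a minimal sketch with made-up inputs (the sample size and source scale are arbitrary), comparing a few quantiles of $Q(X+\varepsilon)-\varepsilon$ and $X+\varepsilon$ and the empirical dither variance against $1/12$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = 3.0 * rng.standard_normal(n)          # arbitrary scalar source samples
eps = rng.uniform(-0.5, 0.5, size=n)      # dither, Unif[-1/2, 1/2]

print(np.var(eps))                        # ~ 1/12, the modeled noise variance sigma^2

# Zamir & Feder (1992): Q(X + eps) - eps has the same distribution as X + eps.
dithered = np.round(x + eps) - eps        # quantize to nearest integer, then subtract dither
additive = x + eps
q = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(dithered, q))
print(np.quantile(additive, q))           # quantiles agree up to Monte Carlo error
```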
Figure 1. Compression block diagram (linear encoder $W$, additive dither/noise $\varepsilon$, linear decoder $T$).

2.1. Optimal Autoencoding with a Schur-Concave Constraint

The following is the main theoretical result of the paper.

Theorem 1. For Schur-concave $\rho : \mathbb{R}_{\ge 0}^{d} \to \mathbb{R}_{\ge 0}$ and $R > 0$, the set of matrices whose nonzero columns are eigenvectors of the covariance matrix $K$ is optimal for (8). If $\rho$ is strictly Schur-concave and $K$ contains distinct eigenvalues, this set contains all optimal solutions of (8).

Proof Sketch. Let the eigenvalues of $K$ be $\{\sigma_i^2\}_{i=1}^{d}$ with $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_d^2$, and let $K = U \Sigma U^\top$, where $\Sigma$ is a diagonal matrix with nonincreasing diagonal entries and $U$ is an orthogonal matrix whose columns are the corresponding normalized eigenvectors of $K$. We first prove that the optimal value of (8) can be achieved by a $W$ such that $W^\top K W$ is a diagonal matrix. Indeed, for a given $W$, consider $\widetilde{W} = W Q$, where $W^\top K W = Q \Lambda Q^\top$. Under this transformation, the objective is unchanged; but since the eigenvalues of $W^\top K W$ majorize its diagonal elements, the rate constraint improves. Using this fact, we prove an upper bound on the objective that can be achieved by choosing the nonzero columns of $W$ as eigenvectors of $K$. We then prove that for a strictly Schur-concave $\rho$ and for $K$ whose eigenvalues are distinct, this is the only way to attain the upper bound.

As a consequence of Theorem 1, encoding via an optimal $W$ can be viewed as a projection along the eigenvectors of $K$, followed by different scalings applied to each component, i.e., $W = U S$, where $S$ is a diagonal matrix with entries $s_i \ge 0$ and $U$ is the normalized eigenvector matrix. Only $S$ remains to be determined, and to this end, we may assume that $K$ is diagonal with nonincreasing diagonal entries, implying $U = I$. In subsequent sections, our choice of $\rho$ will be of the form $\rho(v) = \sum_{i=1}^{d} \rho_{\mathrm{sl}}([v]_i)$, where $\rho_{\mathrm{sl}} : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ is (strictly) concave,¹ making $\rho$ (strictly) Schur-concave. Therefore, (8) reduces to

$$ \inf_{S}\ \operatorname{tr}(K) - \operatorname{tr}\!\Big( K S \big( S K S + \sigma^2 I \big)^{-1} S K \Big) \quad \text{subject to} \quad R \ge \sum_{i=1}^{d} \rho_{\mathrm{sl}}\big(s_i^2 \sigma_i^2\big), \qquad (9) $$

where the infimum is over diagonal matrices $S$. To handle situations for which $\lim_{s \to \infty} \rho_{\mathrm{sl}}(s) < \infty$, we allow the diagonal entries of $S$ to be $\infty$, with the objective for such cases defined via its continuous extension.

¹ Here "sl" stands for "single-letter."

In the next section, we will solve (9) for several specific choices of $\rho_{\mathrm{sl}}$.
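Before specializing $\rho_{\mathrm{sl}}$, here is a small numerical check of the diagonalization step in the proof sketch of Theorem 1 (a sketch only; the covariance and the initial $W$ are random stand-ins, $\sigma^2 = 1/12$ as in Section 2, and $a = 15$ mirrors the compression setup of Section 5): replacing $W$ by $\widetilde{W} = WQ$ leaves the distortion in (8) unchanged while the strictly Schur-concave PBA rate in (5) does not increase.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2, a = 6, 1.0 / 12.0, 15.0

# Random covariance with (almost surely) distinct eigenvalues, and a random full-rank W.
A = rng.standard_normal((d, d))
K = A @ A.T
W = rng.standard_normal((d, d))

def distortion(W):
    # Objective of (8) with the LLSE decoder substituted in.
    M = W.T @ K @ W + sigma2 * np.eye(d)
    return np.trace(K) - np.trace(K @ W @ np.linalg.inv(M) @ W.T @ K)

def rate(W):
    # Strictly Schur-concave PBA rate from (5), applied to the latent variances.
    v = np.diag(W.T @ K @ W)
    return 0.5 * np.sum(np.log(4.0 * a**2 * v + 1.0))

# Diagonalize the latent covariance: W_tilde = W Q with W^T K W = Q Lambda Q^T.
lam, Q = np.linalg.eigh(W.T @ K @ W)
W_tilde = W @ Q

print(distortion(W), distortion(W_tilde))   # equal (up to floating point)
print(rate(W), rate(W_tilde))               # rate(W_tilde) <= rate(W), by majorization
```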
3. Explicit Solutions: Conventional Linear Autoencoders and PBA

3.1. Conventional Linear Autoencoders

Given a centered dataset $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, consider a linear autoencoder optimization problem in which the encoder and decoder, $W$ and $T$, respectively, are $d$-by-$k$ matrices, where $k \le d$ is a parameter. The goal is to minimize the mean squared error as given by (4). PCA corresponds to the globally optimal solution of this optimization problem, where $W = T = U_k$ and $U_k \in \mathbb{R}^{d \times k}$ is a matrix whose columns are the $k$ eigenvectors corresponding to the $k$ largest eigenvalues of $K$. However, there are multiple globally optimal solutions, given by any encoder-decoder pair of the form $(U_k V, U_k V)$, where $V$ is an orthogonal matrix (Baldi & Hornik, 1989).

We now recover linear autoencoders through our framework in Section 2. Consider the optimization problem in (9) where $\rho_{\mathrm{sl}} : \mathbb{R}_{\ge 0} \to \{0, 1\}$ is a concave function defined as

$$ \rho_{\mathrm{sl}}(x) = \mathbb{1}[x > 0]. \qquad (10) $$

Note that this penalizes the dimension of the latents, as desired. Note also that this cost is Schur-concave but not strictly so. The fact that PCA solves the conventional linear autoencoding problem, but is not necessarily the unique solution, follows immediately from Theorem 1.

Theorem 2. If $\rho_{\mathrm{sl}}(\cdot)$ is given by (10), then an optimal solution for (9) is given by a diagonal matrix $S$ whose top $\min(\lfloor R \rfloor, d)$ diagonal entries are equal to $\infty$ and whose remaining diagonal entries are 0.

Proof. Let $F \stackrel{\mathrm{def}}{=} \{i \in [d] : s_i > 0\}$, implying $|F| \le R$. Since $K$ and $S$ are diagonal, the optimization problem in (9) can be written as

$$ \inf_{S}\ \sum_{j \in [d] \setminus F} \sigma_j^2 \;+\; \sum_{\ell \in F} \frac{\sigma^2 \sigma_\ell^2}{\sigma^2 + \sigma_\ell^2 s_\ell^2} \quad \text{subject to} \quad R \ge \sum_{i=1}^{d} \mathbb{1}[s_i > 0]. \qquad (11) $$

Since the value of $s_\ell$, $\ell \in F$, does not affect the rate constraint, each of the $s_\ell$ can be made as large as possible without changing the rate constraint. Therefore, the infimum value of the objective is $\sum_{j \in [d] \setminus F} \sigma_j^2$. Since we seek to minimize the distortion, the optimal $F$ is the set of indices of the largest $|F|$ eigenvalues. Since the number of these eigenvalues cannot exceed $\lfloor R \rfloor$, we choose $|F| = \min(\lfloor R \rfloor, d)$.

Unlike in the conventional linear autoencoder framework, in Section 2 the latent variables $W^\top x$ are quantized, which we model with additive white noise of fixed variance. Therefore, an infinite value of $s_i$ indicates sending $u_i^\top x$ with full precision, where $u_i$ is the eigenvector corresponding to the $i$th largest eigenvalue. This implies that PCA with parameter $k$ corresponds to $W = U S$, where $S$ is a diagonal matrix whose top $k$ diagonal entries are equal to $\infty$ and whose $d - k$ remaining diagonal entries are 0. Therefore, for any $R$ such that $\lfloor R \rfloor = k$, an optimal solution to (9) corresponds to linearly projecting the data along the top $k$ eigenvectors, which is the same as PCA. Note that, like (Baldi & Hornik, 1989), we only prove that projecting along the eigenvectors is one of possibly many optimal solutions. However, even a slight amount of curvature in $\rho$ would make it strictly Schur-concave, thus recovering the principal directions. We next turn to a specific cost function with curvature, namely the PBA cost function that was our original motivation.

3.2. Principal Bit Analysis (PBA)

Consider the choice of $\rho_{\mathrm{sl}} : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ that provided the original impetus for Theorem 1. For $\gamma > 2/\sigma^2$,

$$ \rho_{\mathrm{sl}}(x) = \tfrac{1}{2}\log(\gamma x + 1). \qquad (12) $$

The nature of the optimization problem depends on the value of $\gamma$. For $1 \le \gamma\sigma^2 \le 2$, the problem can be made convex with a simple change of variable. For $\gamma\sigma^2 = 1$, the problem coincides with the classical waterfilling procedure in rate-distortion theory, in fact. For $\gamma\sigma^2 > 2$, the problem is significantly more challenging. Since we are interested in relatively large values of $\gamma$ for our compression application (see Section 5 to follow), we focus on the case $\gamma > 2/\sigma^2$.

Theorem 3. If $\rho_{\mathrm{sl}}(\cdot)$ is given by (12), then for any $\lambda > 0$, the pair $(R_{\mathrm{opt}}, D_{\mathrm{opt}})$ obtained from the output of Algorithm 1 satisfies

$$ D_{\mathrm{opt}} + \lambda R_{\mathrm{opt}} = \inf_{S}\ \operatorname{tr}(K) - \operatorname{tr}\!\Big( K S \big( S K S + \sigma^2 I \big)^{-1} S K \Big) + \lambda \sum_{i=1}^{d} \rho_{\mathrm{sl}}\big(s_i^2 \sigma_i^2\big). \qquad (14) $$

Algorithm 1: Principal Bit Analysis (PBA)

Require: $\lambda > 0$, $\alpha = \gamma\sigma^2 > 2$, $\operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$ with $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_d^2$.

1: If $\lambda \ge \sigma_1^2 / (4(\alpha - 1))$, output $R_{\mathrm{opt}} = 0$, $D_{\mathrm{opt}} = \sum_{i=1}^{d} \sigma_i^2$.
2: Set $\tilde{d} = \max\{ i : \lambda < \sigma_i^2 / (4(\alpha - 1)) \}$.
3: Set $R$, $D$ to zero arrays of size $2\tilde{d}$.
4: for $r \in \{1, 2, \ldots, \tilde{d}\}$ do
5: &nbsp;&nbsp; $D(2r-1) = \sum_{i=1}^{r} \frac{\sigma_i^2}{2(\alpha-1)} (1 - c_i) + \sum_{i=r+1}^{d} \sigma_i^2$
6: &nbsp;&nbsp; $R(2r-1) = \sum_{i=1}^{r} \left[ \tfrac{1}{2}\log\frac{\sigma_i^2}{4\lambda} + \log(1 + c_i) \right]$
7: &nbsp;&nbsp; $D(2r) = \sum_{i=1}^{r-1} \frac{\sigma_i^2}{2(\alpha-1)} (1 - c_i) + \frac{\sigma_r^2}{2(\alpha-1)} (1 + c_r) + \sum_{i=r+1}^{d} \sigma_i^2$
8: &nbsp;&nbsp; $R(2r) = \sum_{i=1}^{r} \tfrac{1}{2}\log\frac{\sigma_i^2}{4\lambda} + \sum_{i=1}^{r-1} \log(1 + c_i) + \log(1 - c_r)$
9: end for
10: $r^* \leftarrow \arg\min_{j \in [2\tilde{d}]} D(j) + \lambda R(j)$.
11: Output $R_{\mathrm{opt}} = R(r^*)$, $D_{\mathrm{opt}} = D(r^*)$.

Note that by sweeping $\lambda > 0$, one can compute the lower convex envelope of the $(D, R)$ curve. Since every Pareto-optimal $(D, R)$ point must be a stationary point of (14), one can also use Algorithm 1 to compute the $(D, R)$ curve itself by sweeping $\lambda$ and retaining all those stationary points that are not Pareto dominated.
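As a side note on the $\gamma\sigma^2 = 1$ regime mentioned above: there the problem coincides with classical reverse waterfilling, whose solution allocates distortion $D_i = \min(\theta, \sigma_i^2)$ to component $i$ at rate $\tfrac{1}{2}\log(\sigma_i^2/D_i)$ (Cover & Thomas, 2006). The following is a minimal sketch of that classical allocation, not of Algorithm 1 itself (rates in nats; the eigenvalues and target rate are made up; the water level $\theta$ is found by bisection).

```python
import numpy as np

def reverse_waterfill(eigvals, R):
    """Classical reverse waterfilling: choose theta so that
    sum_i 0.5 * log(sigma_i^2 / min(theta, sigma_i^2)) = R (nats)."""
    lo, hi = 1e-12, max(eigvals)
    for _ in range(200):                    # bisection on the water level theta
        theta = 0.5 * (lo + hi)
        rate = sum(0.5 * np.log(s / min(theta, s)) for s in eigvals)
        if rate > R:                        # rate too high -> raise the water level
            lo = theta
        else:
            hi = theta
    D = [min(theta, s) for s in eigvals]    # per-component distortions
    return theta, D, sum(D)

eigvals = [4.0, 2.0, 1.0, 0.25]             # hypothetical eigenvalues of K
print(reverse_waterfill(eigvals, R=2.0))
```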
4. Application to Variable-Rate Compression

We have seen that an autoencoder formulation inspired by data compression succeeds in providing guaranteed recovery of the principal source components. Conversely, a number of successful multimedia compressors have recently been proposed that are either related to, or directly inspired by, autoencoders (Ballé et al., 2021; 2016; Toderici et al., 2017; 2016). In particular, Ballé et al. (2018) show that the objective minimized by their compressor coincides with that of variational autoencoders. Following (Ballé et al., 2021), we refer to this as the nonlinear transform coding (NTC) objective. We next use Theorem 1 to show that any minimizer of the NTC objective is guaranteed to recover the principal source components if (1) the source is Gaussian distributed, (2) the transforms are restricted to be linear, and (3) the entropy model is factorized, as explained below.

Let $x \sim \mathcal{N}(0, K)$, where $K$ is a positive semidefinite covariance matrix. As before, we consider an autoencoder defined by its encoder-decoder pair $(f, g)$, where for $k \le d$, $f : \mathbb{R}^d \to \mathbb{R}^k$ and $g : \mathbb{R}^k \to \mathbb{R}^d$ are chosen from prespecified classes $\mathcal{C}_f$ and $\mathcal{C}_g$. The NTC framework assumes dithered quantization (Agustsson & Theis, 2020; Choi et al., 2019) as in Section 2, and seeks to minimize the Lagrangian

$$ \inf_{f \in \mathcal{C}_f,\, g \in \mathcal{C}_g}\ \mathbb{E}_{x, \varepsilon}\!\left[ \big\| x - g\big( Q(f(x) + \varepsilon) - \varepsilon \big) \big\|_2^2 \right] + \lambda H\big( Q(f(x) + \varepsilon) - \varepsilon \,\big|\, \varepsilon \big), $$

where $\lambda > 0$ and $\varepsilon$ has i.i.d. $\mathrm{Unif}[-0.5, 0.5]$ components. NTC assumes variable-length compression, and the quantity $H( Q(f(x)+\varepsilon) - \varepsilon \,|\, \varepsilon )$ is an accurate estimate of the minimum expected codelength of the discrete random vector $Q(f(x)+\varepsilon)$. As we noted in Section 2, (Zamir & Feder, 1992) showed that for any random variable $x$, $Q(x+\varepsilon) - \varepsilon$ and $x + \varepsilon$ have the same joint distribution with $x$. They also showed that

$$ H\big( Q(x+\varepsilon) - \varepsilon \,\big|\, \varepsilon \big) = I(x+\varepsilon; x) = h(x+\varepsilon), $$

where $h(\cdot)$ denotes differential entropy. Therefore, the objective can be written as

$$ \inf_{f \in \mathcal{C}_f,\, g \in \mathcal{C}_g}\ \mathbb{E}_{x, \varepsilon}\!\left[ \| x - g(f(x) + \varepsilon) \|_2^2 \right] + \lambda h(f(x) + \varepsilon). \qquad (15) $$

(Compare eq. (13) in (Ballé et al., 2021).) We consider the case where $\mathcal{C}_f$, $\mathcal{C}_g$ are the class of linear functions. Let $W$, $T$ be $d$-by-$d$ matrices. Define $f(x) = W x$, $g(x) = T x$. Substituting this in the above equation, we obtain

$$ \inf_{W, T}\ \mathbb{E}_{x, \varepsilon}\!\left[ \| x - T(W x + \varepsilon) \|^2 \right] + \lambda h(W x + \varepsilon). \qquad (16) $$

Since $T$ does not appear in the rate term, the optimal $T$ can be chosen to be the minimum mean squared error estimator of $x \sim \mathcal{N}(0, K)$ given $W x + \varepsilon$, as in Section 2. This gives

$$ \inf_{W}\ \operatorname{tr}(K) - \operatorname{tr}\!\Big( K W^\top \big( W K W^\top + \tfrac{1}{12} I \big)^{-1} W K \Big) + \lambda h(W x + \varepsilon). \qquad (17) $$

As noted earlier, the rate term $h(W x + \varepsilon)$ is an accurate estimate of the minimum expected length of the compressed representation of $Q(W x + \varepsilon)$. This assumes that the different components of this vector are encoded jointly. However, in practice, one often encodes them separately, relying on the transform $W$ to eliminate redundancy among the components. Accordingly, we replace the rate term with $\sum_{i=1}^{d} h\big( w_i^\top x + [\varepsilon]_i \big)$ to arrive at the following optimization problem:

$$ \inf_{W}\ \operatorname{tr}(K) - \operatorname{tr}\!\Big( K W^\top \big( W K W^\top + \tfrac{1}{12} I \big)^{-1} W K \Big) + \lambda \sum_{i=1}^{d} h\big( w_i^\top x + [\varepsilon]_i \big). \qquad (18) $$
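For Gaussian $x$, the per-component rate term in (18) depends on $w_i$ only through the latent variance $s = w_i^\top K w_i$, so it equals the differential entropy of $\sqrt{s}\,Z + \varepsilon$ with $Z$ standard Normal. The sketch below evaluates that entropy by numerical integration and checks, on a grid, the concavity in $s$ that Theorem 4 below relies on (assumes SciPy is available; the grid and integration limits are arbitrary choices, and entropies are in nats).

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def rho_sl(s):
    """Differential entropy (nats) of sqrt(s)*Z + U, Z ~ N(0,1), U ~ Unif[-1/2, 1/2]."""
    sd = np.sqrt(s)

    def density(y):
        # Density of the convolution: P(y - 1/2 <= sqrt(s) Z <= y + 1/2).
        return norm.cdf(y + 0.5, scale=sd) - norm.cdf(y - 0.5, scale=sd)

    def integrand(y):
        f = density(y)
        return -f * np.log(f) if f > 0 else 0.0

    lim = 6.0 * sd + 1.0   # covers essentially all of the probability mass
    return quad(integrand, -lim, lim, limit=200)[0]

s_grid = np.linspace(0.1, 5.0, 25)
h_vals = np.array([rho_sl(s) for s in s_grid])
print(np.all(np.diff(h_vals, 2) < 0))   # negative second differences: consistent with strict concavity
```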
Theorem 4. Suppose $K$ has distinct eigenvalues. Then any $W$ that achieves the infimum in (18) has the property that all of its nonzero rows are eigenvectors of $K$.

Proof. Since the distribution of $\varepsilon$ is fixed, by the Gaussian assumption on $x$, $h\big( w_j^\top x + [\varepsilon]_j \big)$ depends on $w_j$ only through $w_j^\top K w_j$. Thus we may write

$$ h\big( w_j^\top x + \epsilon \big) = \rho_{\mathrm{sl}}\big( w_j^\top K w_j \big). \qquad (19) $$

By Theorem 1, it suffices to show that $\rho_{\mathrm{sl}}(\cdot)$ is strictly concave. Let $Z$ be a standard Normal random variable and let $\epsilon$ be uniformly distributed over $[-1/2, 1/2]$, independent of $Z$. Then we have

$$ \rho_{\mathrm{sl}}(s) = h\big( \sqrt{s}\, Z + \epsilon \big). \qquad (20) $$

Thus by de Bruijn's identity (Cover & Thomas, 2006),

$$ \rho_{\mathrm{sl}}'(s) = \tfrac{1}{2} J\big( \epsilon + \sqrt{s}\, Z \big), \qquad (21) $$

where $J(\cdot)$ is the Fisher information. To show that $\rho_{\mathrm{sl}}(\cdot)$ is strictly concave, it suffices to show that $J(\epsilon + \sqrt{s}\, Z)$ is strictly decreasing in $s$.² To this end, let $t > s > 0$ and let $Z_1$ and $Z_2$ be i.i.d. standard Normal random variables, independent of $\epsilon$. Then

$$ J\big( \epsilon + \sqrt{t}\, Z \big) = J\big( \epsilon + \sqrt{s}\, Z_1 + \sqrt{t - s}\, Z_2 \big) \qquad (22) $$

and by the convolution inequality for Fisher information (Blachman, 1965),

$$ \frac{1}{J\big( \epsilon + \sqrt{s}\, Z_1 + \sqrt{t - s}\, Z_2 \big)} \;>\; \frac{1}{J\big( \epsilon + \sqrt{s}\, Z_1 \big)} + \frac{1}{J\big( \sqrt{t - s}\, Z_2 \big)} \;>\; \frac{1}{J\big( \epsilon + \sqrt{s}\, Z_1 \big)}, $$

where the first inequality is strict because $\epsilon + \sqrt{s}\, Z_1$ is not Gaussian distributed.

² If $g'(\cdot)$ is strictly decreasing, then for all $t > s$, $g(t) = g(s) + \int_s^t g'(u)\, du < g(s) + g'(s)(t - s)$, and likewise for $t < s$. That $g(\cdot)$ is strictly concave then follows from the standard first-order test for concavity (Boyd & Vandenberghe, 2004).

5. Compression Experiments

We validate the PBA algorithm experimentally by comparing the performance of a PBA-derived fixed-rate compressor against the performance of baseline fixed-rate compressors. The code for our implementation can be found at https://github.com/SourbhBh/PBA. Although variable-rate codes are more commonplace in practice, fixed-rate codes do offer some advantages over their more general counterparts:

1. In applications where a train of source realizations will be compressed sequentially, fixed-rate coding allows for simple concatenation of the compressed representations, which also helps in maintaining synchrony.
2. In applications where the source realizations in a dataset are individually compressed, fixed-rate coding allows for random access of the data points from the compressed representation.
3. In streaming, bandwidth provisioning is simplified when the bit-rate is constant over time.

Fixed-rate compressors exist for specialized sources such as speech (McCree & Barnwell, 1995; Schroeder & Atal, 1985) and audio more generally (Vor). We consider a general-purpose, learned, fixed-rate compressor derived from PBA and the following two quantization operations. The first, $Q_{\mathrm{CD}}(a, \sigma^2, U, x)$,³ accepts the hyperparameter $a$, a variance estimate $\sigma^2$, a dither realization $U$, and the scalar source realization to be compressed, $x$, and outputs (a binary representation of) the nearest point to $x$ in the set

$$ \big\{\, i + U : i \in \mathbb{Z} \text{ and } i + U \in (-\Gamma/2, \Gamma/2] \,\big\}, \qquad (24) $$

where

$$ \Gamma = 2^{\left\lceil \frac{1}{2} \log_2\left( 4 a^2 \sigma^2 + 1 \right) \right\rceil}. \qquad (25) $$

This evidently requires $\log_2 \Gamma$ bits. The second function, $Q'_{\mathrm{CD}}(a, \sigma^2, U, b)$, where $b$ is a binary string of length $\log_2 \Gamma$, maps the binary representation $b$ back to the corresponding point in (24). These quantization routines are applied separately to each latent component. The $\sigma^2$ parameters are determined during training. The dither $U$ is chosen uniformly over the set $[-1/2, 1/2]$, independently for each component. We assume that $U$ is chosen pseudorandomly from a fixed seed that is known to both the encoder and the decoder. For our experiments, we fix the $a$ parameter at 15 and hard-code this in both the encoder and the decoder. We found that this choice balances the dual goals of minimizing the excess distortion due to clamping quantized points to the interval $(-\Gamma/2, \Gamma/2]$ and minimizing the rate.

³ "CD" stands for "clamped dithered."

PBA compression proceeds by applying Algorithm 1 to a training set to determine the matrices $W$ and $T$. The variance estimates $\sigma_1^2, \ldots, \sigma_d^2$ for the $d$ latent variances are chosen as the empirical variances on the training set and are hard-coded in the encoder and decoder, as is the parameter $a$. Given a data point $x$, the encoded representation is the concatenation of the bit strings $b_1, \ldots, b_d$, where

$$ b_i = Q_{\mathrm{CD}}\big( a, \sigma_i^2, U_i, w_i^\top x \big). $$

The decoder parses the received bits into $b_1, \ldots, b_d$ and computes the latent reconstruction $\hat{y}$, where

$$ \hat{y}_i = Q'_{\mathrm{CD}}\big( a, \sigma_i^2, U_i, b_i \big). $$

The reconstruction is then $T \hat{y}$.
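The following is a minimal sketch of one way to realize the clamped dithered quantizer pair described above. It is an illustrative reading of $Q_{\mathrm{CD}}$ and $Q'_{\mathrm{CD}}$, not the released implementation: the variance value and the input are made up, and edge cases at the interval endpoints are glossed over.

```python
import numpy as np

def qcd_encode(a, sigma2, u, x):
    """Sketch of Q_CD: quantize x to the nearest point i + u (i integer) inside (-Gamma/2, Gamma/2]."""
    bits = int(np.ceil(0.5 * np.log2(4.0 * a**2 * sigma2 + 1.0)))
    gamma = 2 ** bits
    i_min = int(np.floor(-gamma / 2.0 - u)) + 1          # smallest admissible integer i
    i = int(np.round(x - u))                             # index of the lattice point nearest to x
    i = min(max(i, i_min), i_min + gamma - 1)            # clamp to the Gamma admissible points
    return i - i_min, bits                               # index in [0, Gamma), sent with `bits` bits

def qcd_decode(a, sigma2, u, index):
    """Sketch of Q'_CD: map the received index back to the reconstruction point i + u."""
    bits = int(np.ceil(0.5 * np.log2(4.0 * a**2 * sigma2 + 1.0)))
    gamma = 2 ** bits
    i_min = int(np.floor(-gamma / 2.0 - u)) + 1
    return (i_min + index) + u

# Per-component usage: latent y_i = w_i^T x, dither u_i ~ Unif[-1/2, 1/2] from a shared seed.
rng = np.random.default_rng(42)        # seed assumed known to both encoder and decoder
a, sigma2 = 15.0, 2.7                  # hypothetical latent variance estimate
u = rng.uniform(-0.5, 0.5)
idx, nbits = qcd_encode(a, sigma2, u, 1.3)
print(nbits, qcd_decode(a, sigma2, u, idx))
```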
We evaluate the PBA compressor on MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), the MIT Faces Dataset (Fac), and the Free Spoken Digit Dataset (FSDD) (Jackson). We compare our algorithms mainly using mean squared error, since our theoretical analysis uses mean squared error as the distortion metric. Our plots display signal-to-noise ratios (SNRs) for ease of interpretation. For image datasets, we also compare our algorithms using the structural similarity (SSIM) or the multi-scale structural similarity (MS-SSIM) metrics when applicable (Wang et al., 2004). We also consider errors on downstream tasks, specifically classification, as a distortion measure. We plot results from only selected datasets here and defer the rest to the supplementary.

For all datasets, we compare the performance of the PBA compressor against a baseline scheme derived from PCA. The PCA-based scheme sends some of the principal components essentially losslessly, and sends no information about the others. Specifically, for any given $k$, we choose the first $k$ columns of $W$ to be aligned with the first $k$ principal components of the dataset; the remaining columns are zero. Each nonzero column is scaled such that projections on the column contain all significant digits. This is done so that at high rates, the quantization procedure sends the $k$ principal components losslessly. The quantization and decoder operations are as in the PBA-based scheme; in particular, the $a$ parameter is as specified above. By varying $k$, we trade off rate and distortion.

5.1. SNR Performance

We examine compression performance under mean squared error, or equivalently, the SNR, defined as

$$ \mathrm{SNR} = 10 \log_{10} \frac{P}{D}, $$

where $P$ is the empirical second moment of the dataset and $D$ is the mean squared error of the reconstruction. PBA (and PCA) is designed to minimize this objective.

Figure 2. Reconstructions at different bits/pixel values for PCA (top) and PBA (bottom).

Figure 3. Plot of SNR/pixel vs. rate (bits/pixel). Left to right: CIFAR-10, FSDD, CIFAR-10, MNIST. In the last two figures, reconstructions are not rounded to integers from 0 to 255. All figures are zoomed-in; zoomed-out versions are in the supplementary.

Figure 4. Number of components sent vs. rate (bits/pixel). Left: CIFAR-10. Right: MNIST.

In Figure 2, we display reconstructions for a particular image in the Faces Dataset under PBA and PCA. The first two plots in Figure 3 show the tradeoff for PBA and PCA against JPEG and JPEG2000 (for CIFAR-10) and AAC (for FSDD). All of the image datasets have integer pixel values between 0 and 255. Accordingly, we round the reconstructions of PBA and PCA to the nearest integer in this range. The last two plots in Figure 3 show the same tradeoff for PBA and PCA when reconstructions are not rounded off to the nearest integer. We see that PBA consistently outperforms PCA and JPEG, and is competitive with JPEG2000, even though JPEG and JPEG2000 are variable-rate.⁴ We estimate the size of the JPEG header by compressing an empty header and subtract this estimate from all the compression sizes produced by JPEG. For audio data, we observe that PBA consistently outperforms PCA and AAC.

Since the image data all use 8 bits per pixel, one can obtain infinite SNR at this rate via the trivial encoding that communicates the raw bits. PCA and PBA do not find this solution because they quantize in the transform domain, where the lattice nature of the pixel distribution is not apparent.
Determining how to leverage lattice structure in the source distribution for purposes of compression is an interesting question that transcends the PBA and PCA algorithms, and we do not pursue it here. PCA performs poorly because it favors sending the less significant bits of the most significant components over the most significant bits of less significant components, when the latter are more valuable for reconstructing the source. Arguably, it does not identify the principal bits. Figure 4 shows the number of distinct components about which information is sent as a function of rate for both PBA and PCA. We see that PBA sends information about many more components for a given rate than does PCA.

⁴ It should be noted, however, that JPEG and JPEG2000 aim to minimize subjective distortion, not MSE, and they do not allow for training on sample images, as PBA and PCA do. A similar caveat applies to AAC.

5.2. SSIM Performance

Structural similarity (SSIM) and multi-scale structural similarity (MS-SSIM) are metrics that are attuned to perceptual similarity. Given two images, the SSIM metric outputs a real value between 0 and 1, where a higher value indicates more similarity between the images. We evaluate the performance of our algorithms on these metrics as well in Figure 5. We see that PBA consistently dominates PCA and, although it was not optimized for this metric, beats JPEG at low rates as well.

Figure 5. SSIM vs. rate (bits/pixel). Left: CIFAR-10. Right: MNIST.

5.3. Performance on Downstream Tasks

Lastly, we compare the impact of using PBA and PCA on an important downstream task, namely classification performance. We evaluate the algorithms on the MNIST and CIFAR-10 datasets and use neural networks for classification. Our hyperparameter and architecture choices are in the supplementary. We divide the dataset into three parts. From the first part, we obtain the covariance matrix that we use for PCA and the PBA compressor. The second and third parts are used as training and testing data for the purpose of classification. For a fixed rate, reconstructions are passed to the neural networks for training and testing, respectively. Since our goal is to compare classification accuracy across the compressors, we fix both the architecture and the hyperparameters and don't perform any additional tuning for the algorithms separately. Figure 6 shows that PBA outperforms PCA in terms of accuracy. The difference is especially significant at low rates, and all algorithms attain roughly the same performance at higher rates.

Figure 6. Accuracy vs. rate (bits/pixel). Left: CIFAR-10. Right: MNIST.

6. Acknowledgements

This research was supported by the US National Science Foundation under grants CCF-2008266, CCF-1934985, CCF-1617673, CCF-1846300, CCF-1815893 and the US Army Research Office under grant W911NF-18-1-0426.

References

Faces dataset. https://courses.media.mit.edu/2004fall/mas622j/04.projects/faces/. Accessed: 2021-02-03.

Vorbis audio compression. https://xiph.org/vorbis/. Accessed: 2021-01-26.

Agustsson, E. and Theis, L. Universally Quantized Neural Compression. In Advances in Neural Information Processing Systems 33, pp. 12367–12376, 2020.

Agustsson, E., Mentzer, F., Tschannen, M., Cavigelli, L., Timofte, R., Benini, L., and Gool, L. V. Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. In Advances in Neural Information Processing Systems 30, pp. 1141–1151, 2017.
Agustsson, E., Tschannen, M., Mentzer, F., Timofte, R., and Gool, L. V. Generative Adversarial Networks for Extreme Learned Image Compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 221–231, 2019.

Baldi, P. and Hornik, K. Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 2(1):53–58, 1989.

Ballé, J., Laparra, V., and Simoncelli, E. End-to-end Optimization of Nonlinear Transform Codes for Perceptual Quality. In 2016 Picture Coding Symposium (PCS), 2016.

Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, N. Variational Image Compression with a Scale Hyperprior. In International Conference on Learning Representations, 2018.

Ballé, J., Chou, P. A., Minnen, D., Singh, S., Johnston, N., Agustsson, E., Hwang, S. J., and Toderici, G. Nonlinear Transform Coding. IEEE Journal of Selected Topics in Signal Processing, 15(2):339–353, 2021. doi: 10.1109/JSTSP.2020.3034501.

Bao, X., Lucas, J., Sachdeva, S., and Grosse, R. B. Regularized Linear Autoencoders Recover the Principal Components, Eventually. In Advances in Neural Information Processing Systems, volume 33, pp. 6971–6981, 2020.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738.

Blachman, N. M. The Convolution Inequality for Entropy Powers. IEEE Trans. Inf. Theory, 11(2):267–271, April 1965.

Bourlard, H. and Kamp, Y. Auto-association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 59:291–294, 1988.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge, 2004.

Choi, Y., El-Khamy, M., and Lee, J. Variable Rate Deep Image Compression with a Conditional Autoencoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley, 2006.

do Espírito Santo, R. Principal Component Analysis Applied to Digital Image Compression. Einstein (São Paulo), 10:135–139, June 2012. ISSN 1679-4508.

Habibian, A., Rozendaal, T. v., Tomczak, J. M., and Cohen, T. S. Video Compression with Rate-Distortion Autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7033–7042, 2019.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.

Jackson, Z. Free Spoken Digit Dataset (FSDD). https://github.com/Jakobovski/free-spoken-digit-dataset.

Kay, S. M. Estimation Theory. Prentice Hall PTR, 1998.

Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master's Thesis, Department of Computer Science, University of Toronto, 2009.

Kunin, D., Bloom, J., Goeva, A., and Seed, C. Loss Landscapes of Regularized Linear Autoencoders. In International Conference on Machine Learning, volume 97, pp. 3560–3569, 2019.

Ladjal, S., Newson, A., and Pham, C. A PCA-like Autoencoder. arXiv preprint arXiv:1904.01277, 2019.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, M., Zuo, W., Gu, S., Zhao, D., and Zhang, D. Learning Convolutional Networks for Content-Weighted Image Compression. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3214–3223, 2018. doi: 10.1109/CVPR.2018.00339.

Liao, L., Zhang, X., Wang, X., Lin, S., and Liu, X. Generalized Image Reconstruction over T-Algebra. arXiv:2101.06650, 2021.
Lucas, J., Tucker, G., Grosse, R. B., and Norouzi, M. Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. In Advances in Neural Information Processing Systems, volume 32, pp. 9408–9418, 2019.

McCree, A. V. and Barnwell, T. P. A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding. IEEE Transactions on Speech and Audio Processing, 3(4):242–250, 1995.

Oftadeh, R., Shen, J., Wang, Z., and Shell, D. Eliminating the Invariance on the Loss Landscape of Linear Autoencoders. In International Conference on Machine Learning, pp. 7405–7413, 2020.

Rippel, O. and Bourdev, L. Real-Time Adaptive Image Compression. In International Conference on Machine Learning, pp. 2922–2930, 2017.

Rolínek, M., Zietlow, D., and Martius, G. Variational Autoencoders Pursue PCA Directions (by Accident). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Schroeder, M. and Atal, B. Code-excited Linear Prediction (CELP): High-quality Speech at Very Low Bit Rates. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 10, pp. 937–940, 1985.

Theis, L., Shi, W., Cunningham, A., and Huszár, F. Lossy Image Compression with Compressive Autoencoders. In International Conference on Learning Representations, 2017.

Toderici, G., O'Malley, S. M., Hwang, S. J., Vincent, D., Minnen, D., Baluja, S., Covell, M., and Sukthankar, R. Variable Rate Image Compression with Recurrent Neural Networks. In International Conference on Learning Representations, 2016.

Toderici, G., Vincent, D., Johnston, N., Hwang, S. J., Minnen, D., Shor, J., and Covell, M. Full Resolution Image Compression with Recurrent Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5435–5443, 2017.

Tschannen, M., Agustsson, E., and Lucic, M. Deep Generative Models for Distribution-Preserving Lossy Compression. In Advances in Neural Information Processing Systems 31, pp. 5929–5940, 2018.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Zamir, R. and Feder, M. On Universal Quantization by Randomized Uniform/Lattice Quantizers. IEEE Trans. Inf. Theory, 38:428–436, 1992.

Zhou, L., Cai, C., Gao, Y., Su, S., and Wu, J. Variational Autoencoder for Low Bit-rate Image Compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2617–2620, 2018.