# Conservativeness of Untied Auto-Encoders

Daniel Jiwoong Im, Montreal Institute for Learning Algorithms, University of Montreal, Montreal, QC, H3C 3J7, imdaniel@iro.umontreal.ca
Mohamed Ishmael Belghazi, HEC Montreal, 3000 Ch. de la Côte-Ste-Catherine, Montreal, QC, H3T 2A7, mohamed.2.belghazi@hec.ca
Roland Memisevic, Montreal Institute for Learning Algorithms, University of Montreal, Montreal, QC, H3C 3J7, roland.memisevic@umontreal.ca

Authors contributed equally.

## Abstract

We discuss necessary and sufficient conditions for an auto-encoder to define a conservative vector field, in which case it is associated with an energy function akin to the unnormalized log-probability of the data. We show that the conditions for conservativeness are more general than requiring encoder and decoder weights to be the same ("tied weights"), and that they also depend on the form of the hidden unit activation functions. Moreover, we show that contractive training criteria, such as denoising, enforce these conditions locally. Based on these observations, we show how auto-encoders can be used to extract the conservative component of a vector field.

## Introduction

An auto-encoder is a feature learning model that learns to reconstruct its inputs by passing them through one or more capacity-constrained "bottleneck" layers. The motivation for auto-encoders is that reconstructing input data is possible only for a model that creates a faithful internal representation of the data. If that representation is forced through a bottleneck, such as a low-dimensional or sparse hidden layer, then the model must have learned to compress (or, equivalently, understand) the data.

Since it technically defines a function $r \colon \mathbb{R}^n \to \mathbb{R}^n$ that maps data to its reconstruction, an auto-encoder can also be viewed as a dynamical system, and minimizing reconstruction error can be viewed as a way to encourage the system to have fixed points at the data (Seung 1998). Recent renewed interest in the dynamical-systems perspective has led to a variety of results that help clarify the role of auto-encoders and their relationship to probabilistic models. For example, (Vincent et al. 2008; Swersky et al. 2011) showed that training an auto-encoder to denoise corrupted inputs is closely related to performing score matching (Hyvärinen 2005) in an undirected model. Similarly, (Alain and Bengio 2014) showed that training the model to denoise inputs, or to reconstruct them under a suitable choice of regularization penalty, lets the auto-encoder approximate the derivative of the empirical data density. And (Kamyshanska 2013) showed that, regardless of the training criterion, an auto-encoder whose weights are tied (decoder weights identical to encoder weights) can be written as the derivative of a scalar potential, or energy function.

In general, a dynamical system that can be written as the derivative of such a scalar function is called conservative. When it is conservative, the auto-encoder thus defines, via this scalar function, an energy landscape, which may be viewed as an unnormalized log-density in the data space. As such, the auto-encoder could in principle be turned into a probabilistic model of the data by simply exponentiating and normalizing this function. In fact, for sigmoid hidden units the potential function is identical to the free energy of an RBM (Kamyshanska 2013). This shows that rather than thinking of auto-encoders and RBMs as two distinct types of model, one may think of them as merely two different training schemes for fitting the same kind of density function to the data.
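To make the dynamical-systems view above concrete, here is a minimal numpy sketch (not code from the paper; the dimensions, weight scale, and iteration count are arbitrary choices) that iterates the reconstruction map of a small tied sigmoid auto-encoder. With small weights the map is a contraction, so repeated application settles at a fixed point $r(x^*) = x^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 3, 5
W = 0.2 * rng.standard_normal((d, H))   # small weights keep the map contractive
b = np.zeros(H)
c = rng.standard_normal(d)

def r(x):
    # One-hidden-layer auto-encoder with tied weights and sigmoid hidden units.
    s = 1.0 / (1.0 + np.exp(-(W.T @ x + b)))
    return W @ s + c

# Iterating the reconstruction map drives x towards a fixed point r(x*) = x*.
x = rng.standard_normal(d)
for _ in range(200):
    x = r(x)
print(np.linalg.norm(r(x) - x))   # essentially zero: x has settled at a fixed point
```

Training would place such fixed points at (or near) the data; here the fixed point is arbitrary since the weights are random.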
Besides rooting auto-encoders in the realm of probabilistic and energy-based models, the energy function has been of interest because it may help explain how training of multilayer neural networks could be possible via local, biologically plausible updates (see, for example, (Bengio 2014)).

For untied auto-encoders it has not been clear whether an energy function exists, under which conditions it exists or does not exist, or even how to define it when the decoder weights differ from the encoder weights. In this paper, we describe necessary and sufficient conditions for the existence of an energy function ("conservativeness"), and we show that suitable learning criteria lead to an auto-encoder that satisfies these conditions at least locally, near the training data. We verify our results experimentally. We also show how an auto-encoder can be used to extract the conservative part of a vector field.

## Background

We will focus on auto-encoders of the form

$$r(x) = R\, h\!\left(W^T x + b\right) + c \qquad (1)$$

where $x \in \mathbb{R}^n$ is an observation, $W \in \mathbb{R}^{n \times d}$ and $R \in \mathbb{R}^{n \times d}$ are the encoder and decoder weight matrices for $d$ hidden units, $b$ and $c$ are biases, and $h(\cdot)$ is an elementwise hidden activation function. An auto-encoder can be identified with its vector field $r(x) - x$, the set of vectors pointing from observations to their reconstructions. The vector field is called conservative if it can be written as the gradient of a scalar function $F(x)$, called the potential or energy function:

$$r(x) - x = \nabla F(x) \qquad (2)$$

The energy function can be thought of as an unnormalized data density, and most common training criteria encourage large values near the data. In the case where an energy function exists, we can integrate Eq. 2 to find it. (Kamyshanska 2013) shows that for an auto-encoder with tied weights and real-valued observations the energy function takes the general form

$$F(x) = \int h(u)\, du - \frac{1}{2}\|x - c\|_2^2 + \mathrm{const} \qquad (3)$$

where $u = W^T x + b$ is an auxiliary variable and $h(\cdot)$ can be any elementwise activation function with a known antiderivative. For example, the energy function of an auto-encoder with sigmoid activation function is identical to the free energy of a Gaussian RBM (Hinton 2010):

$$F(x) = \sum_k \log\left(1 + \exp\!\left(W_k^T x + b_k\right)\right) - \frac{1}{2}\|x - c\|_2^2 + \mathrm{const} \qquad (4)$$

A sufficient condition for the existence of an energy function is that the weights are tied (Kamyshanska 2013), but it has not been clear whether this is also necessary. A peculiar phenomenon in practice is that decoder and encoder weights very commonly become similar (albeit not necessarily tied) in response to training. An example of this effect is shown in Figure 1.[1]

Figure 1: Encoder weights $W$ (left) and decoder weights $R^T$ (right).

[1] We found these kinds of behaviours not only for unwhitened, but also for binary data.

This raises the question of why this happens, whether the quasi-tying of weights has anything to do with the emergence of an energy function, and, if so, whether there is a way to compute the energy function despite the lack of exact symmetry. We shall address these questions in what follows.
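As a quick numerical sanity check of Eqs. 2-4 (a sketch of ours, not code from the paper; the shapes and random parameters are arbitrary), the following compares the vector field $r(x) - x$ of a tied sigmoid auto-encoder against a finite-difference gradient of the energy in Eq. 4.

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 4, 6
W = rng.standard_normal((d, H))
b = rng.standard_normal(H)
c = rng.standard_normal(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def r(x):
    # Tied sigmoid auto-encoder, Eq. 1 with R = W.
    return W @ sigmoid(W.T @ x + b) + c

def F(x):
    # Energy of Eq. 4: sum_k log(1 + exp(W_k^T x + b_k)) - 0.5 * ||x - c||^2.
    return np.sum(np.logaddexp(0.0, W.T @ x + b)) - 0.5 * np.sum((x - c) ** 2)

# Check Eq. 2: the vector field r(x) - x matches a finite-difference gradient of F.
x = rng.standard_normal(d)
eps = 1e-5
num_grad = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(num_grad - (r(x) - x))))   # tiny: the tied field is conservative
```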
## Conservative auto-encoders

One of the central objectives of this paper is understanding the conditions for an auto-encoder to be conservative[2] and thus to have a well-defined energy function. In the following subsection we derive and explain these conditions.

[2] The expressions "conservative vector field" and "conservative auto-encoder" will be used interchangeably.

### Conditions for conservative auto-encoders

Proposition 1. Consider an $m$-hidden-layer auto-encoder defined as

$$r(x; \theta) = W^{(m)} h^{(m)}\!\Big(W^{(m-1)} h^{(m-1)}\big(\cdots W^{(1)} h^{(1)}(x) \cdots\big) + c^{(m-1)}\Big) + c^{(m)},$$

where $\theta = \bigcup_{k=0}^{m} \theta^{(k)}$ with $\theta^{(k)} = \{W^{(k)}, c^{(k)}\}$ the parameters of the model, and $h^{(k)}(\cdot)$ is a smooth elementwise activation function at layer $k$. Then the auto-encoder is conservative over a smooth, simply connected domain $K \subseteq \mathbb{R}^D$ if and only if its reconstruction Jacobian $\frac{\partial r(x)}{\partial x}$ is symmetric for all $x \in K$.

A formal proof is provided in the supplementary material. A region $K$ is said to be simply connected if and only if any simple curve in $K$ can be shrunk to a point. Not every region of $\mathbb{R}^D$ is simply connected: for instance, a curve surrounding a puncture in $\mathbb{R}^2$ cannot be continuously deformed to a point without crossing the punctured region. However, as long as we make the reasonable assumption that the activation function does not have a continuum of discontinuities, we should not run into trouble. This makes our analysis valid for activation functions with cusps, such as ReLUs.

Throughout the paper, our focus will be on one-hidden-layer auto-encoders. Although the necessary and sufficient conditions for their conservativeness are a special case of the above proposition, it is worthwhile to derive them explicitly.

Proposition 2. Let $r(x)$ be a one-hidden-layer auto-encoder with $D$-dimensional inputs and $H$ hidden units, $r(x) = R\, h(W^T x + b) + c$, where $R, W, b, c$ are the parameters of the model. Then $r(x)$ defines a conservative vector field over a smooth, simply connected domain $K \subseteq \mathbb{R}^D$ if and only if $R D_h W^T$ is symmetric for all $x \in K$, where $D_h = \mathrm{diag}\!\left(h'(W^T x + b)\right)$.

Proof. Following Proposition 1, the auto-encoder defines a conservative vector field if and only if its Jacobian is symmetric for all $x \in K$. Explicitly, the Jacobian is

$$\frac{\partial r(x)}{\partial x} = R D_h W^T, \qquad (5)$$

so symmetry is equivalent to

$$\sum_{l=1}^{H} \left(R_{jl} W_{li} - R_{il} W_{lj}\right) h'_l(x) = 0 \qquad \text{for all } 1 \le i < j \le D. \qquad (6)$$

With $D_h = \mathrm{diag}(h'(x))$, this holds if and only if

$$R D_h W^T = W D_h R^T. \qquad (7)$$

For one-hidden-layer auto-encoders with tied weights, Equation 7 holds regardless of the choice of activation function $h$ and of $x$.

Corollary 1. An auto-encoder with tied weights always defines a conservative vector field.

Proposition 2 shows that the set of all one-layer tied auto-encoders is in fact a subset of the set of all conservative one-layer auto-encoders. Moreover, the inclusion is strict: there are untied conservative auto-encoders that are not trivially equivalent to tied ones. As an example, compare the parametrizations of tied and of conservative untied linear one-layer auto-encoders. By Equation 7, the untied linear auto-encoder $r_{\text{untied}}(x) = R W^T x$ defines a conservative vector field if and only if $R W^T = W R^T$, which offers a richer parametrization than the tied linear auto-encoder $r_{\text{tied}}(x) = W W^T x$. In the following section we explore in more detail the generality of the parametrization imposed by the conditions above.
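The following numpy sketch (ours, with arbitrary small dimensions) checks Proposition 2 and Corollary 1 numerically, and also constructs one concrete untied-but-conservative linear auto-encoder by choosing $R = (W W^T) W$, which is only one of many ways to satisfy $R W^T = W R^T$ without tying the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 8
W = rng.standard_normal((D, H))
b = np.zeros(H)
x = rng.standard_normal(D)

def jac_sigmoid(W, R, b, x):
    # Jacobian R D_h W^T of r(x) = R sigmoid(W^T x + b) + c (Proposition 2).
    s = 1.0 / (1.0 + np.exp(-(W.T @ x + b)))
    return R @ np.diag(s * (1.0 - s)) @ W.T

def asymmetry(J):
    return np.max(np.abs(J - J.T))

# Tied weights: W D_h W^T is symmetric for every x (Corollary 1).
print(asymmetry(jac_sigmoid(W, W, b, x)))                            # ~0 up to round-off

# Generic untied weights: the symmetry condition of Eq. 7 fails.
print(asymmetry(jac_sigmoid(W, rng.standard_normal((D, H)), b, x)))  # clearly nonzero

# A linear untied auto-encoder (h(u) = u, so D_h = I) with R = (W W^T) W gives
# R W^T = (W W^T)^2, which is symmetric: conservative without being tied, so the
# inclusion of tied models in conservative models is strict.
R_cons = (W @ W.T) @ W
print(asymmetry(R_cons @ W.T))                                       # ~0 up to round-off
```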
### Understanding the symmetricity condition

Note that if the Jacobian of an auto-encoder's reconstruction function is symmetric, then the vector field is conservative. A sufficient condition for symmetry of the Jacobian is that $R$ can be written as

$$R = C W D_h E \qquad (8)$$

where $C$ and $E$ are symmetric matrices and $C$ commutes with $W D_h E D_h W^T$, as this ensures symmetry of the partial derivatives:

$$\frac{\partial r(x)}{\partial x} = R D_h W^T = C W D_h E D_h W^T = W D_h E D_h W^T C = W D_h R^T = \left(\frac{\partial r(x)}{\partial x}\right)^{T} \qquad (9)$$

The case of tied weights ($R = W$) follows when $C$ and $E$ are the identity, since then $\frac{\partial r(x)}{\partial x} = R D_h W^T = W D_h W^T$. Notice that $R = C W D_h$ and $R = W D_h E$ are further special cases of the condition $R = C W D_h E$, obtained when $E$ is the identity (first case) or $C$ is the identity (second case). Moreover, we can also recover matrices $E$ and $C$ given the parameters $W$ and $R$, as shown in Section 1.2 of the supplementary material.[3]

[3] www.uoguelph.ca/~imj/files/conservative_ae_supplementary.pdf

## Conservativeness of trained auto-encoders

Following (Alain and Bengio 2014), we first assume that the true data distribution is known and that the auto-encoder has been trained. We then analyze the conservativeness of auto-encoders around fixed points on the data manifold. After that, we proceed to empirically investigate and explain the tendency of trained auto-encoders to become conservative even away from the data manifold. Finally, we use the obtained results to explain why the product of the encoder and decoder weights becomes increasingly symmetric in response to training.

### Local Conservativeness

Let $r(x)$ be an auto-encoder that minimizes a contraction-regularized squared loss averaged over the true data distribution $p$,

$$\int_{\mathbb{R}^d} p(x) \left[\, \|r(x) - x\|_2^2 + \epsilon \left\|\frac{\partial r(x)}{\partial x}\right\|_F^2 \right] dx. \qquad (10)$$

A point $x \in \mathbb{R}^d$ is a fixed point of the auto-encoder if and only if $r(x) = x$.

Proposition 3. Let $r(x)$ be an untied one-layer auto-encoder minimizing Equation 10. Then $r(x)$ is locally conservative as the contraction parameter $\epsilon$ tends to zero.

Taking a first-order Taylor expansion of $r(x)$ around a fixed point $x$ yields

$$r(x + \epsilon) = x + \left(\frac{\partial r(x)}{\partial x}\right)^{T}\! \epsilon + o(\epsilon) \quad \text{as } \epsilon \to 0. \qquad (11)$$

(Alain and Bengio 2014) show that the reconstruction difference $r(x) - x$ becomes an estimator of the score when $\|r(x) - x\|_2^2$ is small and the contraction parameter $\epsilon \to 0$. Hence, around a fixed point we have

$$r(x + \epsilon) - x = \epsilon\, \frac{\partial \log p(x)}{\partial x}, \quad \text{and} \qquad (12)$$

$$\frac{\partial \left(r(x + \epsilon) - x\right)}{\partial x} = \epsilon\, \frac{\partial^2 \log p(x)}{\partial x^2}. \qquad (13)$$

By explicitly expressing the Jacobian of the auto-encoder's dynamics, $\frac{\partial (r(x) - x)}{\partial x}$, and using the Taylor expansion of $r(x)$, we obtain

$$R D_h W^T - I = \epsilon\, \frac{\partial^2 \log p(x)}{\partial x^2}. \qquad (14)$$

Since the Hessian of $\log p(x)$ is symmetric, Equation 14 shows that around fixed points $R D_h W^T$ is symmetric. In conjunction with Proposition 2, this shows that untied auto-encoders, when trained with a contractive regularizer, are locally conservative. Notice that when the auto-encoder is trained on patterns drawn from a continuous family, it forms a continuous attractor that lies near the examples it is trained on (Seung 1998).

It is also worth noting that the dynamics around fixed points can be understood by analyzing the eigenvalues of the Jacobian. The Jacobian being symmetric implies that its eigenvalues cannot have complex parts, which corresponds to the lack of oscillations in a conservative vector field. Moreover, in directions orthogonal to the fixed point, the eigenvalues of the reconstruction Jacobian will be negative; the fixed point is thus actually a sink.
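For concreteness, here is a sketch of how the contraction-regularized objective of Eq. 10 can be evaluated for a one-hidden-layer untied sigmoid auto-encoder, with the expectation over $p(x)$ replaced by an average over a data sample. The squared Frobenius norm of the reconstruction Jacobian $R D_h W^T$ is assumed as the penalty, and all shapes and parameter values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_loss(X, W, R, b, c, eps):
    # Sample average of Eq. 10 for r(x) = R h(W^T x + b) + c with sigmoid h,
    # using the squared Frobenius norm of the Jacobian R D_h W^T as the penalty.
    A = sigmoid(X @ W + b)                               # hidden activations, N x H
    recon = np.sum((A @ R.T + c - X) ** 2, axis=1)       # ||r(x) - x||^2 per sample
    penalty = np.array([np.sum((R @ np.diag(a * (1.0 - a)) @ W.T) ** 2) for a in A])
    return np.mean(recon + eps * penalty)

# Toy usage with random data and parameters (all values hypothetical).
rng = np.random.default_rng(0)
d, H, N = 10, 32, 100
X = rng.standard_normal((N, d))
W = 0.1 * rng.standard_normal((d, H))
R = 0.1 * rng.standard_normal((d, H))
b, c = np.zeros(H), np.zeros(d)
print(contractive_loss(X, W, R, b, c, eps=0.1))
```

In practice one would minimize such an objective with a gradient-based optimizer; the sketch only evaluates it.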
### Empirical Conservativeness

We now empirically analyze the conservativeness of trained untied auto-encoders. To this end, we train untied auto-encoders with 500 hidden units, with and without weight length constraints,[4] on the MNIST dataset. We measure symmetricity using

$$\mathrm{sym}(A) = \frac{\left\|\left(A + A^T\right)/2\right\|_2}{\|A\|_2},$$

which takes values in $[0, 1]$, with 1 representing complete symmetricity.

[4] Weight length constraints: $\|w_i\|_2 = \alpha$ for all $i = 1, \dots, H$, where $\alpha$ is a constant.

Figures 2a and 2c show the evolution of the symmetricity of $\frac{\partial r(x)}{\partial x} = R D_h W^T$ during training. For untied auto-encoders, we observe that the Jacobian becomes increasingly symmetric as training proceeds and hence, by Proposition 2, the auto-encoder becomes increasingly conservative. The contractive auto-encoder tends more towards complete symmetricity than the unregularized one; their symmetricity scores plateau at 0.974 and 0.951, respectively. It is interesting to note that auto-encoders with weight length constraints yield noticeably higher symmetricity scores, as shown in Table 1. The details of the experiments and further interpretations are provided in the supplementary material.

Figure 2: Symmetricity of $\frac{\partial r(x)}{\partial x}$ and $R W^T$ for sigmoid and ReLU activations over training time. Panels: (a) $R D_h W^T$ with sigmoid units, (b) $R W^T$ with sigmoid units, (c) $R D_h W^T$ with ReLU units, (d) $R W^T$ with ReLU units.

Table 1: Symmetricity of $R D_h W^T$ after training AEs with 500 units on MNIST for 100 epochs. Auto-encoders trained with weight length constraints are denoted "+wl".

|     | ReLU  | ReLU+wl | sig.  | sig.+wl |
|-----|-------|---------|-------|---------|
| AE  | 95.9% | 98.7%   | 95.1% | 99.1%   |
| CAE | 95.2% | 98.6%   | 97.4% | 99.1%   |

To explicitly confirm the conservativeness of the auto-encoder in 2D, we monitor the curl of its vector field during training. In our experiments, we created three 2D synthetic datasets by adding Gaussian white noise to the parametrization of a line, a circle, and a spiral. As shown in Figure 3, the curl decreases sharply during training, further demonstrating how untied auto-encoders become more conservative as training proceeds. More results on the line, circle, and spiral datasets can be found in the supplementary material.

Figure 3: Initial (a) and final (b) vector field after training an untied auto-encoder on the spiral dataset; (c) magnitude of the curl during training.

### Symmetricity of weights product

The product of weights $R W^T$ also tends to become increasingly symmetric during training. This behavior is more marked for sigmoid activations than for ReLUs, as shown in Figures 2b and 2d. It can be explained by considering the symmetricity of the Jacobian. After training, it holds that

$$\sum_{l=1}^{H} \left(R_{il} W_{lj} - R_{jl} W_{li}\right) h'_l(x) \approx 0, \qquad 1 \le i, j \le d. \qquad (15)$$

For these sums to vanish at the training points, the terms must cancel either independently of the derivatives $h'_l(x)$ or because those derivatives are roughly constant. As shown in the supplementary material, when training with weight length constraints most hidden unit activities concentrate in the highest-curvature region of the sigmoid, which forces the $h_l(x)$ to lie in that region. This may be due either to $h'_l(x)$ being nearly constant for all $l$ given $x$, or to the $h'_l(x)$ being close to linearly independent. In both cases, the symmetricity of the Jacobian carries over to the weight product, and hence $R W^T \approx W R^T$.
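The two quantities monitored in this section can be reproduced with a few lines of numpy. The sketch below (ours, with arbitrary random parameters) computes the symmetricity score of the Jacobian $R D_h W^T$ and the numerical curl of a 2D auto-encoder field on a grid; the paper does not specify which matrix norm enters $\mathrm{sym}(A)$, so the Frobenius norm is assumed.

```python
import numpy as np

def symmetricity(A):
    # sym(A) = ||(A + A^T)/2|| / ||A||; equals 1 for a symmetric matrix and 0 for
    # an antisymmetric one. Using the Frobenius norm is an assumption made here.
    return np.linalg.norm((A + A.T) / 2) / np.linalg.norm(A)

def curl_2d(field_fn, grid):
    # Numerical curl dF_y/dx - dF_x/dy of a 2D field sampled on a square grid,
    # mirroring the curl monitoring on the synthetic line/circle/spiral data.
    XX, YY = np.meshgrid(grid, grid)
    pts = np.stack([XX, YY], axis=-1)
    F = np.apply_along_axis(field_fn, -1, pts)              # shape (G, G, 2)
    return np.gradient(F[..., 1], grid, axis=1) - np.gradient(F[..., 0], grid, axis=0)

# An untied 2D sigmoid auto-encoder with random (untrained) weights.
rng = np.random.default_rng(0)
d, H = 2, 16
W, R = rng.standard_normal((d, H)), rng.standard_normal((d, H))
b, c = np.zeros(H), np.zeros(d)

def jacobian(x):
    s = 1.0 / (1.0 + np.exp(-(W.T @ x + b)))
    return R @ np.diag(s * (1.0 - s)) @ W.T

def ae_field(x):
    s = 1.0 / (1.0 + np.exp(-(W.T @ x + b)))
    return R @ s + c - x

x0 = rng.standard_normal(d)
print("sym(R D_h W^T):", symmetricity(jacobian(x0)))   # below 1 before training
print("mean |curl|   :", np.abs(curl_2d(ae_field, np.linspace(-2, 2, 50))).mean())
```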
## Decomposing the Vector Field

In this section, we consider finding the conservative vector field that is closest, in a least-squares sense, to a given non-conservative vector field. Finding this field is of great practical importance in many areas of science and engineering (Bhatia et al. 2013). Here we show that conservative auto-encoders can provide a powerful, deep learning based perspective on this problem.

The fundamental theorem of vector calculus, also known as the Helmholtz decomposition, states that any vector field in $\mathbb{R}^3$ can be expressed as the orthogonal sum of an irrotational and a solenoidal field. The Hodge decomposition is a generalization of this result to higher-dimensional spaces (James 1966). A complete statement of the result requires careful analysis of boundary conditions as well as the formalism of differential forms. But since 1-forms correspond to vector fields, and our interest lies in the latter, we abuse notation and state the result in the special case of 1-forms as

$$\omega = d\alpha + \delta\beta + \gamma \qquad (16)$$

where $d$ is the exterior derivative, $\delta$ the co-differential, and $\Delta\gamma = 0$.[5] This means that any 1-form (vector field) can be orthogonally decomposed into a direct sum of a scalar, a solenoidal, and a harmonic component. It is therefore always theoretically possible to obtain the conservative vector field closest, in the least-squares sense, to a non-conservative one. Applied to auto-encoders, this guarantees the existence of a best approximate energy function for any untied auto-encoder, whether or not it is exactly conservative. For a more detailed background on vector field decomposition we refer to the supplementary material.

[5] For the Laplace-de Rham operator, $\Delta = d\delta + \delta d$. The standard $\Delta$ on 1-forms is $d\delta$.

### Extracting the Conservative Vector Field through Learning

Although explicitly computing the projection might be theoretically possible in special cases, we propose to find the best approximate conservative field through learning. There are several advantages to learning the conservative part of a vector field: i) learning the scalar component $\alpha$ of some vector field $\omega$ with an auto-encoder is straightforward, owing to the intrinsic tendency of trained auto-encoders to become conservative; ii) although there is a large body of literature on explicitly computing the projections, these methods are highly sensitive to boundary conditions (Bhatia et al. 2013), whereas learning-based methods eschew this difficulty. The advantage of deep learning based methods over existing approaches, such as matrix-valued radial basis function kernels (Macedo and Castro 2008), is that they can be trained on very large amounts of data. To the best of our knowledge, this is the first application of neural networks to the problem of extracting the conservative part of a vector field (effectively recovering the scalar part of Eq. 16).

### Two-dimensional space

As a proof of concept, we first extract the conservative part of the two-dimensional vector field $F(x, y) = (-x + y, -x - y)$, which corresponds to a spiralling sink. We train an untied auto-encoder with 1000 ReLU units for 500 epochs using BFGS over an equally spaced grid of 100 points in each dimension. Figure 4 shows how the conservative part is perfectly recovered.

Figure 4: Vector field learning by tied (middle) and untied (right) auto-encoders on a 2D non-conservative vector field (left).

### High-dimensional space

We also conducted experiments with high-dimensional vector fields. We created a continuum of vector fields by taking convex combinations of a conservative and a non-conservative field. The former is obtained by training a tied auto-encoder on MNIST and the latter by setting the parameters of an auto-encoder to random values. That is, we set $(W_i, R_i) = (1 - \beta)(W_0, R_0) + \beta (W_K, R_K)$, where $(W_0, R_0)$ is the non-conservative auto-encoder and $(W_K, R_K)$ is the conservative auto-encoder. We repeatedly train a tied auto-encoder on this continuum in order to learn its conservative part. The pseudocode for the experiment is presented in Algorithm 1.

Algorithm 1: Learning to approximate a conservative field with an auto-encoder
1: procedure(D: a data set)
2: Let $(W_0, R_0)$ be random weights for an AE.
3: Let $(W_K, R_K)$ be the weights of an AE trained on D.
4: Generate fields $F_i$, $i = 1 \dots K$, as follows:
   set $(W_i, R_i) = (1 - \beta)(W_0, R_0) + \beta (W_K, R_K)$;
   sample points $x_i$ from a uniform distribution in the data space;
   set $F_i = \{(x_i, r(x_i)) \text{ for } i = 1 \dots N\}$.
5: for each vector field $F_i$ do
6:   Train a tied auto-encoder on $F_i$.
7:   Compute $E(x)$ for $x \in D$.
8:   Compute $E(\tilde{x})$ for corrupted samples $\tilde{x} \sim \text{Binomial}$.
9:   Count the number of times $E(x) > E(\tilde{x})$.
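As a minimal, self-contained illustration of the two-dimensional experiment (and of step 6 of Algorithm 1), the sketch below fits a tied one-hidden-layer auto-encoder to the targets $x + F(x)$ of the spiralling sink on a coarse grid. Plain gradient descent, tanh units, and a hidden layer of 64 units are assumptions made to keep the example short; the paper instead trains 1000 ReLU units with BFGS. Since the tied model can only represent conservative fields, the rotational component of $F$ is discarded and the learned field $r(x) - x$ should resemble the pure sink of Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)

def field(X):
    # Non-conservative 2D field F(x, y) = (-x + y, -x - y): a spiralling sink.
    return np.stack([-X[:, 0] + X[:, 1], -X[:, 0] - X[:, 1]], axis=1)

# Training points on a coarse grid and regression targets r_target(x) = x + F(x).
g = np.linspace(-1.0, 1.0, 30)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
T = X + field(X)

# Tied one-hidden-layer auto-encoder r(x) = W tanh(W^T x + b) + c, fit by plain
# full-batch gradient descent.
H, lr = 64, 0.05
W = 0.1 * rng.standard_normal((2, H))
b, c = np.zeros(H), np.zeros(2)

for step in range(5000):
    A = np.tanh(X @ W + b)             # hidden activations, N x H
    E = A @ W.T + c - T                # reconstruction residuals, N x 2
    G = (E @ W) * (1.0 - A ** 2)       # error back-propagated to the hidden layer
    N = X.shape[0]
    W -= lr * (E.T @ A + X.T @ G) / N  # tied weights: decoder + encoder gradients
    b -= lr * G.mean(axis=0)
    c -= lr * E.mean(axis=0)

# Being tied, the model represents a conservative field, so the rotational part of
# F is discarded; the learned field should roughly point towards the origin.
probe = np.array([0.5, 0.5])
learned = np.tanh(probe @ W + b) @ W.T + c - probe
print("F at probe      :", field(probe[None])[0])
print("learned at probe:", learned)
```

A tied auto-encoder is used here because, by Corollary 1, it is conservative by construction; the untied variant in the paper relies on the training dynamics discussed earlier to become conservative.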
Figure 5 shows the mean squared error as a function of the training epoch for different values of $\beta$. We observe that the auto-encoder's loss decreases as $\beta$ gets closer to 1. This is because the auto-encoder is only able to learn the conservative component of the vector field.

Figure 5: Learning curves for tied (dashed) and untied (solid) auto-encoders.

We then compare the unnormalized model evidence of the auto-encoders. The comparison is based on computing the potential energy of the auto-encoders for two points at a time: a point from MNIST and a version of it corrupted with salt-and-pepper noise. We count the number of times $E(x) > E(x_{\text{rand}})$. Given that the weights $(W_K, R_K)$ of the conservative auto-encoder are obtained by training on MNIST, the potential energy at MNIST data points should be higher than at the corrupted MNIST data points. However, this does not hold exactly for $\beta < 1$. Even for $\beta = 0.6$, we can recover the conservative component of the vector field up to 93%. We thus conclude that the tied auto-encoder is able to learn the conservative component of the vector field.

Table 2: The fraction of observations with $E(x) > E(x_{\text{rand}})$ for different values of $\beta$. "CVF = Tied AE" refers to the conservative vector field $F_K$ learned by the tied auto-encoder, and "CVF = Untied AE" to the conservative vector field $F_K$ learned by the untied auto-encoder.

| $\beta$         | 0.0    | 0.2    | 0.4    | 0.6     | 0.8    | 1.0    |
|-----------------|--------|--------|--------|---------|--------|--------|
| CVF = Tied AE   | 0.5036 | 0.7357 | 0.9338 | 0.98838 | 0.9960 | 0.9968 |
| CVF = Untied AE | 0.5072 | 0.7496 | 0.9373 | 0.98595 | 0.9958 | 0.9968 |

Table 2 shows that, on average, the auto-encoder's potential energy increasingly favors the original MNIST examples over the corrupted ones as the vector field $F_i$ moves from $F_0$ to $F_K$.

## References

Alain, G., and Bengio, Y. 2014. What regularized auto-encoders learn from the data generating distribution. In International Conference on Learning Representations.

Bengio, Y. 2014. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.

Bhatia, H.; Norgard, G.; Pascucci, V.; and Bremer, P.-T. 2013. The Helmholtz-Hodge decomposition: a survey. IEEE Transactions on Visualization and Computer Graphics 19:1386-1404.

Hinton, G. 2010. A practical guide to training restricted Boltzmann machines, version 1. Momentum 9.

Hyvärinen, A. 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6.

James, R. 1966. Advanced Calculus. Belmont, CA: Wadsworth.

Kamyshanska, H. 2013. On autoencoder scoring. In Proceedings of the International Conference on Machine Learning (ICML).

Macedo, I., and Castro, R. 2008. Numerical solution of the Navier-Stokes equations. Technical report.

Seung, H. 1998. Learning continuous attractors in recurrent networks. In Proceedings of Neural Information Processing Systems (NIPS), 654-660.

Swersky, K.; Buchman, D.; Marlin, B.; and de Freitas, N. 2011. On autoencoders and score matching for energy based models. In International Conference on Machine Learning (ICML).
Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML).