# A Free-Energy Principle for Representation Learning

Yansong Gao¹, Pratik Chaudhari¹

¹University of Pennsylvania, USA. Correspondence to: Yansong Gao, Pratik Chaudhari.

**Abstract.** This paper employs a formal connection between machine learning and thermodynamics to characterize the quality of learnt representations for transfer learning. We discuss how information-theoretic functionals such as rate, distortion and classification loss of a model lie on a convex, so-called equilibrium surface. We prescribe dynamical processes to traverse this surface under constraints, e.g., an iso-classification process that trades off rate and distortion to keep the classification loss unchanged. We demonstrate how this process can be used for transferring representations from a source dataset to a target dataset while keeping the classification loss constant. Experimental validation of the theoretical results is provided on standard image-classification datasets.

## 1. Introduction

A representation is a statistic of the data that is *useful*. Classical Information Theory creates a compressed representation so that data are easier to store or transmit; the goal is always to decode the representation and recover the original data. If we are given images and their labels, we could instead learn a representation that is useful for predicting the correct labels. This representation is a statistic of the data sufficient for the task of classification. If it is also minimal, say in its size, it discards information in the data that is not correlated with the labels. Such a representation is unique to the chosen task: it would perform poorly at predicting some other labels correlated with the discarded information. If instead the representation were to retain redundant information about the data, it could potentially predict other labels correlated with this extra information.

The premise of this paper is our desire to characterize the
*Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).*

information discarded in the representation when it is fit on a task. We want to do so in order to learn representations that can be transferred easily to other tasks. Our main idea is to choose a canonical task, which in this paper is reconstruction of the original data, as a way to measure the discarded information. Although one can use any canonical task, reconstruction is special. It is a catch-all task in the sense that achieving perfect reconstruction entails that the representation is lossless; the information discarded by the original task is therefore readily measured as that which helps solve the canonical task. This leads to the study of the following Lagrangian, which is similar to the Information Bottleneck of Tishby et al. (2000):

$$
F(\lambda, \gamma) = \min_{\theta \in \Theta,\; e_\theta(z|x),\, m_\theta(z),\, d_\theta(x|z),\, c_\theta(y|z)} \; R + \lambda D + \gamma C
$$

where the rate $R$ is an upper bound on the mutual information between the representation learnt by the encoder $e_\theta(z|x)$ and the input data $x$, the distortion $D$ measures the quality of the reconstruction by the decoder $d_\theta(x|z)$, and $C$ measures the loss of the classifier $c_\theta(y|z)$. As Alemi & Fischer (2018) show, this Lagrangian can be formally connected to ideas in thermodynamics. We heavily exploit and specialize this point of view, as summarized next.

### 1.1. Summary of contributions

Our main technical observation is that $F(\lambda, \gamma)$ can be interpreted as a free-energy, and a stochastic learning process that minimizes its corresponding Hamiltonian converges to the optimal free-energy. This corresponds to an equilibrium surface of the information-theoretic functionals $R$, $D$ and $C$, and a surface $\Theta_{\lambda,\gamma}$ of the model parameters at convergence. We prove that the equilibrium surface is convex and that its dual, the free-energy $F(\lambda, \gamma)$, is concave.
The free-energy is only a function of the Lagrange multipliers $(\lambda, \gamma)$, the family of model parameters $\Theta$, and the task; it is therefore invariant to the learning dynamics.

Second, we design a quasi-static stochastic process, akin to an equilibrium process in thermodynamics, that keeps the model parameters $\theta$ on the equilibrium surface. Such a process allows us to travel to any feasible values of $(R, D, C)$ while ensuring that the parameters $\theta$ of the model remain on the equilibrium surface. We focus on one process, the iso-classification process, which automatically trades off rate and distortion to keep the classification loss constant.

Third, we prescribe a quasi-static process that allows for a controlled transfer of learnt representations. It adapts the model parameters as the task is changed from some source dataset to a target dataset while keeping the classification loss constant. Such a process is in stark contrast to current techniques in transfer learning, which do not provide any guarantees on the quality of the model on the target dataset. We provide extensive experimental results which realize the theory developed in this paper.

## 2. Theoretical setup

This section introduces notation and preliminaries that form the building blocks of our approach.

### 2.1. Auto-Encoders

Consider an encoder $e(z|x)$ that encodes data $x$ into a latent code $z$ and a decoder $d(x|z)$ that decodes $z$ back into the original data $x$. If the true distribution of the data is $p(x)$, we may define the following functionals:

$$
\begin{aligned}
H &= -\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right], \\
D &= -\mathbb{E}_{x \sim p(x)} \int dz\; e(z|x) \log d(x|z), \\
R &= \mathbb{E}_{x \sim p(x)} \int dz\; e(z|x) \log \frac{e(z|x)}{m(z)}.
\end{aligned}
\qquad (1)
$$

We denote expectation over the data by $\langle \phi \rangle_{p(x)} = \int dx\, p(x)\, \phi$. The first functional $H$ is the Shannon entropy of the true data distribution; it quantifies the complexity of the data. The distortion $D$ measures the quality of the reconstruction through its log-likelihood.
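The functionals above have closed forms for simple variational families. As a minimal sketch (not from the paper), assume a diagonal-Gaussian encoder $e(z|x) = \mathcal{N}(\mu(x), \mathrm{diag}(e^{\log\sigma^2(x)}))$, a standard-normal marginal $m(z) = \mathcal{N}(0, I)$, and a Bernoulli decoder; the function names below are ours:

```python
import numpy as np

def rate_gaussian(mu, log_var):
    """Rate R: average KL(e(z|x) || m(z)) for a diagonal-Gaussian encoder
    N(mu, exp(log_var)) against a standard-normal marginal m(z) = N(0, I).
    Per-dimension KL is 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return float(np.mean(kl))

def distortion_bernoulli(x, x_logits):
    """Distortion D = -E log d(x|z) for a Bernoulli decoder, i.e. the mean
    binary cross-entropy between data x and reconstruction logits."""
    log_p = x * (-np.logaddexp(0.0, -x_logits)) \
        + (1.0 - x) * (-np.logaddexp(0.0, x_logits))
    return float(-np.mean(np.sum(log_p, axis=1)))
```

With a perfect encoder matching the marginal ($\mu = 0$, $\log\sigma^2 = 0$) the rate is zero, and a confident, correct reconstruction drives the distortion towards zero.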
The rate $R$ is a Kullback-Leibler (KL) divergence; it measures the average excess bits used to encode samples from $e(z|x)$ with a code that was built for $m(z)$, our approximation of the true marginal on the latent factors.

### 2.2. Rate-Distortion curve

The functionals in (1) come together to give the inequality

$$ H - D \;\le\; I_e(x; z) \;\le\; R \qquad (2) $$

where $I_e(x;z)$ is the mutual information of the representation with the data under the encoder; the slack in the inequality is the KL-divergence between the learnt encoder $e(z|x)$ and the true (unknown) conditional of the latent factors. The outer inequality $H \le D + R$ forms the basis for a large body of literature on Evidence Lower Bounds (ELBO; see Kingma & Welling (2013)).

Consider Fig. 1a: if the capacity of our candidate distributions $e(z|x)$, $m(z)$ and $d(x|z)$ is infinite, we can obtain the equality $H = R + D$. This is the thick black line in Fig. 1a. For finite-capacity variational families, say parameterized by $\theta$ and denoted by $e_\theta(z|x)$, $d_\theta(x|z)$ and $m_\theta(z)$ respectively, one obtains, as Alemi et al. (2017) argue, a convex RD curve (shown in red in Fig. 1a) corresponding to the Lagrangian

$$ F(\lambda) = \min_{e_\theta(z|x),\, m_\theta(z),\, d_\theta(x|z)} R + \lambda D. \qquad (3) $$

This Lagrangian is a relaxation of the idea that, given a fixed variational family and data distribution $p(x)$, there exists an optimal value of, say, the rate $R = f(D)$ that best sandwiches (2). The optimal Lagrange multiplier is $\lambda = -\partial R / \partial D$ evaluated at the desired value of $D$.

Figure 1. Schematic of the equilibrium surface. Fig. 1a shows that rate ($R$) and distortion ($D$) trade off against each other on the equilibrium surface. Similarly, in Fig. 1b the equilibrium surface is a convex constraint that joins rate, distortion and the classification loss. Training objectives with different $(\lambda, \gamma)$ (shown in red and blue) reach different parts of the equilibrium surface.

### 2.3. Incorporating the classification loss

Let us create a classifier that uses the learnt representation $z$ as its input, and set the classification loss to the negative log-likelihood of the prediction:

$$ C = -\mathbb{E}_{x \sim p(x)} \int dz\; e(z|x) \log c(y|z). \qquad (4) $$

If the parameters of the model, which now consists of the encoder $e(z|x)$, decoder $d(x|z)$ and classifier $c(y|z)$, are denoted by $\theta$, the training process induces a distribution $p(\theta\,|\,\{(x, y)\})$, where $\{(x, y)\}$ denotes a finite dataset. In addition to $R$, $D$ and $C$, the authors in Alemi & Fischer (2018) define

$$ S = \int d\theta\; p(\theta\,|\,\{(x,y)\}) \log \frac{p(\theta\,|\,\{(x,y)\})}{m(\theta)} \qquad (5) $$

which is the relative entropy of the distribution on parameters $\theta$ after training with respect to a prior distribution $m(\theta)$ of our choosing. By a very similar argument as in Section 2.2, the four functionals $R$, $D$, $C$ and $S$ form a convex three-dimensional surface in the RDCS phase space. A schematic is shown in Fig. 1b for $\sigma = 0$. We can again consider a Lagrange relaxation of this surface, given by

$$ F(\lambda, \gamma, \sigma) = \min_{e(z|x),\, m(z),\, d(x|z),\, c(y|z)} R + \lambda D + \gamma C + \sigma S. \qquad (6) $$

**Remark 1 (The First Law of learning).** Alemi & Fischer (2018) draw formal connections between the Lagrangian in (6) and the theory of thermodynamics. Just as the first law of thermodynamics is a statement about the conservation of energy in physical processes, the fact that the four functionals are tied together in a smooth constraint $f(R, D, C, S) = 0$ leads to an equation of the form

$$ dR = -\lambda\, dD - \gamma\, dC - \sigma\, dS \qquad (7) $$

which indicates that information in learning processes is conserved. The information in the latent representation $z$ is kept either to reconstruct the original data or to predict the labels; the former is captured by the encoder-decoder pair, the latter by the classifier.

**Remark 2 (Setting $\sigma = 0$).** The distribution $p(\theta\,|\,\{(x, y)\})$ is a posterior on the parameters of the model given the dataset. While this distribution is well-defined under minor technical conditions, e.g., ergodicity, performing computations with it is difficult. We therefore only consider the case $\sigma = 0$ in the sequel and leave the general case for future work.
The following lemma (proved in Appendix B) shows that the constraint surface connecting the information-theoretic functionals $R$, $D$ and $C$ is convex and that its dual, the Lagrangian $F(\lambda, \gamma)$, is concave.

**Lemma 3 (The RDC constraint surface is convex).** The constraint surface $f(R, D, C) = 0$ is convex and the Lagrangian $F(\lambda, \gamma)$ is concave.

A similar proof shows that the entire surface joining $R$, $D$, $C$ and $S$ is convex, by considering the cases $\lambda = 0$ and $\gamma = 0$ separately. Note that the constraint is convex in $R$, $D$ and $C$; it need not be convex in the model parameters $\theta$ that parameterize $e_\theta(z|x)$, $m_\theta(z)$, etc.

### 2.4. Equilibrium surface of optimal free-energy

We next elaborate upon the objective in (6). Consider the functionals $R$, $D$ and $C$ parameterized by $\theta \in \Theta \subset \mathbb{R}^N$. First, consider the problem

$$ F(\lambda, \gamma) = \min_{e(z|x),\; \theta \in \Theta} R + \lambda D + \gamma C. \qquad (8) $$

We can solve this using the calculus of variations to get

$$ e(z|x) \propto m_\theta(z)\, d_\theta(x|z)^\lambda \exp\left( \gamma \int dy\; p(y|x) \log c_\theta(y|z) \right). $$

We assume in this paper that the labels are a deterministic function of the data, i.e., $p(y|x) = \delta(y - y_x)$ where $y_x$ is the true label of the datum $x$. We therefore have $e(z|x) = m_\theta(z)\, d_\theta(x|z)^\lambda\, c_\theta(y_x|z)^\gamma / Z_{\theta,x}$, where the normalization constant is

$$ Z_{\theta,x} = \int dz\; m_\theta(z)\, d_\theta(x|z)^\lambda\, c_\theta(y_x|z)^\gamma. \qquad (9) $$

The objective $F(\lambda, \gamma)$ can now be rewritten as maximizing the log-partition function, also known as the free-energy in statistical physics (Mezard & Montanari, 2009):

$$ F(\lambda, \gamma) = \min_{\theta \in \Theta}\; -\big\langle \log Z_{\theta,x} \big\rangle_{p(x)}. \qquad (10) $$

**Remark 4 (Why is it called the equilibrium surface?).** Given a finite dataset $\{(x, y)\}$, one may minimize the objective in (8) using stochastic gradient descent (SGD; Robbins & Monro (1951)) on a Hamiltonian

$$ H(z;\, x, \theta, \lambda, \gamma) = -\log m_\theta(z) - \lambda \log d_\theta(x|z) - \gamma \log c_\theta(y|z) \qquad (11) $$

with updates given by

$$ \theta_{k+1} = \theta_k - \sigma\, \nabla_\theta\, \mathbb{E}_{x \sim p(x)} \int dz\; e_{\theta_k}(z|x)\, H(z;\, x, \theta_k, \lambda, \gamma) \qquad (12) $$

where $\sigma > 0$ is the step-size; the gradient $\nabla_\theta$ is evaluated over samples from $p(x)$ and $e_\theta(z|x)$.
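For concreteness, the per-sample Hamiltonian in (11) is just a weighted sum of negative log-likelihoods; a minimal sketch (the function name is ours):

```python
def hamiltonian(log_m_z, log_d_xz, log_c_yz, lam, gam):
    """H(z; x, theta, lam, gam) = -log m(z) - lam*log d(x|z) - gam*log c(y|z),
    as in Eq. (11). Averaging over z ~ e(z|x) and x ~ p(x) yields the loss
    R + lam*D + gam*C minimized by the SGD updates in (12)."""
    return -log_m_z - lam * log_d_xz - gam * log_c_yz
```

In practice the three log-likelihoods come from the marginal, decoder and classifier heads of the network; the multipliers $(\lambda, \gamma)$ move the converged model along the equilibrium surface.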
Using the same technique as that of Chaudhari & Soatto (2017), one can show that the objective

$$ \mathbb{E}_{\theta \sim p(\theta|\{x,y\})}\left[ -\big\langle \log Z_{\theta,x} \big\rangle_{p(x)} \right] - \sigma\, H\big(p(\theta\,|\,\{x, y\})\big) $$

decreases monotonically, where $H(\cdot)$ here denotes the Shannon entropy. Observe that our objective in (8) corresponds to the limit $\sigma \to 0$ of this objective, along with a uniform non-informative prior $m(\theta)$ in (5). In fact, this result is analogous to the classical result that an ergodic Markov chain makes monotonic improvements in the KL-divergence as it converges to its steady-state, also known as equilibrium, distribution (Levin & Peres, 2017). The posterior distribution of the model parameters induced by the stochastic updates in (12) is the Gibbs distribution

$$ p(\theta\,|\,\{(x, y)\}) \propto \exp\big( -2(R + \lambda D + \gamma C)/\sigma \big). $$

It is for the above reason that we call the surface in Fig. 1b, parameterized by

$$ \Theta_{\lambda,\gamma} = \Big\{ \theta \in \Theta :\; -\big\langle \log Z_{\theta,x} \big\rangle_{p(x)} = F(\lambda, \gamma) \Big\} \qquad (13) $$

the *equilibrium surface*. Learning, in this case minimizing (8), is initialized outside this surface and converges to specific parts of the equilibrium surface depending upon $(\lambda, \gamma)$; this is denoted by the red and blue curves in Fig. 1b. The constraint that results in this equilibrium surface is that variational inequalities such as (2) (more are given in Alemi & Fischer (2018)) are tight up to the capacity of the model. This is analogous to the concept of equilibrium in thermodynamics (Sethna, 2006).

## 3. Dynamical processes on the equilibrium surface

This section prescribes dynamical processes that explore the equilibrium surface. For any parameters $\theta \in \Theta$, not necessarily on the equilibrium surface, let us define

$$ J(\theta, \lambda, \gamma) = -\big\langle \log Z_{\theta,x} \big\rangle_{p(x)}. \qquad (14) $$

If $\theta \in \Theta_{\lambda,\gamma}$ we have $J(\theta, \lambda, \gamma) = F(\lambda, \gamma)$, which implies

$$ \nabla_\theta J(\theta, \lambda, \gamma) = 0 \quad \text{for all } \theta \in \Theta_{\lambda,\gamma}. \qquad (15) $$

**Quasi-static process.** A quasi-static process in thermodynamics happens slowly enough for a system to remain in equilibrium with its surroundings.
In our case, we are interested in evolving the Lagrange multipliers $(\lambda, \gamma)$ slowly while simultaneously keeping the model parameters $\theta$ on the equilibrium surface; the constraint (15) thus holds at each time instant. The equilibrium surface is parameterized by $R$, $D$ and $C$, so changing $(\lambda, \gamma)$ adapts the three functionals to track their optimal values corresponding to $F(\lambda, \gamma)$. Let us choose some rates $(\dot\lambda, \dot\gamma)$ and the trivial dynamics $\frac{d\lambda}{dt} = \dot\lambda$ and $\frac{d\gamma}{dt} = \dot\gamma$. The quasi-static constraint leads to the following partial differential equation (PDE)

$$ 0 = \frac{d}{dt} \nabla_\theta J(\theta, \lambda, \gamma) = \nabla^2_\theta J\; \dot\theta + \partial_\lambda \nabla_\theta J\; \dot\lambda + \partial_\gamma \nabla_\theta J\; \dot\gamma \qquad (16) $$

valid for all $\theta \in \Theta_{\lambda,\gamma}$. At each location $\theta \in \Theta_{\lambda,\gamma}$ the above PDE indicates how the parameters should evolve upon changing the Lagrange multipliers $(\lambda, \gamma)$. We can rewrite the PDE using the Hamiltonian $H$ in (11), as shown next.

**Lemma 5 (Equilibrium dynamics for parameters $\theta$).** Given $(\dot\lambda, \dot\gamma)$, the parameters $\theta \in \Theta_{\lambda,\gamma}$ evolve as

$$ \dot\theta = A^{-1} b_\lambda\, \dot\lambda + A^{-1} b_\gamma\, \dot\gamma = \theta_\lambda\, \dot\lambda + \theta_\gamma\, \dot\gamma \qquad (17) $$

where $H$ is the Hamiltonian in (11),

$$ A = \mathbb{E}_{x \sim p(x)}\Big[ \big\langle \nabla^2_\theta H \big\rangle + \big\langle \nabla_\theta H \big\rangle \big\langle \nabla_\theta H \big\rangle^\top - \big\langle \nabla_\theta H\, \nabla_\theta H^\top \big\rangle \Big], $$

and $b_\lambda$, $b_\gamma$ are the analogous vectors obtained by differentiating $\nabla_\theta J$ with respect to $\lambda$ and $\gamma$ respectively (full expressions are given in Appendix C). All the inner expectations above are taken with respect to the Gibbs measure of the Hamiltonian, i.e.,

$$ \langle \phi \rangle = \frac{\int \phi\, e^{-H(z)}\, dz}{\int e^{-H(z)}\, dz}. $$

The dynamics of the parameters $\theta$ is therefore a function of the two directional derivatives

$$ \theta_\lambda = A^{-1} b_\lambda, \quad \text{and} \quad \theta_\gamma = A^{-1} b_\gamma \qquad (18) $$

with respect to $\lambda$ and $\gamma$. Note that $A$ in (17) is the Hessian of a strictly convex functional.

This lemma allows us to implement dynamical processes for the model parameters $\theta$ on the equilibrium surface. As expected, (17) is an ordinary differential equation that depends on our chosen evolution of $(\dot\lambda, \dot\gamma)$ through the directional derivatives $\theta_\lambda$, $\theta_\gamma$. The utility of the lemma therefore lies in the expressions for these directional derivatives. Appendix C gives the proof of the lemma.

**Remark 6 (Implementing the equilibrium dynamics).** The equations in Lemma 5 may seem complicated to compute, but observe that they can be readily estimated using samples from the dataset $x \sim p(x)$ and those from the encoder $z \sim e_\theta(z|x)$.
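Solving $A\,\theta_\lambda = b_\lambda$ never requires forming the matrix $A$ explicitly; only products of a Hessian with a vector are needed. A minimal sketch using a central finite-difference approximation in place of Pearlmutter's exact method (our simplification, useful when only a gradient oracle is available):

```python
import numpy as np

def hvp_fd(grad_fn, theta, v, eps=1e-4):
    """Approximate the Hessian-vector product (grad^2 J) v at theta by the
    central difference (gradJ(theta + eps*v) - gradJ(theta - eps*v)) / (2*eps).
    Exact when grad_fn is linear, i.e. when J is quadratic."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)
```

Directional derivatives such as $\theta_\lambda = A^{-1} b_\lambda$ can then be obtained with a matrix-free linear solver, e.g. conjugate gradients, that only ever calls `hvp_fd`.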
The key difference between (17) and, say, the ELBO objective is that the gradient in the former depends upon the Hessian of the Hamiltonian $H$. These equations can be implemented using Hessian-vector products (Pearlmutter, 1994). If the dynamics involves certain constraints among the functionals, as Remark 7 shows, the implementation of these equations simplifies.

### 3.1. Iso-classification process

An iso-thermal process in thermodynamics is a quasi-static process in which a system exchanges energy with its surroundings while remaining in thermal equilibrium with them. We now analogously define an *iso-classification* process that adapts the parameters $\theta$ of the model while the free-energy is subject to slow changes in $(\lambda, \gamma)$. This adaptation is such that the classification loss is kept constant while the rate and distortion change automatically.

Following the development in Lemma 5, it is easy to create an iso-classification process. We simply add a constraint:

$$ \frac{d}{dt} \nabla_\theta J = 0 \;\; \text{(Quasi-Static Condition)}, \qquad \frac{dC}{dt} = 0 \;\; \text{(Iso-classification Condition)}. \qquad (19) $$

Using a very similar computation (given in Appendix D) as that in the proof of Lemma 5, this leads to the constrained dynamics

$$ 0 = C_\lambda\, \dot\lambda + C_\gamma\, \dot\gamma, \qquad \dot\theta = \theta_\lambda\, \dot\lambda + \theta_\gamma\, \dot\gamma. \qquad (20) $$

The quantities $C_\lambda$ and $C_\gamma$ are given by

$$
\begin{aligned}
C_\lambda &= \mathbb{E}_{x \sim p(x)}\Big[ \partial_\lambda \langle \ell \rangle + \theta_\lambda^\top \big\langle \nabla_\theta H \big\rangle \langle \ell \rangle - \big\langle \ell\; \theta_\lambda^\top \nabla_\theta H \big\rangle + \big\langle \theta_\lambda^\top \nabla_\theta \ell \big\rangle \Big], \\
C_\gamma &= \mathbb{E}_{x \sim p(x)}\Big[ \partial_\gamma \langle \ell \rangle + \theta_\gamma^\top \big\langle \nabla_\theta H \big\rangle \langle \ell \rangle - \big\langle \ell\; \theta_\gamma^\top \nabla_\theta H \big\rangle + \big\langle \theta_\gamma^\top \nabla_\theta \ell \big\rangle \Big],
\end{aligned}
\qquad (21)
$$

where $\ell = -\log c_\theta(y_x|z)$ is the per-sample classification loss. Observe that we are no longer free to pick arbitrary values for $(\dot\lambda, \dot\gamma)$ in the iso-classification process; the constraint $\frac{dC}{dt} = 0$ ties the two rates together.

**Remark 7 (Implementing an iso-classification process).** The first constraint in (20) allows us to choose

$$ \dot\lambda = -\alpha\, C_\gamma = -\alpha\, \frac{\partial^2 F}{\partial \gamma^2}, \qquad \dot\gamma = \alpha\, C_\lambda = \alpha\, \frac{\partial^2 F}{\partial \lambda\, \partial \gamma} \qquad (22) $$

where $\alpha$ is a parameter that scales time. The second equalities in both expressions follow because $F(\lambda, \gamma)$ is the optimal free-energy, which implies relations such as $D = \partial F / \partial \lambda$ and $C = \partial F / \partial \gamma$.
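Remark 7 can be turned into a simple update rule: estimate the sensitivities of the classification loss by central finite differences and step $(\lambda, \gamma)$ along the null direction of the constraint in (20). A minimal sketch; `C_fn` is a hypothetical oracle returning $C$ for a model equilibrated at $(\lambda, \gamma)$:

```python
def iso_classification_step(C_fn, lam, gam, d_lam, eps=1e-3):
    """One Euler step of the iso-classification constraint
    0 = C_lam * d_lam + C_gam * d_gam. The partial derivatives of the
    classification loss C are estimated by central finite differences,
    then d_gam is chosen so that C stays constant to first order."""
    C_lam = (C_fn(lam + eps, gam) - C_fn(lam - eps, gam)) / (2.0 * eps)
    C_gam = (C_fn(lam, gam + eps) - C_fn(lam, gam - eps)) / (2.0 * eps)
    d_gam = -(C_lam / C_gam) * d_lam
    return lam + d_lam, gam + d_gam
```

Each call to `C_fn` hides an equilibriation phase (re-optimizing $\theta$ at the perturbed multipliers), which is where the computational cost lies.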
We can now compute the two derivatives in (22) using finite differences to implement an iso-classification process. This is equivalent to running the dynamics in (20) with finite-difference approximations of the required partial derivatives. While approximating all these quantities at each update of $\theta$ would be cumbersome, exploiting the relations in (20) is efficient even for large neural networks, as our experiments show.

**Remark 8 (Other dynamical processes of interest).** In this paper we focus on iso-classification processes. However, following the same program as that of this section, we can also define other processes of interest, e.g., one that keeps $C + \beta^{-1} R$ constant while fine-tuning a model. This is similar to the alternative Information Bottleneck of Achille & Soatto (2017), wherein the rate is defined using the weights of the network as the random variable instead of the latent factors $z$. This quantity is also easily seen to be the right-hand side of the PAC-Bayes generalization bound (McAllester, 2013). A dynamical process that preserves this functional would be able to control the generalization error, which is an interesting prospect for future work.

## 4. Transferring representations to new tasks

Section 3 demonstrated dynamical processes in which the Lagrange multipliers $\lambda, \gamma$ change with time and the process adapts the model parameters $\theta$ so as to remain on the equilibrium surface. This section demonstrates the same concept under a different kind of perturbation, namely one where the underlying task changes. The prototypical example to keep in mind in this section is transfer learning, where a classifier trained on a dataset $p_s(x, y)$ is further trained on a new dataset, say $p_t(x, y)$. We will assume that the input domain of the two distributions is the same.

### 4.1. Changing the data distribution

If i.i.d. samples from the source task are denoted by $X_s = \{x^s_1, \ldots, x^s_{n_s}\}$ and those of the target distribution by $X_t = \{x^t_1, \ldots, x^t_{n_t}\}$, the empirical source and target distributions can be written as

$$ p_s(x) = \frac{1}{n_s} \sum_{i=1}^{n_s} \delta(x - x^s_i), \quad \text{and} \quad p_t(x) = \frac{1}{n_t} \sum_{i=1}^{n_t} \delta(x - x^t_i) $$

respectively; here $\delta(x - x')$ is a Dirac delta distribution at $x'$. We will consider a transport problem that transports the source distribution $p_s(x)$ to the target distribution $p_t(x)$. For any $t \in [0, 1]$ we interpolate between the two distributions using a mixture

$$ p(x, t) = (1 - t)\, p_s(x) + t\, p_t(x). \qquad (23) $$

Observe that the interpolated data distribution equals the source and target distributions at $t = 0$ and $t = 1$ respectively, and is a mixture of the two at intermediate times. We keep the labels of the data the same and do not interpolate them. As discussed in Appendix F, we can also use techniques from optimal transportation (Villani, 2008) to obtain a better transport; the same dynamical equations given below remain valid in that case.

### 4.2. Iso-classification process with a changing data distribution

The equilibrium surface $\Theta_{\lambda,\gamma}$ in Fig. 1b is a function of the task and evolves along with it. We now give a dynamical process that keeps the model parameters in equilibrium as the task evolves quasi-statically. We again impose the same conditions on the dynamics as those in (19). The following lemma is analogous to Lemma 5.

**Lemma 9 (Dynamical process for a changing data distribution).** Given $(\dot\lambda, \dot\gamma)$, the evolution of the model parameters $\theta$ for the changing data distribution (23) is

$$ \dot\theta = \theta_\lambda\, \dot\lambda + \theta_\gamma\, \dot\gamma + \theta_t, \qquad (24) $$

$$ \theta_t = A^{-1} b_t := A^{-1} \int \frac{\partial p(x, t)}{\partial t}\, \big\langle \nabla_\theta H \big\rangle\, dx, \qquad (25) $$

and the other quantities are as defined in Lemma 5, with the only change that expectations over the data $x$ are taken with respect to $p(x, t)$ instead of $p(x)$.

The additional term $\theta_t$ arises because the data distribution changes with time.
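Sampling mini-batches from the interpolated distribution (23) is straightforward: each example comes from the target set with probability $t$ and from the source set otherwise. A minimal sketch (our helper, not from the paper):

```python
import numpy as np

def sample_interpolated(Xs, Xt, t, n, rng=None):
    """Draw n samples from p(x, t) = (1-t) ps(x) + t pt(x): with
    probability 1-t pick a uniformly random source example, otherwise
    a uniformly random target example."""
    rng = np.random.default_rng(rng)
    pick_target = rng.random(n) < t          # Bernoulli(t) per sample
    src = Xs[rng.integers(0, len(Xs), n)]    # uniform draws from the source
    tgt = Xt[rng.integers(0, len(Xt), n)]    # uniform draws from the target
    return np.where(pick_target[:, None], tgt, src)
```

At $t = 0$ every mini-batch is purely source data and at $t = 1$ purely target data, matching the endpoints of the interpolation.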
A similar computation as that of Section 3.1 gives a quasi-static iso-classification process as the task evolves:

$$ \dot\theta = \theta_\lambda\, \dot\lambda + \theta_\gamma\, \dot\gamma + \theta_t, \qquad 0 = C_\lambda\, \dot\lambda + C_\gamma\, \dot\gamma + C_t, \qquad (26) $$

where $C_\lambda$ and $C_\gamma$ are as given in (21), with the only change that the outer expectation is taken with respect to $x \sim p(x, t)$. The new term that depends on time $t$ is

$$ C_t = \int \frac{\partial p(x, t)}{\partial t}\, \langle \ell \rangle\, dx + \mathbb{E}_{x \sim p(x,t)}\Big[ \theta_t^\top \big\langle \nabla_\theta H \big\rangle \langle \ell \rangle - \big\langle \ell\; \theta_t^\top \nabla_\theta H \big\rangle + \big\langle \theta_t^\top \nabla_\theta \ell \big\rangle \Big] \qquad (27) $$

with $\ell = -\log c_\theta(y_x|z)$. Eliminating $\dot\gamma$ using the constraint in (26) finally gives

$$ \dot\theta = \Big( \theta_\lambda - \frac{C_\lambda}{C_\gamma}\, \theta_\gamma \Big)\, \dot\lambda + \Big( \theta_t - \frac{C_t}{C_\gamma}\, \theta_\gamma \Big) =: \hat\theta_\lambda\, \dot\lambda + \hat\theta_t. \qquad (28) $$

This indicates that $\theta = \theta(\lambda, t)$ is a surface parameterized by $\lambda$ and $t$, equipped with a basis of the tangent plane $(\hat\theta_\lambda, \hat\theta_t)$.

### 4.3. Geodesic transfer of representations

The dynamics of Lemma 9 is valid for any $(\dot\lambda, \dot\gamma)$. We provide a locally optimal way to change $(\lambda, \gamma)$ in this section.

**Remark 10 (Rate-distortion tradeoff).** Note that

$$ \dot C = 0, \qquad \dot D = -\alpha\, \det\big(\mathrm{Hess}(F)\big), \qquad \dot R = -\lambda\, \dot D. \qquad (29) $$

The first equality is simply our iso-classification constraint. For $\alpha > 0$, the second indicates that $\dot D < 0$ using Lemma 3, which shows that $\mathrm{Hess}(F) \preceq 0$; this also gives $\dot\lambda > 0$ in (22). The third equality is a powerful observation: it indicates a trade-off between rate and distortion, for if $\dot D < 0$ we have $\dot R > 0$. It also reveals the geometric structure of the equilibrium surface by connecting $R$ and $D$ together, which we exploit next.

Computing the functionals $R$, $D$ and $C$ during the iso-classification transfer traces a curve in RDC space. Geodesic transfer means that the functionals $R$, $D$ follow the shortest path in this space. Notice that if we assume the model capacity is infinite, the RDC space is Euclidean and the geodesic is therefore simply a straight line. Since we keep the classification loss constant during the transfer, $\dot C = 0$, a straight line implies that the slope $dD/dR$ is a constant, say $k$. Thus $\dot D = k\, \dot R$. Observe that

$$ \dot R = \frac{\partial R}{\partial D}\, \dot D + \frac{\partial R}{\partial t} = -\lambda\, \dot D + \frac{\partial R}{\partial t}. $$

Combining the iso-classification constraint with the fact that $\dot D = k\, \dot R = -k\lambda\, \dot D + k\, \frac{\partial R}{\partial t}$ gives us a linear system in the rates $(\dot\lambda, \dot\gamma)$. We solve this system to update $(\lambda, \gamma)$ during the transfer.
## 5. Experimental validation

This section presents experimental validation of the ideas in this paper. We first implement the dynamics of Section 3 that traverses the equilibrium surface, and then demonstrate the dynamical process for transfer learning devised in Section 4.

**Setup.** We use the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets for our experiments. We use a 2-layer fully-connected network (same as that of Kingma & Welling (2013)) as the encoder and decoder for MNIST; the encoder for CIFAR-10 is a ResNet-18 (He et al., 2016) architecture while the decoder is a 4-layer deconvolutional network (Noh et al., 2015). Full details of the pre-processing, network architectures and training are provided in Appendix A.

### 5.1. Iso-classification process on the equilibrium surface

This experiment demonstrates the iso-classification process of Remark 7. As discussed in Remark 4, training a model to minimize the functional $R + \lambda D + \gamma C$ decreases the free-energy monotonically.

**Details.** Given a value of the Lagrange multipliers $(\lambda, \gamma)$, we first find a model on the equilibrium surface by training from scratch for 120 epochs with the Adam optimizer (Kingma & Ba, 2014); the learning rate is set to $10^{-3}$ and drops by a factor of 10 every 50 epochs. We then run the iso-classification process of Remark 7 for these models as follows. We modify $(\lambda, \gamma)$ according to the equations

$$ \dot\lambda = -\alpha\, C_\gamma, \qquad \dot\gamma = \alpha\, C_\lambda. \qquad (31) $$

Changes in $(\lambda, \gamma)$ cause the equilibrium surface to change, so it is necessary to adapt the model parameters $\theta$ to keep them on the dynamically changing surface; let us call this process of adaptation *equilibriation*. We achieve this by taking gradient-based updates to minimize $J(\lambda, \gamma)$ with a
learning rate schedule that has a sharp, quick increase from zero followed by a slow annealing back to zero:

$$ \eta(t) = (t/T)^2\, (1 - t/T)^5 $$

where $t$ is the number of mini-batch updates taken since the last change in $(\lambda, \gamma)$ and $T$ is the total number of mini-batch updates of equilibriation.

Figure 2. Iso-classification process for MNIST. We run 5 different experiments for initial Lagrange multipliers given by $\lambda = 0.25$ and $\gamma \in \{4, 6, 8, 10, 15\}$. During each experiment, we modify these Lagrange multipliers (Fig. 2b) to keep the classification loss constant and plot the rate-distortion curve (Fig. 2a) along with the validation loss (Fig. 2c). The validation accuracy is constant for each experiment; it lies between 92-98% for these initial values of $(\lambda, \gamma)$. Similarly, the training loss is almost unchanged during each experiment and takes values between 0.06-0.2 for different values of $(\lambda, \gamma)$.

Figure 3. Iso-classification process for CIFAR-10. We run 4 different experiments for initial Lagrange multipliers $\lambda = 0.5$ and $\gamma \in \{5, 10, 15, 20\}$. During each experiment, we modify the Lagrange multipliers (Fig. 3b) to keep the classification loss constant and plot the rate-distortion curve (Fig. 3a) along with the validation accuracy (Fig. 3c). The validation loss is constant during each experiment; it takes values between 0.5-0.8 for these initial values of $(\lambda, \gamma)$. Similarly, the training loss is constant and takes values between 0.02-0.09. Observe that the rate-distortion curve in Fig. 3a is much flatter than the one in Fig. 2a, which indicates that the model family $\Theta$ for CIFAR-10 is much more powerful; this corresponds to the straight line in the RD curve for infinite model capacity shown in Fig. 1a.
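The equilibriation schedule can be written down directly; a minimal sketch in which, as our reading of how the stated maximum learning rate is enforced, the curve is rescaled so its peak (at $t/T = 2/7$) equals a chosen `eta_max`:

```python
def equilibriation_lr(t, T, eta_max=1.5e-3):
    """Schedule eta(t) proportional to (t/T)^2 * (1 - t/T)^5: a sharp rise
    from zero followed by slow annealing back to zero. The unnormalized
    curve peaks at t/T = 2/7; rescale so the peak equals eta_max."""
    s = t / T
    peak = (2.0 / 7.0) ** 2 * (5.0 / 7.0) ** 5
    return eta_max * s**2 * (1.0 - s) ** 5 / peak
```

The schedule is zero at both endpoints, so each equilibriation phase starts and ends with vanishing updates, mimicking a quasi-static change.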
The maximum value of the learning rate is set to $1.5 \times 10^{-3}$. The free-energy should be unchanged if the model parameters are on the equilibrium surface after equilibriation; this is shown in Fig. 4a. The partial derivatives in (31) are computed using finite differences.

Fig. 2 shows the result of the iso-classification process for MNIST and Fig. 3 shows a similar result for CIFAR-10. We can see that the classification loss remains constant throughout the process. This experiment shows that we can implement an iso-classification process while keeping the model parameters on the equilibrium surface throughout.

### 5.2. Transfer learning between two subsets of MNIST

We next present experimental results of an iso-classification process for transferring the learnt representation. We pick the source dataset to be all images corresponding to digits 0-4 in MNIST; the target dataset is its complement, images of digits 5-9. Our goal is to adapt a model trained on the source task to the target task while keeping its classification loss constant. We run the geodesic transfer dynamics from Section 4.3 and the results are shown in Fig. 5. It is evident that the classification accuracy is constant throughout the transfer and is also the same as that of training from scratch on the target. MNIST is a simple dataset and the accuracy gap between iso-classification transfer, fine-tuning from the source and training from scratch is minor. The benefit of running the iso-classification transfer, however, is that we can be guaranteed about the final accuracy of the model.

Figure 4. Variation of the free-energy $F(\lambda, \gamma)$ across the equilibriation and iso-classification processes. Fig. 4a shows the free-energy during equilibriation between small changes of $(\lambda, \gamma)$.
The initial and final values of the Lagrange multipliers are $(0.5, 1)$ and $(0.51, 1.04)$ respectively, and the free-energy is about the same for these values. Fig. 4b shows the free-energy as $(\lambda, \gamma)$ undergo a large change from their initial value of $(0.25, 4)$ to $(3.5, 26)$ during the iso-classification process in Fig. 2. Since the rate and distortion change substantially (Fig. 2a), the free-energy also changes substantially even though $C$ is constant (Fig. 2c). "Number of steps" in Fig. 4b refers to the number of steps of running (31).

Figure 5. Transferring from the source dataset of MNIST digits 0-4 to the target dataset consisting of digits 5-9. Fig. 5a shows the variation of rate and distortion during the transfer; as discussed in Section 4.3 we maintain a constant $dR/dD$ during the transfer; the rate decreases and the distortion increases. Fig. 5b shows the validation accuracy during the transfer. The orange curve corresponds to geodesic iso-classification transfer; the blue curve is the result of directly fine-tuning the source model on the target data (note the very low accuracy at the start); the green point is the accuracy of training on the target task from scratch.

The gap between these three methods turns out to be significant for the more complex datasets in the following section.

### 5.3. Transfer learning between two subsets of CIFAR-10

The iso-classification process is a quasi-static process, i.e., the model parameters $\theta$ lie on the equilibrium surface at all times $t \in [0, 1]$ during the transfer. Note that both the equilibrium surface and the free-energy $F(\lambda, \gamma)$ are functions of the data and change with time. Let us write this explicitly as

$$ F(t) := R(t, \lambda(t), \gamma(t)) + \lambda\, D(t, \lambda(t), \gamma(t)) + \gamma\, C_0 $$

where $C_0$ is the classification loss.
We prescribed a geodesic transfer above, where the Lagrange multipliers $\lambda, \gamma$ were adapted simultaneously to conform to the constraints of the equilibrium surface locally. We can also adapt them using the following heuristic: we let $\dot\lambda = k$ for some constant $k$ and use

$$ \frac{dC}{dt} = 0 \qquad (32) $$

to obtain the evolution curve of $\gamma(t)$.

Here we present experimental results of an iso-classification process for transferring the learnt representation. We pick the source dataset to be all vehicles (airplane, automobile, ship and truck) in CIFAR-10, and the target dataset consists of four animals (bird, cat, deer and dog). We set the output size of the classifier to four. Our goal is to adapt a model trained on the source task to the target task while keeping its classification loss constant. We run the iso-classification transfer dynamics (32) and the results are shown in Fig. 6.

Figure 6. Transferring from the source dataset of CIFAR-10 vehicles to the target dataset consisting of four animals. Fig. 6a shows the variation of the validation loss during the transfer. Fig. 6b shows the validation accuracy during the transfer. The orange curve corresponds to iso-classification transfer; the blue curve is the result of directly fine-tuning the source model on the target data (note the very low accuracy at the start); the green point is the accuracy of training on the target task from scratch.

It is evident that both the classification accuracy and loss are constant throughout the transfer. CIFAR-10 is a more complex dataset compared with MNIST, and the accuracy gap between iso-classification transfer, fine-tuning from the source and training from scratch is significant.
Observe that the classification loss gap between iso-classification transfer and training from scratch on the target is also significant. The benefit of running the iso-classification transfer is that we are guaranteed the final accuracy and validation loss of the model.

Details of the experimental setup for the CIFAR-10 transfer. At time t, the parameters λ, γ determine our objective function. We compute the iso-classification transfer process by first setting the initial state (λ = 4, γ = 100). We train on the source dataset for 300 epochs with Adam and a learning rate of 1E-3 that drops by a factor of 10 after every 120 epochs to obtain the initial state. We change λ, γ with respect to time t and then apply the equilibration learning rate schedule of Fig. 4a to achieve the transition between equilibrium states. We compute the partial derivative ∂C/∂γ using finite differences. At each time t, solving (32) with these partial derivatives leads to the solution for γ, where λ is a constant. In our experiment we set λ = 1.5.

6. Related work

We are motivated by the Information Bottleneck (IB) principle of Shwartz-Ziv & Tishby (2017); Tishby et al. (2000), which has been further explored by Achille & Soatto (2017); Alemi et al. (2016); Higgins et al. (2017). The key difference in our work is that while these papers seek to understand the representation for a given task, we focus on how the representation can be adapted to a new task. Further, the Lagrangian in (8) has connections to PAC-Bayes bounds (Dziugaite & Roy, 2017; McAllester, 2013) and training algorithms that use the free-energy (Chaudhari et al., 2019). Our use of rate-distortion for transfer learning is close to the work on unsupervised learning of Brekelmans et al. (2019); Ver Steeg & Galstyan (2015). This paper builds upon the work of Alemi & Fischer (2018); Alemi et al. (2017).
We refine some results therein, viz., we provide a proof of the convexity of the equilibrium surface and identify it with the equilibrium distribution of SGD (Remark 4). We introduce new ideas such as dynamical processes on the equilibrium surface. Our use of thermodynamics is purely as an inspiration; the work presented here is mathematically rigorous and also provides an immediate algorithmic realization of the ideas. This paper has strong connections to works that study stochastic processes inspired by statistical physics for machine learning, e.g., approximate Bayesian inference and the implicit regularization of SGD (Chaudhari & Soatto, 2017; Mandt et al., 2017), and variational inference (Jordan et al., 1998; Kingma & Welling, 2013). The iso-classification process instantiates an automatic regularization via the trade-off between rate and distortion; this point-of-view is an exciting prospect for future work. The technical content of the paper also draws from optimal transportation (Villani, 2008). A large number of applications begin with pre-trained models (Girshick et al., 2014; Sharif Razavian et al., 2014) or models trained on different tasks (Doersch & Zisserman, 2017). Current methods in transfer learning, however, do not come with guarantees on the performance on the target dataset, although there is a rich body of older work (Baxter, 2000) and ongoing work that studies this (Zamir et al., 2018). The information-theoretic understanding of transfer and the constrained dynamical processes developed in our paper are a first step towards building such guarantees. In this context, our theory can also be used to tackle catastrophic forgetting (Kirkpatrick et al., 2017) by detuning the model post-training and building up redundant features.

7. Discussion

We presented dynamical processes that maintain the parameters of the model on an equilibrium surface that arises out of a certain free-energy functional for the encoder-decoder-classifier architecture.
The decoder acts as a measure of the information discarded by the encoder-classifier pair while fitting on a given task. We showed how one can develop an iso-classification process that travels on the equilibrium surface while keeping the classification loss constant. We showed an iso-classification transfer learning process which keeps the classification loss constant while adapting the learnt representation from a source task to a target task. The information-theoretic point-of-view in this paper is rather abstract, but its benefit lies in its exploitation of the equilibrium surface. Relationships between the three functionals, namely rate, distortion and classification, that define this surface, as well as other functionals that connect to the capacity of the hypothesis class such as the entropy S, may allow us to define invariants of the learning process. For complex models such as deep neural networks, such a program may lead to an

References

Achille, A. and Soatto, S. On the emergence of invariance and disentangling in deep representations. arXiv:1706.01350, 2017.

Alemi, A. A. and Fischer, I. TherML: Thermodynamics of machine learning. arXiv:1807.04162, 2018.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv:1612.00410, 2016.

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. arXiv:1711.00464, 2017.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.

Brekelmans, R., Moyer, D., Galstyan, A., and Ver Steeg, G. Exact rate-distortion in autoencoders via echo noise. In Advances in Neural Information Processing Systems, pp. 3884-3895, 2019.

Chaudhari, P. and Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv:1710.11029, 2017.
Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.

Doersch, C. and Zisserman, A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051-2060, 2017.

Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv:1703.11008, 2017.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. arXiv:1603.05027, 2016.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. In Learning in Graphical Models, pp. 105-161. Springer, 1998.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, Computer Science, University of Toronto, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Levin, D. A. and Peres, Y. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic gradient descent as approximate Bayesian inference. arXiv:1704.04289, 2017.

McAllester, D. A PAC-Bayesian tutorial with a dropout bound. arXiv:1307.2118, 2013.

Mézard, M. and Montanari, A. Information, Physics, and Computation. Oxford University Press, 2009.

Noh, H., Hong, S., and Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1520-1528, 2015.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Sethna, J. Statistical Mechanics: Entropy, Order Parameters, and Complexity, volume 14. Oxford University Press, 2006.

Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813, 2014.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv:1703.00810, 2017.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Ver Steeg, G. and Galstyan, A. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pp. 1004-1012, 2015.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712-3722, 2018.