# A Theory of Generative ConvNet

Jianwen Xie (JIANWEN@UCLA.EDU), Yang Lu (YANGLV@UCLA.EDU), Song-Chun Zhu (SCZHU@STAT.UCLA.EDU), Ying Nian Wu (YWU@STAT.UCLA.EDU)
Department of Statistics, University of California, Los Angeles, CA, USA
(Equal contributions.)

Abstract

We show that a generative random field model, which we call generative ConvNet, can be derived from the commonly used discriminative ConvNet, by assuming a ConvNet for multi-category classification and assuming one of the categories is a base category generated by a reference distribution. If we further assume that the non-linearity in the ConvNet is the Rectified Linear Unit (ReLU) and the reference distribution is Gaussian white noise, then we obtain a generative ConvNet model that is unique among energy-based models: the model is piecewise Gaussian, and the means of the Gaussian pieces are defined by an auto-encoder, where the filters in the bottom-up encoding become the basis functions in the top-down decoding, and the binary activation variables detected by the filters in the bottom-up convolution process become the coefficients of the basis functions in the top-down deconvolution process. The Langevin dynamics for sampling the generative ConvNet is driven by the reconstruction error of this auto-encoder. The contrastive divergence learning of the generative ConvNet reconstructs the training images by the auto-encoder. The maximum likelihood learning algorithm can synthesize realistic natural image patterns.

1. Introduction

The convolutional neural network (ConvNet or CNN) (LeCun et al., 1998; Krizhevsky et al., 2012) has proven to be a tremendously successful discriminative or predictive learning machine. Can the discriminative ConvNet be turned into a generative model and an unsupervised learning machine? Achieving this would be highly desirable, because generative models and unsupervised learning can be very useful when the training datasets are small or the labeled data are scarce. It would also be conceptually satisfying if the discriminative classifier and the generative model, and likewise supervised and unsupervised learning, could be treated within a unified ConvNet framework.

In this conceptual paper, we show that a generative random field model, which we call generative ConvNet, can be derived from the commonly used discriminative ConvNet, by assuming a ConvNet for multi-category classification and assuming one of the categories is a base category generated by a reference distribution. The model is in the form of exponential tilting of the reference distribution, where the exponential tilting is defined by the ConvNet scoring function.
If we further assume that the non-linearity in the ConvNet is the Rectified Linear Unit (ReLU) (Krizhevsky et al., 2012) and that the reference distribution is Gaussian white noise, then we obtain a generative ConvNet model that is unique among energy-based models: the model is piecewise Gaussian, and the means of the Gaussian pieces are defined by an auto-encoder, where the filters in the bottom-up encoding become the basis functions in the top-down decoding, and the binary activation variables detected by the filters in the bottom-up convolution process become the coefficients of the basis functions in the top-down deconvolution process (Zeiler & Fergus, 2014). The Langevin dynamics for sampling the generative ConvNet is driven by the reconstruction error of this auto-encoder. The contrastive divergence learning (Hinton, 2002) of the generative ConvNet reconstructs the training images by the auto-encoder. The maximum likelihood learning algorithm can synthesize realistic natural image patterns.

The main purpose of our paper is to explore the properties of the generative ConvNet, such as being piecewise Gaussian with auto-encoding means, where the filters in the bottom-up operation take on a new role as the basis functions in the top-down representation. Such an internal representational structure is essentially unique among energy-based models. It results from the marriage between the piecewise linear structure of the ReLU ConvNet and the Gaussian white noise reference distribution. The auto-encoder we elucidate is a harmonious fusion of bottom-up convolution and top-down deconvolution.

The reason we choose Gaussian white noise as the reference distribution is that it is the maximum entropy distribution with given marginal variance, and is thus the most featureless distribution. Another justification is that Gaussian white noise is the limiting distribution if we zoom out any stochastic texture pattern, due to the central limit theorem (Wu et al., 2008). The ConvNet seeks non-Gaussian features in order to recover the non-Gaussian distribution before the central limit theorem takes effect. Without the Gaussian reference distribution that contributes the $\ell_2$ norm term to the energy function, we would not have the auto-encoding form in the internal representation of the model. In fact, without the Gaussian reference distribution, the density of the model is not even integrable. The Gaussian reference distribution is also crucial for the Langevin dynamics to be driven by the reconstruction error.

The ReLU is the most commonly used non-linearity in modern ConvNets (Krizhevsky et al., 2012). It makes the ConvNet scoring function piecewise linear (Montufar et al., 2014). Without the piecewise linearity of the ConvNet, we would not have the piecewise Gaussian form of the model. The ReLU is the source of the binary activation variables, because the ReLU function can be written as $\max(0, r) = 1(r > 0)\, r$, where $1(r > 0)$ is the binary variable indicating whether $r > 0$. The binary variable can also be derived as the derivative of the ReLU function. The piecewise linearity is also crucial for the exact equivalence between the gradient of the contrastive divergence learning and the gradient of the auto-encoder reconstruction error.

2. Related Work

The model in the form of exponential tilting of a reference distribution, where the exponential tilting is defined by a ConvNet, was first proposed by (Dai et al., 2015).
They did not study the internal representational structure of the model. (Lu et al., 2016) proposed to learn the FRAME (Filters, Random field, And Maximum Entropy) models (Zhu et al., 1997) based on the pre-learned filters of existing ConvNets; they did not learn the models from scratch.

Hierarchical energy-based models (LeCun et al., 2006) were studied in the pioneering work of (Hinton et al., 2006b) and (Ngiam et al., 2011). Their models do not correspond directly to the modern ConvNet and do not possess the internal representational structure of the generative ConvNet.

The generative ConvNet model can be viewed as a hierarchical version of the FRAME model (Zhu et al., 1997), as well as of the Product of Experts (Hinton, 2002; Teh et al., 2003) and Field of Experts (Roth & Black, 2005) models. These models do not have an explicit Gaussian white noise reference distribution; thus they do not have the internal auto-encoding representation, and the filters in these models do not play the role of basis functions. In fact, a main motivation for this paper is to reconcile the FRAME model (Zhu et al., 1997), where the Gabor wavelets play the role of bottom-up filters, and the Olshausen–Field model (Olshausen & Field, 1997), where the wavelets play the role of top-down basis functions. The generative ConvNet may be seen as one step towards achieving this goal. See also (Xie et al., 2014).

The relationship between energy-based models with latent variables and auto-encoders was discovered by (Vincent, 2011) and (Swersky et al., 2011) via the score matching estimator (Hyvärinen, 2005). This connection requires that the free energy can be calculated analytically, i.e., that the latent variables can be integrated out analytically. This is in general not the case for deep energy-based models with multiple layers of latent variables, such as the deep Boltzmann machine with two layers of hidden units (Salakhutdinov & Hinton, 2009). For such models one cannot obtain an explicit auto-encoder, and the inference of the latent variables is in general intractable. In the generative ConvNet, the multiple layers of binary activation variables come from the ReLU units, and the means of the Gaussian pieces are always defined by an explicit hierarchical auto-encoder.

Compared to hierarchical models with explicit binary latent variables, such as those based on the Boltzmann machine (Hinton et al., 2006a; Salakhutdinov & Hinton, 2009; Lee et al., 2009), the generative ConvNet is directly derived from the discriminative ConvNet. Our work seems to suggest that in searching for generative models and unsupervised learning machines, we need look no further than the ConvNet.

3. Generative ConvNet

To fix notation, let $\mathbf{I}(x)$ be an image defined on the square (or rectangular) image domain $D$, where $x = (x_1, x_2)$ indexes the coordinates of pixels. We can treat $\mathbf{I}(x)$ as a two-dimensional function defined on $D$. We can also treat $\mathbf{I}$ as a vector if we fix an ordering for the pixels. For a filter $F$, let $F * \mathbf{I}$ denote the filtered image or feature map, and let $[F * \mathbf{I}](x)$ denote the filter response or feature at position $x$.

A ConvNet is a composition of multiple layers of linear filtering and element-wise non-linear transformation, as expressed by the following recursive formula:

$$[F^{(l)}_k * \mathbf{I}](x) = h\left(\sum_{i=1}^{N_{l-1}} \sum_{y \in S_l} w^{(l,k)}_{i,y}\, [F^{(l-1)}_i * \mathbf{I}](x+y) + b_{l,k}\right), \quad (1)$$

where $l \in \{1, 2, \ldots, L\}$ indexes the layer.
$\{F^{(l)}_k, k = 1, \ldots, N_l\}$ are the filters at layer $l$, and $\{F^{(l-1)}_i, i = 1, \ldots, N_{l-1}\}$ are the filters at layer $l-1$; $k$ and $i$ index the filters at layers $l$ and $l-1$, and $N_l$ and $N_{l-1}$ are the numbers of filters at these layers. The filters are locally supported, so the range of $y$ is within a local support $S_l$ (such as a $7 \times 7$ image patch). At the bottom layer, $[F^{(0)}_k * \mathbf{I}](x) = \mathbf{I}_k(x)$, where $k \in \{R, G, B\}$ indexes the three color channels. Sub-sampling may be implemented so that in $[F^{(l)}_k * \mathbf{I}](x)$, $x \in D_l \subset D$. For notational simplicity, we do not make local max pooling explicit in (1). We take $h(r) = \max(r, 0)$, the Rectified Linear Unit (ReLU), which is commonly adopted in modern ConvNets (Krizhevsky et al., 2012).

Let $(F^{(L)}_k)$ be the top-layer filters; the filtered images at the top layer are usually $1 \times 1$ due to repeated sub-sampling. Suppose there are $C$ categories. For category $c \in \{1, \ldots, C\}$, the scoring function for classification is

$$f_c(\mathbf{I}; w) = \sum_{k=1}^{K} w_{c,k} [F^{(L)}_k * \mathbf{I}], \quad (2)$$

where $w_{c,k}$ are the category-specific weight parameters for classification.

Definition 1 (Discriminative ConvNet). We define the following conditional distribution as the discriminative ConvNet:

$$p(c \mid \mathbf{I}; w) = \frac{\exp[f_c(\mathbf{I}; w) + b_c]}{\sum_{c=1}^{C} \exp[f_c(\mathbf{I}; w) + b_c]}, \quad (3)$$

where $b_c$ is the bias term, and $w$ collects all the weight and bias parameters at all the layers. The discriminative ConvNet is a multinomial logistic regression (or soft-max) that is commonly used for classification (LeCun et al., 1998; Krizhevsky et al., 2012).

Definition 2 (Generative ConvNet, fully connected version). We define the following random field model as the fully connected version of the generative ConvNet:

$$p(\mathbf{I} \mid c; w) = p_c(\mathbf{I}; w) = \frac{1}{Z_c(w)} \exp[f_c(\mathbf{I}; w)]\, q(\mathbf{I}), \quad (4)$$

where $q(\mathbf{I})$ is a reference distribution or null model, assumed in this paper to be Gaussian white noise, and $Z_c(w) = \mathrm{E}_q\{\exp[f_c(\mathbf{I}; w)]\}$ is the normalizing constant. In (4), $p_c(\mathbf{I}; w)$ is obtained by exponential tilting of $q(\mathbf{I})$ and is the conditional distribution of the image given the category. The model was first proposed by (Dai et al., 2015).

Proposition 1 (Generative and discriminative ConvNets can be derived from each other). (a) Let $\rho_c$ be the prior probability of category $c$. If $p(\mathbf{I} \mid c; w) = p_c(\mathbf{I}; w)$ is defined according to model (4), then $p(c \mid \mathbf{I}; w)$ is given by model (3), with $b_c = \log \rho_c - \log Z_c(w) + \text{constant}$. (b) Suppose a base category $c = 1$ is generated by $q(\mathbf{I})$, and suppose we fix the scoring function and the bias term of the base category at $f_1(\mathbf{I}; w) = 0$ and $b_1 = 0$. If $p(c \mid \mathbf{I}; w)$ is given by model (3), then $p(\mathbf{I} \mid c; w) = p_c(\mathbf{I}; w)$ is of the form of model (4), with $b_c = \log \rho_c - \log \rho_1 - \log Z_c(w)$.

If we observe only unlabeled data $\{\mathbf{I}_m, m = 1, \ldots, M\}$, we may still use the exponential tilting form to model and learn from them. A possible model is to learn the filters at a certain convolutional layer of a ConvNet.

Definition 3 (Generative ConvNet, convolutional version). We define the following Markov random field model as the convolutional version of the generative ConvNet:

$$p(\mathbf{I}; w) = \frac{1}{Z(w)} \exp\left[\sum_{k=1}^{K} \sum_{x \in D_L} [F^{(L)}_k * \mathbf{I}](x)\right] q(\mathbf{I}), \quad (5)$$

where $w$ consists of all the weight and bias terms that define the filters $(F^{(L)}_k, k = 1, \ldots, K = N_L)$, and $q(\mathbf{I})$ is the Gaussian white noise model. Model (5) corresponds to the exponential tilting model (4) with scoring function

$$f(\mathbf{I}; w) = \sum_{k=1}^{K} \sum_{x \in D_L} [F^{(L)}_k * \mathbf{I}](x). \quad (6)$$

For the rest of the paper we focus on model (5), but all the results can be easily extended to model (4).
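To make the recursion concrete, the following minimal numpy sketch implements formula (1) and the scoring function (6). This is our illustration, not the authors' MatConvNet code: the names `relu`, `conv_layer`, and `score` are ours, local max pooling is omitted (as it is in (1)), and the nested loops favor clarity over efficiency.

```python
import numpy as np

def relu(r):
    """h(r) = max(0, r), the ReLU non-linearity."""
    return np.maximum(r, 0.0)

def conv_layer(maps, weights, bias, stride=1):
    """One layer of the recursive formula (1): valid convolution over
    all input channels, plus bias, then ReLU, with optional sub-sampling.
    maps:    (N_{l-1}, H, W)       feature maps F^(l-1) * I
    weights: (N_l, N_{l-1}, s, s)  filters w^(l,k)
    bias:    (N_l,)                bias terms b_{l,k}"""
    n_out, n_in, s, _ = weights.shape
    H = (maps.shape[1] - s) // stride + 1
    W = (maps.shape[2] - s) // stride + 1
    out = np.empty((n_out, H, W))
    for k in range(n_out):
        for x1 in range(H):
            for x2 in range(W):
                u, v = x1 * stride, x2 * stride
                out[k, x1, x2] = relu(
                    np.sum(weights[k] * maps[:, u:u + s, v:v + s]) + bias[k])
    return out

def score(I, layers):
    """Scoring function (6): the sum of all top-layer filter responses.
    I: (3, H, W) image; layers: list of (weights, bias, stride) triples."""
    maps = I
    for weights, bias, stride in layers:
        maps = conv_layer(maps, weights, bias, stride)
    return float(maps.sum())
```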
4. A Prototype Model

In order to reveal the internal structure of the generative ConvNet, it helps to start from the simplest prototype model. A similar model was studied by (Xie et al., 2016). In our prototype model, we assume that the image domain $D$ is small (e.g., $10 \times 10$). Suppose we want to learn a dictionary of filters or basis functions from a set of observed image patches $\{\mathbf{I}_m, m = 1, \ldots, M\}$ defined on $D$. We denote these filters or basis functions by $(w_k, k = 1, \ldots, K)$, where each $w_k$ is itself an image patch defined on $D$. Let $\langle \mathbf{I}, w_k \rangle = \sum_{x \in D} w_k(x)\, \mathbf{I}(x)$ be the inner product between the image patches $\mathbf{I}$ and $w_k$; it is also the response of $\mathbf{I}$ to the linear filter $w_k$.

Definition 4 (Prototype model). We define the following random field model as the prototype model:

$$p(\mathbf{I}; w) = \frac{1}{Z(w)} \exp\left[\sum_{k=1}^{K} h(\langle \mathbf{I}, w_k \rangle + b_k)\right] q(\mathbf{I}), \quad (7)$$

where $b_k$ is the bias term, $w = (w_k, b_k, k = 1, \ldots, K)$, and $h(r) = \max(r, 0)$. $q(\mathbf{I})$ is the Gaussian white noise model

$$q(\mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{|D|/2}} \exp\left[-\frac{1}{2\sigma^2} \|\mathbf{I}\|^2\right], \quad (8)$$

where $|D|$ counts the number of pixels in the domain $D$.

Piecewise Gaussian and binary activation variables: Without loss of generality, let us assume $\sigma^2 = 1$ in $q(\mathbf{I})$. Define the binary activation variable $\delta_k(\mathbf{I}; w) = 1$ if $\langle \mathbf{I}, w_k \rangle + b_k > 0$ and $\delta_k(\mathbf{I}; w) = 0$ otherwise, i.e., $\delta_k(\mathbf{I}; w) = 1(\langle \mathbf{I}, w_k \rangle + b_k > 0)$, where $1(\cdot)$ is the indicator function. Then $h(\langle \mathbf{I}, w_k \rangle + b_k) = \delta_k(\mathbf{I}; w)(\langle \mathbf{I}, w_k \rangle + b_k)$. The image space is divided into $2^K$ pieces by the $K$ hyper-planes $\langle \mathbf{I}, w_k \rangle + b_k = 0$, $k = 1, \ldots, K$, according to the values of the binary activation variables $(\delta_k(\mathbf{I}; w), k = 1, \ldots, K)$. Consider the piece of image space where $\delta_k(\mathbf{I}; w) = \delta_k$ for $k = 1, \ldots, K$; here we slightly abuse notation, with $\delta_k \in \{0, 1\}$ on the right-hand side denoting the value of $\delta_k(\mathbf{I}; w)$. Write $\delta(\mathbf{I}; w) = (\delta_k(\mathbf{I}; w), k = 1, \ldots, K)$, and $\delta = (\delta_k, k = 1, \ldots, K)$ for an instantiation of $\delta(\mathbf{I}; w)$. We call $\delta(\mathbf{I}; w)$ the activation pattern of $\mathbf{I}$. Let $A(\delta; w) = \{\mathbf{I} : \delta(\mathbf{I}; w) = \delta\}$ be the piece of image space consisting of images that share the same activation pattern $\delta$. The probability density on this piece is

$$p(\mathbf{I}; w, \delta) \propto \exp\left[\sum_{k=1}^{K} \delta_k b_k + \left\langle \mathbf{I}, \sum_{k=1}^{K} \delta_k w_k \right\rangle - \frac{\|\mathbf{I}\|^2}{2}\right] \propto \exp\left[-\frac{1}{2}\left\|\mathbf{I} - \sum_{k=1}^{K} \delta_k w_k\right\|^2\right], \quad (9)$$

which is $\mathrm{N}(\sum_k \delta_k w_k, \mathbf{1})$ restricted to the piece $A(\delta; w)$, where the bold font $\mathbf{1}$ is the identity matrix. The mean of this Gaussian piece seeks to reconstruct the images in $A(\delta; w)$ via the auto-encoding scheme $\mathbf{I} \rightarrow \delta \rightarrow \sum_k \delta_k w_k$.

Synthesis via reconstruction: One can sample from $p(\mathbf{I}; w)$ in (7) by the Langevin dynamics

$$\mathbf{I}_{\tau+1} = \mathbf{I}_\tau - \frac{\epsilon^2}{2}\left[\mathbf{I}_\tau - \sum_{k=1}^{K} \delta_k(\mathbf{I}_\tau; w)\, w_k\right] + \epsilon Z_\tau, \quad (10)$$

where $\tau$ denotes the time step, $\epsilon$ denotes the step size (assumed sufficiently small throughout this paper), and $Z_\tau \sim \mathrm{N}(0, \mathbf{1})$. The dynamics is driven by the auto-encoding reconstruction error $\mathbf{I}_\tau - \sum_{k=1}^{K} \delta_k(\mathbf{I}_\tau; w)\, w_k$, where the reconstruction is based on the binary activation variables $(\delta_k)$. This links synthesis to reconstruction.

Local modes are exactly auto-encoding: If the mean of a Gaussian piece is a local energy minimum $\hat{\mathbf{I}}$, we have the exact auto-encoding $\hat{\mathbf{I}} = \sum_{k=1}^{K} \delta_k(\hat{\mathbf{I}}; w)\, w_k$. The encoding process is bottom-up and infers $\delta_k = \delta_k(\hat{\mathbf{I}}; w) = 1(\langle \hat{\mathbf{I}}, w_k \rangle + b_k > 0)$. The decoding process is top-down and reconstructs $\hat{\mathbf{I}} = \sum_k \delta_k w_k$. In the encoding process, $w_k$ plays the role of filter; in the decoding process, $w_k$ plays the role of basis function.
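To illustrate how the dynamics (10) is driven by reconstruction, here is a minimal numpy sketch of one Langevin step for the prototype model. The code and its names are our own, assuming flattened images and $\sigma^2 = 1$:

```python
import numpy as np

def langevin_step(I, w, b, eps, rng):
    """One step of the Langevin dynamics (10) for the prototype
    model (7). I: flattened image, shape (|D|,); w: (K, |D|) filters,
    which double as basis functions; b: (K,) biases; eps: step size."""
    delta = (w @ I + b > 0).astype(float)  # binary activations delta_k(I; w)
    recon = w.T @ delta                    # top-down reconstruction sum_k delta_k w_k
    noise = rng.standard_normal(I.shape)   # Z_tau ~ N(0, 1)
    return I - (eps ** 2 / 2.0) * (I - recon) + eps * noise
```

The drift term is literally the auto-encoding reconstruction error $\mathbf{I} - \sum_k \delta_k w_k$: when the reconstruction is exact, only the noise term remains.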
5. Internal Structure of the Generative ConvNet

To generalize the prototype model (7) to the generative ConvNet (5), we only need to add two elements: (1) horizontal unfolding: make the filters $(w_k)$ convolutional; (2) vertical unfolding: make the filters $(w_k)$ multi-layer or hierarchical. The results we obtained for the prototype model can be unfolded accordingly.

To derive the internal representational structure of the generative ConvNet, the key is to write the scoring function $f(\mathbf{I}; w)$ as a linear function $\alpha + \langle \mathbf{I}, B \rangle$ on each piece of image space with a fixed activation pattern. Combined with the $-\|\mathbf{I}\|^2/2$ term from $q(\mathbf{I})$, the energy function is then quadratic, i.e., $\|\mathbf{I}\|^2/2 - \langle \mathbf{I}, B \rangle - \alpha$, and the probability distribution is a truncated Gaussian with mean $B$. In order to write $f(\mathbf{I}; w) = \alpha + \langle \mathbf{I}, B \rangle$ for a fixed activation pattern, we shall use vector notation for the ConvNet and derive $B$ by a top-down deconvolution process. $B$ can also be obtained as $\partial f(\mathbf{I}; w)/\partial \mathbf{I}$ via back-propagation.

For layer $l$, the $N_l$ filters are denoted by the compact notation $F^{(l)} = (F^{(l)}_k, k = 1, \ldots, N_l)$, and the $N_l$ filtered images or feature maps by $F^{(l)} * \mathbf{I} = (F^{(l)}_k * \mathbf{I}, k = 1, \ldots, N_l)$. $F^{(l)} * \mathbf{I}$ is a 3D image containing all the $N_l$ filtered images at layer $l$. In vector notation, the recursive formula (1) can be rewritten as

$$[F^{(l)}_k * \mathbf{I}](x) = h\left(\langle w^{(l)}_{k,x}, F^{(l-1)} * \mathbf{I} \rangle + b_{l,k}\right), \quad (11)$$

where $w^{(l)}_{k,x}$ matches the dimension of $F^{(l-1)} * \mathbf{I}$, the 3D image containing all the $N_{l-1}$ filtered images at layer $l-1$. Specifically,

$$\langle w^{(l)}_{k,x}, F^{(l-1)} * \mathbf{I} \rangle = \sum_{i=1}^{N_{l-1}} \sum_{y \in S_l} w^{(l,k)}_{i,y}\, [F^{(l-1)}_i * \mathbf{I}](x+y). \quad (12)$$

The 3D basis functions $(w^{(l)}_{k,x})$ are locally supported, and they are spatially translated copies of each other for different positions $x$, i.e., $w^{(l)}_{k,x,i}(x+y) = w^{(l,k)}_{i,y}$ for $i \in \{1, \ldots, N_{l-1}\}$, $x \in D_l$, and $y \in S_l$. For instance, at layer $l = 1$, $w^{(1)}_{k,x}$ is a Gabor-like wavelet of type $k$ centered at position $x$.

Define the binary activation variable

$$\delta^{(l)}_{k,x}(\mathbf{I}; w) = 1\left(\langle w^{(l)}_{k,x}, F^{(l-1)} * \mathbf{I} \rangle + b_{l,k} > 0\right). \quad (13)$$

According to (1), we have the following bottom-up process:

$$[F^{(l)}_k * \mathbf{I}](x) = \delta^{(l)}_{k,x}(\mathbf{I}; w)\left(\langle w^{(l)}_{k,x}, F^{(l-1)} * \mathbf{I} \rangle + b_{l,k}\right). \quad (14)$$

Let $\delta(\mathbf{I}; w) = (\delta^{(l)}_{k,x}(\mathbf{I}; w), \forall k, x, l)$ be the activation pattern at all the layers. The activation pattern $\delta(\mathbf{I}; w)$ can be computed in the bottom-up process (13)–(14) of the ConvNet.

For the scoring function $f(\mathbf{I}; w) = \sum_{k=1}^{K} \sum_{x \in D_L} [F^{(L)}_k * \mathbf{I}](x)$ defined in (6) for the generative ConvNet, we can write it in terms of the filter responses at any lower layer $l \leq L$:

$$f(\mathbf{I}; w) = \alpha_l + \langle B^{(l)}, F^{(l)} * \mathbf{I} \rangle = \alpha_l + \sum_{k=1}^{N_l} \sum_{x \in D_l} B^{(l)}_k(x)\, [F^{(l)}_k * \mathbf{I}](x), \quad (15)$$

where $B^{(l)} = (B^{(l)}_k(x), k = 1, \ldots, N_l, x \in D_l)$ consists of the maps of the coefficients at layer $l$; $B^{(l)}$ matches the dimension of $F^{(l)} * \mathbf{I}$. When $l = L$, $B^{(L)}$ consists of maps of 1's, i.e., $B^{(L)}_k(x) = 1$ for $k = 1, \ldots, K = N_L$ and $x \in D_L$. According to (14) and (15), we have the following top-down process:

$$B^{(l-1)} = \sum_{k=1}^{N_l} \sum_{x \in D_l} B^{(l)}_k(x)\, \delta^{(l)}_{k,x}(\mathbf{I}; w)\, w^{(l)}_{k,x}, \quad (16)$$

where both $B^{(l-1)}$ and $w^{(l)}_{k,x}$ match the dimension of $F^{(l-1)} * \mathbf{I}$. Equation (16) is a top-down deconvolution process, in which $B^{(l)}_k(x)\, \delta^{(l)}_{k,x}$ serves as the coefficient of the basis function $w^{(l)}_{k,x}$. This top-down deconvolution is similar to, but subtly different from, that of (Zeiler & Fergus, 2014), because (16) is controlled by the multiple layers of activation variables $\delta^{(l)}_{k,x}$ computed in the bottom-up pass of the ConvNet: $\delta^{(l)}_{k,x}$ turns the basis function $w^{(l)}_{k,x}$ on or off, and it is determined by the response of the filter $F^{(l)}_k$ at $x$. The recursive relationship for $\alpha_l$ can be similarly derived.

In the bottom-up convolution process (14), the $(w^{(l)}_{k,x})$ serve as filters; in the top-down deconvolution process (16), the $(w^{(l)}_{k,x})$ serve as basis functions. Let $B = B^{(0)}$ and $\alpha = \alpha_0$. Since $F^{(0)} * \mathbf{I} = \mathbf{I}$, we have $f(\mathbf{I}; w) = \alpha + \langle \mathbf{I}, B \rangle$.
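The following numpy sketch (our code, with our own names) implements the bottom-up pass (13)–(14) with recorded activations and the top-down process (16), starting from $B^{(L)}$ = maps of 1's. The returned array is $B = B^{(0)}$, which equals $\partial f/\partial \mathbf{I}$ for the given activation pattern. `layers` is a list of `(weights, bias, stride)` triples as in the Section 3 sketch.

```python
import numpy as np

def forward_with_deltas(I, layers):
    """Bottom-up pass (13)-(14): compute and record the binary
    activation maps delta^(l) at every layer."""
    maps, deltas = I, []
    for weights, bias, stride in layers:
        n_out, n_in, s, _ = weights.shape
        H = (maps.shape[1] - s) // stride + 1
        W = (maps.shape[2] - s) // stride + 1
        pre = np.empty((n_out, H, W))
        for k in range(n_out):
            for x1 in range(H):
                for x2 in range(W):
                    u, v = x1 * stride, x2 * stride
                    pre[k, x1, x2] = np.sum(
                        weights[k] * maps[:, u:u + s, v:v + s]) + bias[k]
        delta = (pre > 0).astype(float)    # delta^(l)_{k,x}(I; w), eq. (13)
        deltas.append(delta)
        maps = delta * pre                 # ReLU written as delta * response, eq. (14)
    return deltas

def top_down(layers, deltas):
    """Top-down deconvolution (16): start from B^(L) = maps of 1's and
    distribute the coefficients downward, gated by the activations."""
    B = np.ones_like(deltas[-1])
    for (weights, _, stride), delta in zip(reversed(layers), reversed(deltas)):
        n_out, n_in, s, _ = weights.shape
        H = (delta.shape[1] - 1) * stride + s
        W = (delta.shape[2] - 1) * stride + s
        B_down = np.zeros((n_in, H, W))
        for k in range(n_out):
            for x1 in range(delta.shape[1]):
                for x2 in range(delta.shape[2]):
                    u, v = x1 * stride, x2 * stride
                    # coefficient B^(l)_k(x) * delta^(l)_{k,x} times basis w^(l)_{k,x}
                    B_down[:, u:u + s, v:v + s] += \
                        B[k, x1, x2] * delta[k, x1, x2] * weights[k]
        B = B_down
    return B                               # B^(0), i.e., B_{w,delta}
```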
Note that $B$ depends on the activation pattern $\delta(\mathbf{I}; w) = (\delta^{(l)}_{k,x}(\mathbf{I}; w), \forall k, x, l)$, as well as on $w$, which collects the weight and bias parameters at all the layers. On the piece of image space $A(\delta; w) = \{\mathbf{I} : \delta(\mathbf{I}; w) = \delta\}$ with a fixed activation pattern (again we slightly abuse notation, with $\delta = (\delta^{(l)}_{k,x} \in \{0, 1\}, \forall k, x, l)$ denoting an instantiation of the activation pattern), $B$ and $\alpha$ depend on $\delta$ and $w$. To make this dependency explicit, we write $B = B_{w,\delta}$ and $\alpha = \alpha_{w,\delta}$, so that

$$f(\mathbf{I}; w) = \alpha_{w,\delta} + \langle \mathbf{I}, B_{w,\delta} \rangle. \quad (17)$$

See (Montufar et al., 2014) for an analysis of the number of linear pieces.

Theorem 1 (The generative ConvNet is piecewise Gaussian). With ReLU $h(r) = \max(0, r)$ and Gaussian white noise $q(\mathbf{I})$, $p(\mathbf{I}; w)$ of model (5) is piecewise Gaussian. On each piece $A(\delta; w)$, the density is $\mathrm{N}(B_{w,\delta}, \mathbf{1})$ truncated to $A(\delta; w)$, i.e., $B_{w,\delta}$ is an approximate reconstruction of the images in $A(\delta; w)$.

Theorem 1 follows from the fact that on $A(\delta; w)$,

$$p(\mathbf{I}; w, \delta) \propto \exp\left[\alpha_{w,\delta} + \langle \mathbf{I}, B_{w,\delta} \rangle - \frac{\|\mathbf{I}\|^2}{2}\right] \propto \exp\left[-\frac{1}{2}\|\mathbf{I} - B_{w,\delta}\|^2\right], \quad (18)$$

which is $\mathrm{N}(B_{w,\delta}, \mathbf{1})$ restricted to $A(\delta; w)$. For each $\mathbf{I}$, the binary activation variables in $\delta = \delta(\mathbf{I}; w)$ are computed by the bottom-up convolution process (13)–(14), and $B_{w,\delta}$ is computed by the top-down deconvolution process (16). $B_{w,\delta}$ seeks to reconstruct the images in $A(\delta; w)$ via the auto-encoding scheme $\mathbf{I} \rightarrow \delta \rightarrow B_{w,\delta}$.

One can sample from $p(\mathbf{I}; w)$ of model (5) by the Langevin dynamics

$$\mathbf{I}_{\tau+1} = \mathbf{I}_\tau - \frac{\epsilon^2}{2}\left[\mathbf{I}_\tau - B_{w,\delta(\mathbf{I}_\tau; w)}\right] + \epsilon Z_\tau, \quad (19)$$

where $Z_\tau \sim \mathrm{N}(0, \mathbf{1})$. Again, the dynamics is driven by the auto-encoding reconstruction error $\mathbf{I} - B_{w,\delta(\mathbf{I}; w)}$. The deterministic part of the Langevin equation (19) defines an attractor dynamics that converges to a local energy minimum (Zhu & Mumford, 1998).

Proposition 2 (The local modes are exactly auto-encoding). Let $\hat{\mathbf{I}}$ be a local maximum of $p(\mathbf{I}; w)$ of model (5). Then we have exact auto-encoding of $\hat{\mathbf{I}}$ with the following bottom-up and top-down passes:

Bottom-up encoding: $\delta = \delta(\hat{\mathbf{I}}; w)$; Top-down decoding: $\hat{\mathbf{I}} = B_{w,\delta}$. (20)

The local energy minima are means of Gaussian pieces as in Theorem 1, but the reverse is not necessarily true, because $B_{w,\delta}$ does not necessarily belong to $A(\delta; w)$. If $B_{w,\delta} \in A(\delta; w)$, however, then $B_{w,\delta}$ must be a local mode and is exactly auto-encoding. Proposition 2 can be generalized to a general non-linearity $h(\cdot)$, whereas Theorem 1 holds only for a piecewise linear $h(\cdot)$ such as the ReLU. Proposition 2 is related to the Hopfield network (Hopfield, 1982): it shows that the Hopfield minima can be represented by a hierarchical auto-encoder.

6. Learning the Generative ConvNet

The parameters $w$ can be learned from training images $\{\mathbf{I}_m, m = 1, \ldots, M\}$ by maximum likelihood. Let $L(w) = \frac{1}{M}\sum_{m=1}^{M} \log p(\mathbf{I}_m; w)$, with $p(\mathbf{I}; w)$ defined in (5). Then

$$L'(w) = \frac{1}{M}\sum_{m=1}^{M} \frac{\partial}{\partial w} f(\mathbf{I}_m; w) - \mathrm{E}_w\left[\frac{\partial}{\partial w} f(\mathbf{I}; w)\right]. \quad (21)$$

The expectation can be approximated by Monte Carlo samples (Younes, 1999) from the Langevin dynamics (19). See Algorithm 1 for a description of the learning and sampling algorithm.

Algorithm 1 Learning and sampling algorithm

Input:
(1) training images $\{\mathbf{I}_m, m = 1, \ldots, M\}$
(2) number of synthesized images $\tilde{M}$
(3) number of Langevin steps $L$
(4) number of learning iterations $T$

Output:
(1) estimated parameters $w$
(2) synthesized images $\{\tilde{\mathbf{I}}_m, m = 1, \ldots, \tilde{M}\}$

1: Let $t \leftarrow 0$, initialize $w^{(0)} \leftarrow 0$.
2: Initialize $\tilde{\mathbf{I}}_m \leftarrow 0$, for $m = 1, \ldots, \tilde{M}$.
3: repeat
4: For each $m$, run $L$ steps of Langevin dynamics to update $\tilde{\mathbf{I}}_m$, i.e., starting from the current $\tilde{\mathbf{I}}_m$, each step follows equation (19).
5: Calculate $H^{\mathrm{obs}} = \frac{1}{M}\sum_{m=1}^{M} \frac{\partial}{\partial w} f(\mathbf{I}_m; w^{(t)})$ and $H^{\mathrm{syn}} = \frac{1}{\tilde{M}}\sum_{m=1}^{\tilde{M}} \frac{\partial}{\partial w} f(\tilde{\mathbf{I}}_m; w^{(t)})$.
6: Update $w^{(t+1)} \leftarrow w^{(t)} + \eta(H^{\mathrm{obs}} - H^{\mathrm{syn}})$, with step size $\eta$.
7: Let $t \leftarrow t + 1$.
8: until $t = T$
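To make Algorithm 1 concrete, here is a sketch specialized to the prototype model (7), where the gradients in step 5 are analytic ($\partial f/\partial w_k = \delta_k(\mathbf{I}; w)\,\mathbf{I}$ and $\partial f/\partial b_k = \delta_k(\mathbf{I}; w)$), so no back-propagation machinery is needed. The function name, the default settings, and the small random initialization are our assumptions; the paper's experiments use the multi-layer model.

```python
import numpy as np

def learn_prototype(images, K, T=100, L=10, M_syn=16,
                    eps=0.01, eta=1e-4, seed=0):
    """Algorithm 1 for the prototype model (7).
    images: (M, |D|) flattened, zero-mean training images."""
    rng = np.random.default_rng(seed)
    M, D = images.shape
    w = 0.001 * rng.standard_normal((K, D))  # filters / basis functions
    b = np.zeros(K)
    syn = np.zeros((M_syn, D))               # persistent synthesis chains
    for t in range(T):
        for _ in range(L):                   # step 4: Langevin updates, eq. (10)
            d = (syn @ w.T + b > 0).astype(float)        # (M_syn, K)
            syn += -(eps ** 2 / 2) * (syn - d @ w) \
                   + eps * rng.standard_normal(syn.shape)
        # step 5: H^obs and H^syn, using df/dw_k = delta_k * I, df/db_k = delta_k
        d_obs = (images @ w.T + b > 0).astype(float)     # (M, K)
        d_syn = (syn @ w.T + b > 0).astype(float)
        # step 6: ascend the log-likelihood gradient (21)
        w += eta * (d_obs.T @ images / M - d_syn.T @ syn / M_syn)
        b += eta * (d_obs.mean(axis=0) - d_syn.mean(axis=0))
    return w, b, syn
```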
If we want to learn from big data, we may use contrastive divergence (Hinton, 2002) by starting the Langevin dynamics from the observed images. Contrastive divergence tends to learn the auto-encoder in the generative ConvNet. Suppose we start from an observed image $\mathbf{I}^{\mathrm{obs}}$ and run a small number of iterations of the Langevin dynamics (19) to get a synthesized image $\mathbf{I}^{\mathrm{syn}}$. If $\mathbf{I}^{\mathrm{obs}}$ and $\mathbf{I}^{\mathrm{syn}}$ share the same activation pattern, $\delta(\mathbf{I}^{\mathrm{obs}}; w) = \delta(\mathbf{I}^{\mathrm{syn}}; w) = \delta$, then $f(\mathbf{I}; w) = \alpha_{w,\delta} + \langle \mathbf{I}, B_{w,\delta} \rangle$ for both $\mathbf{I}^{\mathrm{obs}}$ and $\mathbf{I}^{\mathrm{syn}}$, and the contribution of $\mathbf{I}^{\mathrm{obs}}$ to the learning gradient is

$$\frac{\partial}{\partial w} f(\mathbf{I}^{\mathrm{obs}}; w) - \frac{\partial}{\partial w} f(\mathbf{I}^{\mathrm{syn}}; w) = \left\langle \mathbf{I}^{\mathrm{obs}} - \mathbf{I}^{\mathrm{syn}}, \frac{\partial}{\partial w} B_{w,\delta} \right\rangle. \quad (22)$$

If $\mathbf{I}^{\mathrm{syn}}$ is close to the mean $B_{w,\delta}$ and $B_{w,\delta}$ is a local mode, then contrastive divergence tends to reconstruct $\mathbf{I}^{\mathrm{obs}}$ by the local mode $B_{w,\delta}$, because the gradient

$$-\frac{\partial}{\partial w} \frac{1}{2}\|\mathbf{I}^{\mathrm{obs}} - B_{w,\delta}\|^2 = \left\langle \mathbf{I}^{\mathrm{obs}} - B_{w,\delta}, \frac{\partial}{\partial w} B_{w,\delta} \right\rangle. \quad (23)$$

Hence contrastive divergence learns a Hopfield network that memorizes the observations by local modes. We can establish a precise connection for one-step contrastive divergence.

Proposition 3 (Contrastive divergence learns to reconstruct). If the one-step Langevin dynamics does not change the activation pattern, i.e., $\delta(\mathbf{I}^{\mathrm{obs}}; w) = \delta(\mathbf{I}^{\mathrm{syn}}; w) = \delta$, then the one-step contrastive divergence has an expected gradient that is proportional to the negative gradient of the reconstruction error:

$$\mathrm{E}_Z\left[\frac{\partial}{\partial w} f(\mathbf{I}^{\mathrm{obs}}; w) - \frac{\partial}{\partial w} f(\mathbf{I}^{\mathrm{syn}}; w)\right] \propto -\frac{\partial}{\partial w} \|\mathbf{I}^{\mathrm{obs}} - B_{w,\delta}\|^2.$$

This is because $\mathbf{I}^{\mathrm{syn}} = \mathbf{I}^{\mathrm{obs}} - \frac{\epsilon^2}{2}(\mathbf{I}^{\mathrm{obs}} - B_{w,\delta}) + \epsilon Z$, hence $\mathrm{E}_Z[\mathbf{I}^{\mathrm{obs}} - \mathbf{I}^{\mathrm{syn}}] \propto \mathbf{I}^{\mathrm{obs}} - B_{w,\delta}$, and Proposition 3 follows from (22) and (23). The contrastive divergence learning updates the bias terms to match the statistics of the activation patterns of $\mathbf{I}^{\mathrm{obs}}$ and $\mathbf{I}^{\mathrm{syn}}$, which helps to ensure that $\delta(\mathbf{I}^{\mathrm{obs}}; w) = \delta(\mathbf{I}^{\mathrm{syn}}; w)$.

Proposition 3 is related to the score matching estimator (Hyvärinen, 2005), whose connection with contrastive divergence based on one-step Langevin dynamics was studied by (Hyvärinen, 2007). Our work can be considered a sharpened specialization of this connection: the piecewise linear structure of the ConvNet greatly simplifies the matter by eliminating the complicated second-derivative terms, so that the contrastive divergence gradient becomes exactly proportional to the gradient of the reconstruction error, which is not the case for the general score matching estimator. Our work also gives an explicit hierarchical realization of auto-encoder-based sampling (Alain & Bengio, 2014). The connection with the Hopfield network also appears to be new.
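A minimal sketch of this one-step procedure for the prototype model (our code; the names are hypothetical): a single Langevin step from the observed image, followed by the gradient step of Algorithm 1 on the resulting pair.

```python
import numpy as np

def cd1_update(I_obs, w, b, eps, eta, rng):
    """One-step contrastive divergence for the prototype model (7).
    If the Langevin step leaves the activation pattern unchanged, the
    expected update descends the reconstruction error
    ||I_obs - sum_k delta_k w_k||^2 (Proposition 3)."""
    d_obs = (w @ I_obs + b > 0).astype(float)
    recon = w.T @ d_obs                        # B_{w,delta} on this piece
    I_syn = I_obs - (eps ** 2 / 2) * (I_obs - recon) \
            + eps * rng.standard_normal(I_obs.shape)   # one step of (10)
    d_syn = (w @ I_syn + b > 0).astype(float)
    w_new = w + eta * (np.outer(d_obs, I_obs) - np.outer(d_syn, I_syn))
    b_new = b + eta * (d_obs - d_syn)
    return w_new, b_new
```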
Figure 1. Generating texture patterns. For each category, the first image is the training image, and the rest are 2 of the images generated by the learning algorithm.

Figure 2. Generating object patterns. For each category, the first row displays 4 of the training images, and the second row displays 4 of the images generated by the learning algorithm.

Figure 3. Reconstruction by one-step contrastive divergence. The first row displays 4 of the training images, and the second row displays the corresponding reconstructed images.

7. Synthesis and Reconstruction

We show that the generative ConvNet is capable of learning and generating realistic natural image patterns. Such an empirical proof of concept validates the generative capacity of the model. We also show that contrastive divergence learning can indeed reconstruct the observed images, thus empirically validating Proposition 3. The code in our experiments is based on the MatConvNet package of (Vedaldi & Lenc, 2015).

Unlike (Lu et al., 2016), the generative ConvNets in our experiments are learned from scratch, without relying on the pre-learned filters of existing ConvNets. When learning the generative ConvNet, we grow the layers sequentially: starting from the first layer, we add the layers one by one, and each time we learn the model and generate the synthesized images using Algorithm 1. While learning each new layer of filters, we also refine the lower layers of filters by back-propagation.

We use $\tilde{M} = 16$ parallel chains for Langevin sampling. The number of Langevin iterations between every two consecutive updates of the parameters is $L = 10$. With each newly added layer, the number of learning iterations is $T = 700$. We follow the standard procedure to prepare training images of size $224 \times 224$, whose intensities are within $[0, 255]$, and we subtract the mean image. We fix $\sigma^2 = 1$ in the reference distribution $q(\mathbf{I})$. For each of the 3 experiments, we use the same set of parameters for all the categories, without tuning.

7.1. Experiment 1: Generating texture patterns

We learn a 3-layer generative ConvNet. The first layer has 100 filters of size $15 \times 15$ with sub-sampling size 3; the second layer has 64 filters of size $5 \times 5$ with sub-sampling size 1; the third layer has 30 filters of size $3 \times 3$ with sub-sampling size 1. We learn a generative ConvNet for each category from a single training image. Figure 1 displays the results. For each category, the first image is the training image, and the rest are 2 of the images generated by the learning algorithm.

7.2. Experiment 2: Generating object patterns

Experiment 1 shows clearly that the generative ConvNet can learn from images without alignment. We can also specialize it to learning aligned object patterns by using a single top-layer filter that covers the whole image. We learn a 4-layer generative ConvNet from images of aligned objects. The first layer has 100 filters of size $7 \times 7$ with sub-sampling size 2; the second layer has 64 filters of size $5 \times 5$ with sub-sampling size 1; the third layer has 20 filters of size $3 \times 3$ with sub-sampling size 1; the fourth layer is a fully connected layer with a single filter that covers the whole image. When growing the layers, we always keep the top-layer single filter and train it together with the existing layers. We learn a generative ConvNet for each category, where the number of training images per category is around 10, collected from the Internet. Figure 2 shows the results. For each category, the first row displays 4 of the training images, and the second row shows 4 of the images generated by the learning algorithm.

7.3. Experiment 3: Contrastive divergence learns to auto-encode

We experiment with one-step contrastive divergence learning on a small training set of 10 images collected from the Internet. The ConvNet structure is the same as in Experiment 1. For computational efficiency, we learn all the layers of filters simultaneously. The number of learning iterations is $T = 1200$. Starting from the observed images, the number of Langevin iterations is $L = 1$. Figure 3 shows the results. The first row displays 4 of the training images, and the second row displays the corresponding auto-encoding reconstructions with the learned parameters.
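For concreteness, the Experiment 1 architecture can be written down in the layer format of the Section 3 sketch, reusing `score` from that sketch. The initialization scale and the small test image are our assumptions (the paper trains on $224 \times 224$ images); this only illustrates the shapes, not the learned filters.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_out, n_in, size, stride):
    """Randomly initialized (weights, bias, stride) triple; the 0.01
    scale is an arbitrary choice for illustration."""
    return (0.01 * rng.standard_normal((n_out, n_in, size, size)),
            np.zeros(n_out), stride)

# Experiment 1: 100 filters of 15x15 (sub-sampling 3), then 64 of 5x5
# and 30 of 3x3 (sub-sampling 1).
layers = [init_layer(100, 3, 15, 3),
          init_layer(64, 100, 5, 1),
          init_layer(30, 64, 3, 1)]

I = rng.standard_normal((3, 63, 63))   # small random image, for speed
print(score(I, layers))                # scoring function f(I; w) from Section 3
```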
8. Conclusion

This paper derives the generative ConvNet from the discriminative ConvNet and reveals an internal representational structure that is unique among energy-based models. The generative ConvNet has the potential to learn from big unlabeled data, either by contrastive divergence or by reconstruction-based methods.

Code and data

The code and training images can be downloaded from the project page: http://www.stat.ucla.edu/~ywu/GenerativeConvNet/main.html

Acknowledgement

The code in our work is based on the MATLAB code of MatConvNet (Vedaldi & Lenc, 2015); we thank the authors for making their code public. We thank the three reviewers for their insightful comments, which have helped us improve the presentation of the paper. We thank Jifeng Dai and Wenze Hu for earlier collaborations related to this work. The work is supported by NSF DMS 1310391, ONR MURI N00014-10-1-0933, and DARPA SIMPLEX N66001-15-C-4035.

References

Alain, Guillaume and Bengio, Yoshua. What regularized auto-encoders learn from the data-generating distribution. Journal of Machine Learning Research, 15(1):3563–3593, 2014.

Dai, Jifeng, Lu, Yang, and Wu, Ying Nian. Generative modeling of convolutional neural networks. In ICLR, 2015.

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006a.

Hinton, Geoffrey E., Osindero, Simon, Welling, Max, and Teh, Yee-Whye. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006b.

Hopfield, John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

Hyvärinen, Aapo. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Hyvärinen, Aapo. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu Jie. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609–616. ACM, 2009.

Lu, Yang, Zhu, Song-Chun, and Wu, Ying Nian. Learning FRAME models using CNN filters. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Montufar, Guido F., Pascanu, Razvan, Cho, Kyunghyun, and Bengio, Yoshua. On the number of linear regions of deep neural networks. In NIPS, pp. 2924–2932, 2014.

Ngiam, Jiquan, Chen, Zhenghao, Koh, Pang Wei, and Ng, Andrew Y. Learning deep energy models. In ICML, 2011.

Olshausen, Bruno A. and Field, David J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

Roth, Stefan and Black, Michael J. Fields of experts: A framework for learning image priors. In CVPR, volume 2, pp. 860–867. IEEE, 2005.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep Boltzmann machines. In AISTATS, 2009.
Swersky, Kevin, Ranzato, Marc'Aurelio, Buchman, David, Marlin, Benjamin, and de Freitas, Nando. On autoencoders and score matching for energy based models. In ICML, pp. 1201–1208. ACM, 2011.

Teh, Yee-Whye, Welling, Max, Osindero, Simon, and Hinton, Geoffrey E. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.

Vedaldi, A. and Lenc, K. MatConvNet: convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia, 2015.

Vincent, Pascal. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Wu, Ying Nian, Zhu, Song-Chun, and Guo, Cheng-En. From information scaling of natural images to regimes of statistical models. Quarterly of Applied Mathematics, 66:81–122, 2008.

Xie, Jianwen, Hu, Wenze, Zhu, Song-Chun, and Wu, Ying Nian. Learning sparse FRAME models for natural image patterns. International Journal of Computer Vision, pp. 1–22, 2014.

Xie, Jianwen, Lu, Yang, Zhu, Song-Chun, and Wu, Ying Nian. Inducing wavelets into random fields via generative boosting. Applied and Computational Harmonic Analysis, 41:4–25, 2016.

Younes, Laurent. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-4):177–228, 1999.

Zeiler, Matthew D. and Fergus, Rob. Visualizing and understanding convolutional networks. In ECCV, 2014.

Zhu, Song-Chun and Mumford, David. GRADE: Gibbs reaction and diffusion equations. In ICCV, pp. 847–854, 1998.

Zhu, Song-Chun, Wu, Ying Nian, and Mumford, David. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627–1660, 1997.