# Learning Infinite RBMs with Frank-Wolfe

Wei Ping (Computer Science, UC Irvine), Qiang Liu (Computer Science, Dartmouth College), Alexander Ihler (Computer Science, UC Irvine)
{wping,ihler}@ics.uci.edu, qliu@cs.dartmouth.edu

Abstract

In this work, we propose an infinite restricted Boltzmann machine (RBM), whose maximum likelihood estimation (MLE) corresponds to a constrained convex optimization. We consider the Frank-Wolfe algorithm to solve the program, which provides a sparse solution that can be interpreted as inserting a hidden unit at each iteration, so that the optimization process takes the form of a sequence of finite models of increasing complexity. As a side benefit, this can be used to easily and efficiently identify an appropriate number of hidden units during the optimization. The resulting model can also be used as an initialization for typical state-of-the-art RBM training algorithms such as contrastive divergence, leading to models with consistently higher test likelihood than random initialization.

1 Introduction

Restricted Boltzmann machines (RBMs) are two-layer latent variable models that use a layer of hidden units h to model the distribution of visible units v [Smolensky, 1986, Hinton, 2002]. RBMs have been widely used to capture complex distributions in numerous application domains, including image modeling [Krizhevsky et al., 2010], human motion capture [Taylor et al., 2006] and collaborative filtering [Salakhutdinov et al., 2007], and are also widely used as building blocks for deep generative models, such as deep belief networks [Hinton et al., 2006] and deep Boltzmann machines [Salakhutdinov and Hinton, 2009]. Due to the intractability of the likelihood function, RBMs are usually learned using the contrastive divergence (CD) algorithm [Hinton, 2002, Tieleman, 2008], which approximates the gradient of the likelihood using a Gibbs sampler.

One practical problem when using an RBM is that we need to decide the size of the hidden layer (the number of hidden units) before performing learning, and it can be challenging to decide the optimal size. One simple heuristic is to search for the best number of hidden units within a pre-defined candidate set, using cross validation or test likelihood. Unfortunately, this is extremely time-consuming; it involves running a full training algorithm (e.g., CD) for each possible size, and thus we can only search over a relatively small set of sizes with this approach. In addition, because the log-likelihood of the RBM is highly non-convex, its performance is sensitive to the initialization of the learning algorithm. Although random initializations (to relatively small values) are routinely used in practice with algorithms like CD, it would be valuable to explore more robust algorithms that are less sensitive to the initialization, as well as smarter initialization strategies to obtain better results.

In this work, we propose a fast, greedy algorithm for training RBMs by inserting one hidden unit at each iteration. Our algorithm provides an efficient way to determine the size of the hidden layer in an adaptive fashion, and can also be used as an initialization for a full CD-like learning algorithm.
Our method is based on constructing a convex relaxation of the RBM that is parameterized by a distribution over the weights of the hidden units, for which the training problem can be framed as a convex functional optimization and solved using an efficient Frank-Wolfe algorithm [Frank and Wolfe, 1956, Jaggi, 2013] that effectively adds one hidden unit at each iteration by solving a relatively fast inner-loop optimization.

Related Work

Our contributions connect to a number of different themes of existing work within machine learning and optimization. Here we give a brief discussion of prior related work.

There have been a number of works on convex relaxations of latent variable models in functional space, which are related to the gradient boosting method [Friedman, 2001]. In supervised learning, Bengio et al. [2005] propose a convex neural network in which the number of hidden units is unbounded and can be learned, and Bach [2014] analyzes the appealing theoretical properties of such a model. For clustering problems, several works on convex functional relaxation have also been proposed [e.g., Nowozin and Bakir, 2008, Bradley and Bagnell, 2009]. Other forms of convex relaxation have also been developed for two-layer latent variable models [e.g., Aslan et al., 2013].

There has also been considerable work on extending directed/hierarchical models into infinite models such that the dimensionality of the latent space can be automatically inferred during learning. Most of these methods are Bayesian nonparametric models, and a brief overview can be found in Orbanz and Teh [2011]. A few directions have been explored for undirected models, particularly RBMs. Welling et al. [2002] propose a boosting algorithm in the feature space of the model; a new feature, rather than a new hidden unit, is added to the RBM at each boosting iteration. Nair and Hinton [2010] conceptually tie the weights of an infinite number of binary hidden units, and connect these sigmoid units with noisy rectified linear units (ReLUs). Recently, Côté and Larochelle [2015] extend an ordered RBM model to an infinite number of hidden units, and Nalisnick and Ravi [2015] use the same technique for word embedding. The ordered RBM is sensitive to the ordering of its hidden units and can be viewed as a mixture of RBMs. In contrast, our model incorporates regular RBMs as a special case, and enables model selection for standard RBMs.

The Frank-Wolfe method [Frank and Wolfe, 1956] (a.k.a. conditional gradient) is a classical algorithm for solving constrained convex optimization. It has recently received much attention because it unifies a large variety of sparse greedy methods [Jaggi, 2013], including boosting algorithms [e.g., Beygelzimer et al., 2015], learning with dual structured SVMs [Lacoste-Julien et al., 2013] and marginal inference using MAP in graphical models [e.g., Belanger et al., 2013, Krishnan et al., 2015]. Verbeek et al. [2003] propose a greedy learning algorithm for Gaussian mixture models, which inserts a new component at each step and resembles our algorithm in its procedure. As one benefit, it provides a better initialization for EM than random initialization. Likas et al. [2003] investigate greedy initialization in k-means clustering.
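To fix ideas before the derivation in Section 3.2, the following is a minimal generic Frank-Wolfe loop in Python (our standalone illustration, not code from the paper; the toy objective, oracle, and variable names are our own). It shows the two ingredients our method relies on: a linear minimization oracle over the feasible set, and a convex-combination update that keeps iterates feasible and sparse.

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, num_iters=100):
    """Generic Frank-Wolfe (conditional gradient) for min_{x in M} f(x).

    grad_f: gradient of the convex objective f
    lmo:    linear minimization oracle, s = argmin_{s in M} <s, g>
    x0:     feasible starting point
    """
    x = x0
    for t in range(1, num_iters + 1):
        g = grad_f(x)
        s = lmo(g)                       # solve the linearized subproblem
        beta = 1.0 / t                   # the 1/t step size used in this paper
        x = (1 - beta) * x + beta * s    # convex combination stays in M
    return x

# Toy example: minimize ||x - c||^2 over the probability simplex. The LMO
# over the simplex returns a vertex (a one-hot vector), so the iterate is a
# sparse convex combination of vertices -- in our setting, each "vertex" is
# a delta distribution on a single hidden-unit weight vector.
c = np.array([0.1, 0.6, 0.3])
grad = lambda x: 2 * (x - c)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]
x_opt = frank_wolfe(grad, lmo, x0=np.ones(3) / 3)
```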
2 Background

A restricted Boltzmann machine (RBM) is an undirected graphical model that defines a joint distribution over the vectors of visible units $v \in \{0,1\}^{|v| \times 1}$ and hidden units $h \in \{0,1\}^{|h| \times 1}$,

$$p(v, h \mid \theta) = \frac{1}{Z(\theta)} \exp\big(v^\top W h + b^\top v\big); \qquad Z(\theta) = \sum_{v,h} \exp\big(v^\top W h + b^\top v\big), \tag{1}$$

where $|v|$ and $|h|$ are the dimensions of $v$ and $h$ respectively, and $\theta := \{W, b\}$ are the model parameters, including the pairwise interaction term $W \in \mathbb{R}^{|v| \times |h|}$ and the bias term $b \in \mathbb{R}^{|v| \times 1}$ for the visible units. Here we drop the bias term for the hidden units $h$, since it can be achieved by introducing a dummy visible unit whose value is always one. The partition function $Z(\theta)$ serves to normalize the probability to sum to one, and is typically intractable to calculate exactly.

Because RBMs have a bipartite structure, the conditional distributions $p(v \mid h; \theta)$ and $p(h \mid v; \theta)$ are fully factorized and can be calculated in closed form,

$$p(h \mid v, \theta) = \prod_{i=1}^{|h|} p(h_i \mid v), \quad \text{with } p(h_i = 1 \mid v) = \sigma\big(v^\top W_{\cdot i}\big),$$
$$p(v \mid h, \theta) = \prod_{j=1}^{|v|} p(v_j \mid h), \quad \text{with } p(v_j = 1 \mid h) = \sigma\big(W_{j \cdot}\, h + b_j\big), \tag{2}$$

where $\sigma(u) = 1/(1 + \exp(-u))$ is the logistic function, and $W_{\cdot i}$ and $W_{j \cdot}$ are the $i$-th column and $j$-th row of $W$, respectively. Eq. (2) allows us to derive an efficient blocked Gibbs sampler that iteratively alternates between drawing $v$ and $h$.

The marginal log-likelihood of the RBM is

$$\log p(v \mid \theta) = \sum_{i=1}^{|h|} \log\big(1 + \exp(w_i^\top v)\big) + b^\top v - \log Z(\theta), \tag{3}$$

where $w_i := W_{\cdot i}$ is the $i$-th column of $W$ and corresponds to the weights connected to the $i$-th hidden unit. Because we assume each hidden unit $h_i$ takes values in $\{0, 1\}$, we get the softplus function $\log(1 + \exp(w_i^\top v))$ when we marginalize $h_i$. This form shows that the (marginal) free energy of the RBM is a sum of a linear term $b^\top v$ and a set of softplus functions with different weights $w_i$; this provides a foundation for our development.

Given a dataset $\{v^n\}_{n=1}^N$, the gradient of the log-likelihood for each data point $v^n$ is

$$\frac{\partial \log p(v^n \mid \theta)}{\partial W} = \mathbb{E}_{p(h \mid v^n; \theta)}\big[v^n h^\top\big] - \mathbb{E}_{p(v, h \mid \theta)}\big[v h^\top\big] = v^n (\mu^n)^\top - \mathbb{E}_{p(v, h \mid \theta)}\big[v h^\top\big], \tag{4}$$

where $\mu^n = \sigma(W^\top v^n)$ and the logistic function $\sigma$ is applied in an element-wise manner. The positive part of the gradient can be calculated exactly, since the conditional distribution $p(h \mid v^n)$ is fully factorized. The negative part arises from the derivative of the log-partition function and is intractable. Stochastic optimization algorithms, such as CD [Hinton, 2002] and persistent CD [Tieleman, 2008], are popular methods to approximate the intractable expectation using Gibbs sampling.

3 RBM with Infinite Hidden Units

In this section, we first generalize the RBM model defined in Eq. (3) to a model with an infinite number of hidden units, which can also be viewed as a convex relaxation of the RBM in functional space. Then, we describe the learning algorithm.

3.1 Model Definition

Our general model is motivated by Eq. (3), in which the first term can be treated as an empirical average of the softplus function $\log(1 + \exp(w^\top v))$ under an empirical distribution over the weights $\{w_i\}$. To extend this, we define a general distribution $q(w)$ over the weight $w$, and replace the empirical average with the expectation under $q(w)$; this gives the following generalization of an RBM with an infinite (possibly uncountable) number of hidden units,

$$\log p(v \mid q, \vartheta) = \alpha\, \mathbb{E}_{q(w)}\big[\log(1 + \exp(w^\top v))\big] + b^\top v - \log Z(q, \vartheta), \tag{5}$$
$$Z(q, \vartheta) = \sum_v \exp\Big(\alpha\, \mathbb{E}_{q(w)}\big[\log(1 + \exp(w^\top v))\big] + b^\top v\Big),$$

where $\vartheta := \{b, \alpha\}$, $\alpha > 0$ is a temperature parameter which controls the effective number of hidden units in the model, and $\mathbb{E}_{q(w)}[f(w)] := \int_w q(w) f(w)\, dw$. Note that $q(w)$ is assumed to be properly normalized, i.e., $\int_w q(w)\, dw = 1$.
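To make the particle-based form of Eq. (5) concrete, here is a short NumPy sketch (our illustration, with our own function names and shapes) that evaluates the unnormalized log-density for a finite-particle $q(w)$, and checks that with $\alpha$ equal to the number of particles it reduces to the standard RBM free energy of Eq. (3), per the construction above.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # numerically stable log(1 + exp(x))

def log_p_unnorm(v, particles, alpha, b):
    """Unnormalized log p(v | q, theta) of Eq. (5) for a particle-based
    q(w) = (1/K) * sum_i I(w = w_i), with the w_i stored as columns of
    `particles`. E_{q(w)}[log(1 + exp(w^T v))] is then just an average
    over the particles."""
    avg_softplus = softplus(v @ particles).mean()
    return alpha * avg_softplus + b @ v

# With alpha = K (number of particles), this recovers the standard RBM
# free energy of Eq. (3): alpha * (1/K) * sum_i softplus(w_i^T v)
#                         = sum_i softplus(w_i^T v).
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.1, size=(784, 50)), np.zeros(784)
v = (rng.random(784) < 0.5) * 1.0
assert np.isclose(log_p_unnorm(v, W, alpha=50, b=b),
                  np.sum(softplus(v @ W)) + b @ v)
```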
Intuitively, (5) defines a semi-parametric model whose log probability is a sum of a linear bias term parameterized by $b$, and a nonlinear term parameterized by the weight distribution $q(w)$, with $\alpha$ controlling the magnitude of the nonlinear term. This model can be regarded as a convex relaxation of the regular RBM, as shown in the following result.

Proposition 3.1. The model in Eq. (5) includes the standard RBM (3) as a special case by constraining $q(w) = \frac{1}{|h|} \sum_{i=1}^{|h|} \mathbb{I}(w = w_i)$ and $\alpha = |h|$. Moreover, the log-likelihood of the model is concave w.r.t. the function $q(w)$, $\alpha$ and $b$ respectively, and is jointly concave in $q(w)$ and $b$.

We should point out that the parameter $\alpha$ plays a special role in this model: we recover the standard RBM only when $\alpha$ equals the number $|h|$ of particles in $q(w) = \frac{1}{|h|} \sum_{i=1}^{|h|} \mathbb{I}(w = w_i)$, and would otherwise get a fractional RBM. The fractional RBM leads to a more challenging inference problem than a standard RBM, since the standard Gibbs sampler is no longer directly applicable. We discuss this point further in Section 3.3.

Given a dataset $\{v^n\}_{n=1}^N$, we learn the parameters $q$ and $\vartheta$ using a penalized maximum likelihood estimator (MLE) that involves a convex functional optimization:

$$\arg\max_{q \in \mathcal{M},\, \vartheta} \left\{ \sum_{n=1}^N \log p(v^n \mid q, \vartheta) - \frac{\lambda}{2}\, \mathbb{E}_{q(w)}\big[\|w\|^2\big] \right\}, \tag{6}$$

where $\mathcal{M}$ is the set of valid distributions and we introduce a functional $L_2$ norm regularization $\mathbb{E}_{q(w)}[\|w\|^2]$ to penalize the likelihood for large values of $w$. Alternatively, we could equivalently optimize the likelihood on $\mathcal{M}_C = \{q \mid q(w) \ge 0 \text{ and } \int_{\|w\|^2 \le C} q(w)\, dw = 1\}$, which restricts the probability mass to a 2-norm ball $\|w\|^2 \le C$.

3.2 Learning Infinite RBMs with Frank-Wolfe

It is challenging to directly solve the optimization in Eq. (6) by standard gradient descent methods, because it involves optimizing the function $q(w)$, which is infinite-dimensional. Instead, we propose to solve it using the Frank-Wolfe algorithm [Jaggi, 2013], which is projection-free and provides a sparse solution. Assume we already have $q_t$ at iteration $t$; Frank-Wolfe then finds $q_{t+1}$ by maximizing the linearization of the objective function:

$$q_{t+1} \leftarrow (1 - \beta_{t+1})\, q_t + \beta_{t+1}\, r_{t+1}, \quad \text{where} \quad r_{t+1} \in \arg\max_{q \in \mathcal{M}} \big\langle q,\ \nabla_q L(q_t, \vartheta_t) \big\rangle, \tag{7}$$

where $\beta_{t+1} \in [0, 1]$ is a step-size parameter, and the convex combination step guarantees the new $q_{t+1}$ remains a distribution after the update. A typical step size is $\beta_t = 1/t$, in which case we have $q_t(w) = \frac{1}{t} \sum_{s=1}^t r_s(w)$; that is, $q_t$ equals the average of all the earlier solutions obtained by the linear program.

To apply Frank-Wolfe to our problem, we need to calculate the functional gradient $\nabla_q L(q_t, \vartheta_t)$ in Eq. (7). We can show that (see details in the Appendix)

$$\nabla_q L(q_t, \vartheta_t) = -\frac{\lambda}{2} \|w\|^2 + \alpha_t \left[ \frac{1}{N} \sum_{n=1}^N \log\big(1 + \exp(w^\top v^n)\big) - \sum_v p(v \mid q_t, \vartheta_t) \log\big(1 + \exp(w^\top v)\big) \right],$$

where $p(v \mid q_t, \vartheta_t)$ is the distribution parameterized by the weight density $q_t(w)$ and parameter $\vartheta_t$,

$$p(v \mid q_t, \vartheta_t) = \frac{\exp\Big(\alpha_t\, \mathbb{E}_{q_t(w)}\big[\log(1 + \exp(w^\top v))\big] + b_t^\top v\Big)}{Z(q_t, \vartheta_t)}. \tag{8}$$

Because the objective in the (functional) linear program in Eq. (7) is linear in $q$, its optimum is attained at a point mass, so the program is equivalent to an optimization over the weight vector $w$:

$$\max_{q \in \mathcal{M}} \big\langle q,\ \nabla_q L(q_t, \vartheta_t) \big\rangle = \max_{q \in \mathcal{M}} \mathbb{E}_{q(w)}\big[\nabla_q L(q_t, \vartheta_t)\big]$$
$$\iff \min_w \left\{ \frac{\lambda}{2} \|w\|^2 + \sum_v p(v \mid q_t, \vartheta_t) \log\big(1 + \exp(w^\top v)\big) - \frac{1}{N} \sum_{n=1}^N \log\big(1 + \exp(w^\top v^n)\big) \right\}. \tag{9}$$

The gradient of the objective in (9) is

$$\nabla_w = \lambda w + \mathbb{E}_{p(v \mid q_t, \vartheta_t)}\big[\sigma(w^\top v)\, v\big] - \frac{1}{N} \sum_{n=1}^N \sigma(w^\top v^n)\, v^n,$$

where the expectation over $p(v \mid q_t, \vartheta_t)$ can be intractable to calculate, and one may use stochastic optimization and draw samples using MCMC.
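Once the model expectation is replaced by a Monte Carlo average over samples, the inner problem in Eq. (9) is a finite-dimensional, smooth optimization, which is what makes an off-the-shelf solver practical. Below is a sketch (our code under that sampling assumption; the array names, default regularization value, and helper names are our own) of the sampled objective, its gradient, and an L-BFGS solve with SciPy.

```python
import numpy as np
from scipy.optimize import minimize

def softplus(x):
    return np.logaddexp(0.0, x)  # stable log(1 + exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inner_objective(w, V_data, V_model, lam):
    """Objective of Eq. (9), with the model expectation replaced by an
    average over rows of V_model, samples drawn from p(v | q_t, theta_t)."""
    obj = (0.5 * lam * w @ w
           + softplus(V_model @ w).mean()   # model term
           - softplus(V_data @ w).mean())   # data term
    grad = (lam * w
            + V_model.T @ sigmoid(V_model @ w) / len(V_model)
            - V_data.T @ sigmoid(V_data @ w) / len(V_data))
    return obj, grad

def new_hidden_unit(V_data, V_model, lam=0.1):
    """Solve Eq. (9) with L-BFGS to obtain the weight vector w_t^* of
    the hidden unit to insert at this Frank-Wolfe iteration."""
    w0 = 0.01 * np.random.randn(V_data.shape[1])
    res = minimize(inner_objective, w0, args=(V_data, V_model, lam),
                   jac=True, method="L-BFGS-B")
    return res.x
```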
Note that the last two terms in the gradient enforce an intuitive moment-matching condition: the optimal $w$ introduces a set of importance weights $\sigma(w^\top v)$ that adjust the empirical data and the previous model, such that their moments match each other.

Now, suppose $w_t^*$ is the optimum of Eq. (9) at iteration $t$. The component $r_t(w)$ that we add can be shown to be the delta function at $w_t^*$, that is, $r_t(w) = \mathbb{I}(w = w_t^*)$; in addition, we have $q_t(w) = \frac{1}{t} \sum_{s=1}^t \mathbb{I}(w = w_s^*)$ when the step size is taken to be $\beta_t = 1/t$. Therefore, this Frank-Wolfe update can be naturally interpreted as greedily inserting a hidden unit into the current model $p(v \mid q_t, \vartheta_t)$. In particular, if we update the temperature parameter as $\alpha_t \leftarrow t$, then according to Proposition 3.1 we can directly transform our model $p(v \mid q_t, \vartheta_t)$ into a regular RBM after each Frank-Wolfe step, which enables convenient blocked Gibbs sampling for inference.

Algorithm 1 Frank-Wolfe Learning Algorithm

Input: training data $\{v^n\}_{n=1}^N$; step size $\eta$; regularization $\lambda$.
Output: sparse solution $q^*(w)$ and $\vartheta^*$.
Initialize $q_0(w) = \mathbb{I}(w = w^*)$ at a random $w^*$; $b_0 = 0$; $\alpha_0 = 1$.
for $t = 1 : T$ [or until a stopping criterion is met] do
  Draw samples $\{v^s\}_{s=1}^S$ from $p(v \mid q_{t-1}, \vartheta_{t-1})$;
  $w_t^* = \arg\min_w \big\{ \frac{\lambda}{2}\|w\|^2 + \frac{1}{S}\sum_{s=1}^S \log(1 + \exp(w^\top v^s)) - \frac{1}{N}\sum_{n=1}^N \log(1 + \exp(w^\top v^n)) \big\}$;
  Update $q_t(w) \leftarrow (1 - \frac{1}{t})\, q_{t-1}(w) + \frac{1}{t}\, \mathbb{I}(w = w_t^*)$;
  Update $\alpha_t \leftarrow t$ (optional: gradient descent);
  Set $b_t = b_{t-1}$;
  repeat
    Draw a mini-batch of samples $\{v^m\}_{m=1}^M$ from $p(v \mid q_t, \vartheta_t)$;
    Update $b_t \leftarrow b_t + \eta\, \big(\frac{1}{N}\sum_{n=1}^N v^n - \frac{1}{M}\sum_{m=1}^M v^m\big)$;
  until convergence
end for
Return $q^*(w) = q_t(w)$; $\vartheta^* = \{b_t, \alpha_t\}$.

Compared with the (regularized) MLE of the standard RBM (e.g., in Eq. (4)), the optimization in Eq. (9) has the following nice properties: (1) The current model $p(v \mid q_t, \vartheta_t)$ does not depend on $w$, which means we can draw enough samples from $p(v \mid q_t, \vartheta_t)$ at each iteration $t$ and reuse them during the optimization of $w$. (2) The objective function in Eq. (9) can be evaluated explicitly given a set of samples, and hence efficient off-the-shelf optimization tools such as L-BFGS can be used to solve the optimization very efficiently. (3) Each iteration of our method involves many fewer parameters (only the weights of a single hidden unit, a $|v| \times 1$ vector, are updated, instead of the full $|v| \times |h|$ weight matrix), and hence defines a series of easier problems that can be less sensitive to initialization. We note that a similar greedy learning strategy has been successfully applied for learning mixture models [Verbeek et al., 2003], in which one greedily inserts a component at each step, and that this approach can provide better initialization for EM optimization than using multiple random initializations.

Once we obtain $q_{t+1}$, we can update the bias parameter $b_t$ by gradient descent,

$$\nabla_b L(q_{t+1}, \vartheta_t) = \frac{1}{N} \sum_{n=1}^N v^n - \sum_v p(v \mid q_{t+1}, \vartheta_t)\, v. \tag{10}$$

One can further optimize $\alpha_t$ by gradient descent (see the Appendix for the definition of $\nabla_\alpha L(q_t, \vartheta_t)$), but we find simply updating $\alpha_t \leftarrow t$ is more efficient and works well in practice. We summarize our Frank-Wolfe learning algorithm in Algorithm 1; a Python sketch of the full loop follows below.

Adding hidden units to an RBM. Besides initializing $q(w)$ to be a delta function at some random $w$ and learning the model from scratch, one can also adapt Algorithm 1 to incrementally add hidden units to an existing RBM of the form in Eq. (3) (e.g., one that has been learned by CD). According to Proposition 3.1, one can simply initialize $q_t(w) = \frac{1}{|h|} \sum_{i=1}^{|h|} \mathbb{I}(w = w_i)$, $\alpha_t = |h|$, and continue the Frank-Wolfe iterations at $t = |h| + 1$.
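Here is a compact Python sketch of Algorithm 1 under the $\alpha_t \leftarrow t$ setting (our paraphrase under simplifying assumptions, not the authors' implementation): model samples come from short Gibbs chains, `new_hidden_unit` is the Eq. (9) solver sketched earlier, and all sizes and hyperparameters are placeholders.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_step(v, W, b, rng):
    """One blocked Gibbs sweep of the RBM conditionals in Eq. (2)."""
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)) * 1.0
    v_new = (rng.random(W.shape[0]) < sigmoid(W @ h + b)) * 1.0
    return v_new, h

def sample_model(W, b, S, n_steps, rng):
    """Draw S approximate samples from the current model, which is a
    regular RBM by Proposition 3.1 since we keep alpha_t = t."""
    V = np.empty((S, W.shape[0]))
    for s in range(S):
        v = (rng.random(W.shape[0]) < 0.5) * 1.0
        for _ in range(n_steps):
            v, _ = gibbs_step(v, W, b, rng)
        V[s] = v
    return V

def fw_learn(V_data, T=100, eta=0.01, lam=0.1, S=200, n_steps=50, seed=0):
    """Sketch of Algorithm 1; relies on new_hidden_unit (the Eq. (9)
    solver) from the previous sketch."""
    rng = np.random.default_rng(seed)
    N, d = V_data.shape
    W = 0.01 * rng.standard_normal((d, 1))   # q_0: delta at a random w*
    b = np.zeros(d)
    for t in range(2, T + 1):
        V_model = sample_model(W, b, S, n_steps, rng)
        w_star = new_hidden_unit(V_data, V_model, lam)  # solve Eq. (9)
        W = np.column_stack([W, w_star])                # insert hidden unit
        V_model = sample_model(W, b, S, n_steps, rng)
        b += eta * (V_data.mean(0) - V_model.mean(0))   # ascent on Eq. (10)
    return W, b
```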
Removing hidden units. Since the hidden units are added in a greedy manner, one may want to remove an old hidden unit during the Frank-Wolfe learning if it performs poorly with respect to our objective Eq. (9) after more hidden units have been added. A variant of Frank-Wolfe with away-steps [Guélat and Marcotte, 1986] fits this requirement and can be directly applied. As shown by Clarkson [2010], it can improve the sparsity of the final solution (i.e., yield fewer hidden units in the learned model).

3.3 MCMC Inference for Fractional RBMs

As we point out in Section 3.1, we need to take $\alpha$ equal to the number of particles in $q(w)$ (that is, $\alpha_t \leftarrow t$ in Algorithm 1) in order to have our model reduce to the standard RBM. If $\alpha$ takes a more general real value, we obtain a more general fractional RBM model, for which inference is more challenging because the standard block Gibbs sampler is not directly applicable. In practice, we find that setting $\alpha_t \leftarrow t$ to correspond to a regular RBM seems to give the best performance, but for completeness, we discuss the fractional RBM in more detail in this section, and propose a Metropolis-Hastings algorithm to draw samples from it. We believe that this fractional RBM framework provides an avenue for further improvements in future work.

To frame the problem, let us assume $\alpha q(w) = \sum_i c_i\, \mathbb{I}(w = w_i)$, where $c_i$ is a general real number; the corresponding model is

$$\log p(v \mid q, \vartheta) = \sum_i c_i \log\big(1 + \exp(w_i^\top v)\big) + b^\top v - \log Z(q, \vartheta), \tag{11}$$

which differs from the standard RBM in (3) because each softplus function is multiplied by $c_i$. Nevertheless, one may push the $c_i$ inside the softplus function, and obtain a standard RBM that forms an approximation of (11):

$$\log \tilde p(v \mid q, \vartheta) = \sum_i \log\big(1 + \exp(c_i\, w_i^\top v)\big) + b^\top v - \log \tilde Z(q, \vartheta). \tag{12}$$

This approximation can be justified by considering the special case when the magnitude of the weights $w$ is very large, so that the softplus function essentially reduces to a ReLU function, that is, $\log(1 + \exp(w_i^\top v)) \approx \max(0,\, w_i^\top v)$. In this case, (11) and (12) become equivalent because $c_i \max(0, x) = \max(0, c_i x)$. More concretely, we can guarantee the following bound:

Proposition 3.2. For any $0 < c_i \le 1$, we have

$$\frac{1}{2^{1 - c_i}}\big(1 + \exp(c_i\, w_i^\top v)\big) \;\le\; \big(1 + \exp(w_i^\top v)\big)^{c_i} \;\le\; 1 + \exp(c_i\, w_i^\top v).$$

The proof can be found in the Appendix. Note that we apply the bound when $c_i > 1$ by splitting $c_i$ into the sum of its integer part and fractional remainder, and applying the bound to the fractional part.

Therefore, the fractional RBM (11) can be well approximated by the standard RBM (12), and this can be leveraged to design an inference algorithm for (11). As one example, we can use the Gibbs update of (12) as a proposal for a Metropolis-Hastings update of (11). To be specific, given a configuration $v$, we perform a Gibbs update in the RBM $\tilde p(v \mid q, \vartheta)$ to get $v'$, and accept it with probability $\min(1, A(v \to v'))$, where

$$A(v \to v') = \frac{p(v')\, \tilde T(v' \to v)}{p(v)\, \tilde T(v \to v')},$$

and $\tilde T(v \to v')$ is the Gibbs transition kernel of the RBM $\tilde p(v \mid q, \vartheta)$. Because the acceptance probability of a Gibbs sampler equals one, we have

$$\frac{\tilde p(v)\, \tilde T(v \to v')}{\tilde p(v')\, \tilde T(v' \to v)} = 1.$$

This gives

$$A(v \to v') = \frac{p(v')\, \tilde p(v)}{p(v)\, \tilde p(v')} = \frac{\prod_i \big(1 + \exp(w_i^\top v')\big)^{c_i} \prod_i \big(1 + \exp(c_i\, w_i^\top v)\big)}{\prod_i \big(1 + \exp(w_i^\top v)\big)^{c_i} \prod_i \big(1 + \exp(c_i\, w_i^\top v')\big)}.$$

4 Experiments

In this section, we test the performance of our Frank-Wolfe (FW) learning algorithm on two datasets: MNIST [LeCun et al., 1998] and Caltech101 Silhouettes [Marlin et al., 2010].
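As an illustration of the proposed Metropolis-Hastings scheme, here is a sketch of one MH step (our code under the construction above, with our own names; `gibbs_step` is from the Algorithm 1 sketch). The Gibbs proposal is run on the column-scaled weight matrix, which is exactly the approximating RBM of Eq. (12).

```python
import numpy as np

def log_p_frac(v, W, c, b):
    """Unnormalized log-density of the fractional RBM, Eq. (11)."""
    return np.sum(c * np.logaddexp(0.0, v @ W)) + b @ v

def log_p_approx(v, W, c, b):
    """Unnormalized log-density of the approximating RBM, Eq. (12)."""
    return np.sum(np.logaddexp(0.0, (v @ W) * c)) + b @ v

def mh_step(v, W, c, b, rng):
    """One Metropolis-Hastings step targeting Eq. (11), proposing with a
    blocked Gibbs update of Eq. (12); W * c scales column i by c_i."""
    v_prop, _ = gibbs_step(v, W * c, b, rng)
    # log A(v -> v'); the b^T v terms cancel between p and p-tilde.
    log_A = (log_p_frac(v_prop, W, c, b) + log_p_approx(v, W, c, b)
             - log_p_frac(v, W, c, b) - log_p_approx(v_prop, W, c, b))
    return v_prop if np.log(rng.random()) < log_A else v
```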
The MNIST handwritten digits database contains 60,000 training images and 10,000 test images, where each image $v^n$ has 28 × 28 pixels and is associated with a digit label $y^n$. We binarize the grayscale images by thresholding the pixels at 127, and randomly select 10,000 images from the training set as the validation set. The Caltech101 Silhouettes dataset [Marlin et al., 2010] has 8,671 images with 28 × 28 binary pixels, where each image represents an object's silhouette and has a class label (101 classes overall). The dataset is divided into three subsets: 4,100 examples for training, 2,264 for validation and 2,307 for testing.

[Figure 1: Average test log-likelihood on the two datasets as we increase the number of hidden units (up to 700), for FW, CD (random init.), and CD (FW init.); panels (a) MNIST and (b) Caltech101 Silhouettes. We can see that FW can correctly identify an appropriate hidden layer size with high test log-likelihood (marked by the green dashed line). In addition, CD initialized by FW gives higher test likelihood than random initialization for the same number of hidden units. Best viewed in color.]

Training algorithms. We train RBMs with the CD-10 algorithm, where CD-k refers to using a k-step Gibbs sampler to approximate the gradient of the log-partition function. A fixed learning rate is selected from the set {0.05, 0.02, 0.01, 0.005} using the validation set, and the mini-batch size is selected from the set {10, 20, 50, 100, 200}. We use 200 epochs for training on MNIST and 400 epochs on Caltech101. Early stopping is applied by monitoring the difference of average log-likelihood between training and validation data, so that the intractable log-partition function cancels [Hinton, 2010]. We train RBMs with {20, 50, 100, 200, 300, 400, 500, 600, 700} hidden units.

We incrementally train an RBM with the Frank-Wolfe procedure in Algorithm 1. A fixed step size η is selected from the set {0.05, 0.02, 0.01, 0.005} using the validation data, and a regularization strength λ is selected from the set {1, 0.5, 0.1, 0.05, 0.01}. We set T = 700 in Algorithm 1, and use the same early stopping criterion as for CD. We randomly initialize the CD algorithm 5 times and select the best one on the validation set; meanwhile, we also initialize CD with the model learned by Frank-Wolfe.

Test likelihood. To evaluate the test likelihood of the learned models, we estimate the partition function using annealed importance sampling (AIS) [Salakhutdinov and Murray, 2008]. The temperature parameters are selected following the standard guidance: the first 500 temperatures spaced uniformly from 0 to 0.5, then 4,000 spaced uniformly from 0.5 to 0.9, and 10,000 spaced uniformly from 0.9 to 1.0; this gives a total of 14,500 intermediate distributions. We summarize the average test log-likelihood on MNIST and Caltech101 Silhouettes in Figure 1, where we report results averaged over 500 AIS runs in all experiments, with error bars indicating 3 standard deviations of the estimates. We evaluate the test likelihood of the FW model after adding every 20 hidden units. We perform early stopping when the gap in average log-likelihood between training and validation data increases significantly. As shown in Figure 1, this procedure selects 460 hidden units on MNIST (as indicated by the green dashed lines) and 550 hidden units on Caltech101; purely for illustration purposes, we continue FW in the experiment until reaching T = 700 hidden units.
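For concreteness, the AIS temperature schedule just described can be constructed as follows (a sketch; the exact endpoint conventions at the segment boundaries are our assumption, since the text only specifies the counts and ranges):

```python
import numpy as np

# AIS inverse-temperature schedule described above: 500 values uniform on
# [0, 0.5), 4,000 on [0.5, 0.9), and 10,000 on [0.9, 1.0], giving 14,500
# intermediate distributions in total.
betas = np.concatenate([
    np.linspace(0.0, 0.5, 500, endpoint=False),
    np.linspace(0.5, 0.9, 4000, endpoint=False),
    np.linspace(0.9, 1.0, 10000),
])
assert len(betas) == 14500
```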
We can see that the identified number of hidden units roughly corresponds to the maximum of the test log-likelihood of all three algorithms, suggesting that FW can identify the appropriate number of hidden units during the optimization. We also use the model learned by FW as an initialization for CD (the blue lines in Figure 1), and find that it consistently performs better than the best result of CD with 5 random initializations. In our implementation, the running time of the FW procedure is at most twice that of CD for the same number of hidden units. Therefore, FW-initialized CD provides a practical strategy for learning RBMs: it requires approximately three times the computation time of a single run of CD, while simultaneously identifying the proper number of hidden units and obtaining better test likelihood.

[Figure 2: Classification error when using the learned hidden representations as features; panels (a) MNIST and (b) Caltech101 Silhouettes plot test error (%) against the number of hidden units for FW, CD (random init.), and CD (FW init.).]

Classification. The performance of our method is further evaluated on discriminative image classification tasks. We take the hidden unit activation vectors $\mathbb{E}_{p(h \mid v^n)}[h]$ generated by the three algorithms in Figure 1 and use them as features in a multi-class logistic regression on the class labels $y^n$ in MNIST and Caltech101. From Figure 2, we find that our basic FW tends to be worse than fully trained CD (the best of 5 random initializations) when only a small number of hidden units have been added, but outperforms CD when more hidden units (about 450 in both cases) are added. Meanwhile, CD initialized by FW outperforms CD using the best of 5 random initializations.

5 Conclusion

In this work, we propose a convex relaxation of the restricted Boltzmann machine with an infinite number of hidden units, whose MLE corresponds to a constrained convex program in a function space. We solve the program using Frank-Wolfe, which provides a sparse greedy solution that can be interpreted as inserting a single hidden unit at each iteration. Our new method allows us to easily identify the appropriate number of hidden units during the progress of learning, and can provide an advanced initialization strategy for other state-of-the-art training methods such as CD to achieve higher test likelihood than random initialization.

Acknowledgements

This work is sponsored in part by NSF grants IIS-1254071 and CCF-1331915. It is also funded in part by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program.

Appendix

Derivation of gradients. The functional gradient of $L(q, \vartheta)$ w.r.t. the density function $q(w)$ is

$$\nabla_q L(q, \vartheta) = -\frac{\lambda}{2}\|w\|^2 + \alpha \left[ \frac{1}{N}\sum_{n=1}^N \log\big(1 + \exp(w^\top v^n)\big) - \sum_v \frac{\exp\big(\alpha\, \mathbb{E}_{q(w)}[\log(1 + \exp(w^\top v))] + b^\top v\big)}{Z(q, \vartheta)} \log\big(1 + \exp(w^\top v)\big) \right]$$
$$= -\frac{\lambda}{2}\|w\|^2 + \alpha \left[ \frac{1}{N}\sum_{n=1}^N \log\big(1 + \exp(w^\top v^n)\big) - \sum_v p(v \mid q, \vartheta) \log\big(1 + \exp(w^\top v)\big) \right].$$

The gradient of $L(q, \vartheta)$ w.r.t. the temperature parameter $\alpha$ is

$$\nabla_\alpha L(q, \vartheta) = \frac{1}{N}\sum_{n=1}^N \mathbb{E}_{q(w)}\big[\log(1 + \exp(w^\top v^n))\big] - \sum_v p(v \mid q, \vartheta)\, \mathbb{E}_{q(w)}\big[\log(1 + \exp(w^\top v))\big].$$

Proof of Proposition 3.2. For any $0 < c \le 1$ and non-negative numbers $x_k$, $k = 1, \ldots, n$, we have the following classical inequalities:

$$\Big(\sum_k x_k\Big)^c \;\le\; \sum_k x_k^c \;\le\; n^{1-c} \Big(\sum_k x_k\Big)^c.$$

Let $x_1 = 1$ and $x_2 = \exp(w_i^\top v)$ (so $n = 2$); the proposition is a direct result of the above two inequalities.
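As a quick sanity check of the bound (our own verification snippet, not from the paper), the following samples random exponents and activations and confirms Proposition 3.2 numerically:

```python
import numpy as np

# Numerically verify Proposition 3.2:
#   (1 + exp(c*a)) / 2**(1 - c)  <=  (1 + exp(a))**c  <=  1 + exp(c*a)
# for 0 < c <= 1 and activations a = w_i^T v.
rng = np.random.default_rng(0)
tol = 1e-9  # small slack for floating-point rounding near c = 1
for _ in range(1000):
    c = rng.uniform(1e-3, 1.0)
    a = rng.normal(scale=5.0)
    mid = (1.0 + np.exp(a)) ** c
    upper = 1.0 + np.exp(c * a)
    lower = upper / 2.0 ** (1.0 - c)
    assert lower <= mid * (1 + tol) and mid <= upper * (1 + tol)
print("Proposition 3.2 bound holds on all sampled points.")
```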
References

Ö. Aslan, H. Cheng, X. Zhang, and D. Schuurmans. Convex two-layer modeling. In NIPS, 2013.
F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690, 2014.
D. Belanger, D. Sheldon, and A. McCallum. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, 2013.
Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS, 2005.
A. Beygelzimer, E. Hazan, S. Kale, and H. Luo. Online gradient boosting. In NIPS, 2015.
D. M. Bradley and J. A. Bagnell. Convex coding. In UAI, 2009.
K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 2010.
M.-A. Côté and H. Larochelle. An infinite restricted Boltzmann machine. Neural Computation, 2015.
M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001.
J. Guélat and P. Marcotte. Some comments on Wolfe's "away step". Mathematical Programming, 1986.
G. Hinton. A practical guide to training restricted Boltzmann machines. UTML TR, 2010.
G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
R. G. Krishnan, S. Lacoste-Julien, and D. Sontag. Barrier Frank-Wolfe for marginal inference. In NIPS, 2015.
A. Krizhevsky, G. E. Hinton, et al. Factored 3-way restricted Boltzmann machines for modeling natural images. In AISTATS, 2010.
S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
A. Likas, N. Vlassis, and J. J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 2003.
B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS, 2010.
V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
E. Nalisnick and S. Ravi. Infinite dimensional word embeddings. arXiv:1511.05392, 2015.
S. Nowozin and G. Bakir. A decoupled approach to exemplar-based unsupervised learning. In ICML, 2008.
P. Orbanz and Y. W. Teh. Bayesian nonparametric models. In Encyclopedia of Machine Learning. 2011.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML, 2008.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, 2007.
P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, DTIC Document, 1986.
G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
J. J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixture models. Neural Computation, 2003.
M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.