Wasserstein Training of Restricted Boltzmann Machines

Grégoire Montavon (Technische Universität Berlin, gregoire.montavon@tu-berlin.de)
Klaus-Robert Müller (Technische Universität Berlin, klaus-robert.mueller@tu-berlin.de; also with the Department of Brain and Cognitive Engineering, Korea University)
Marco Cuturi (CREST, ENSAE, Université Paris-Saclay, marco.cuturi@ensae.fr)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Boltzmann machines are able to learn highly complex, multimodal, structured and multiscale real-world data distributions. Parameters of the model are usually learned by minimizing the Kullback-Leibler (KL) divergence from training samples to the learned model. We propose in this work a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known. This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.

1 Introduction

Boltzmann machines [1] are powerful generative models that can be used to approximate a large class of real-world data distributions, such as handwritten characters [9], speech segments [7], or multimodal data [16]. Boltzmann machines share similarities with neural networks in their capability to extract features at multiple scales, and to build well-generalizing hierarchical data representations [15, 13]. The restricted Boltzmann machine (RBM) is a special type of Boltzmann machine composed of one layer of latent variables, and defining a probability distribution $p_\theta(x)$ over a set of $d$ binary observed variables whose state is represented by the binary vector $x \in \{0,1\}^d$, and with a parameter vector $\theta$ to be learned.

Given an empirical probability distribution $\hat p(x) = \frac{1}{N}\sum_{n=1}^N \delta_{x_n}(x)$, where $(x_n)_n$ is a list of $N$ observations in $\{0,1\}^d$, an RBM can be trained using information-theoretic divergences (see for example [12]) by minimizing with respect to $\theta$ a divergence $\Delta(\hat p, p_\theta)$ between the sample empirical measure $\hat p$ and the modeled distribution $p_\theta$. When $\Delta$ is for instance the KL divergence, this approach results in the well-known Maximum Likelihood Estimator (MLE), which yields gradients for $\theta$ of the form

$$\nabla_\theta \mathrm{KL}(\hat p \,\|\, p_\theta) = -\frac{1}{N}\sum_{n=1}^N \nabla_\theta \log p_\theta(x_n) = -\big\langle \nabla_\theta \log p_\theta(x) \big\rangle_{\hat p}, \qquad (1)$$

where the bracket notation $\langle \cdot \rangle_p$ indicates an expectation with respect to $p$. Alternative choices for $\Delta$ are the Bhattacharyya/Hellinger and Euclidean distances between distributions, or more generally F-divergences or M-estimators [10]. They all result in comparable gradient terms that try to adjust $\theta$ so that the fitting terms $p_\theta(x_n)$ grow as large as possible.

We explore in this work a different scenario: what if $\theta$ is chosen so that $p_\theta(x)$ is large, on average, when $x$ is close to a data point $x_n$ in some sense, but not necessarily when $x$ coincides exactly with $x_n$? To adopt such a geometric criterion, we must first define what closeness between observations means.
In almost all applications of Boltzmann machines, such a metric between observations is readily available: one can for example consider the Hamming distance between binary vectors, or any other metric motivated by practical considerations.² This being done, the geometric criterion we have drawn can be materialized by considering for $\Delta$ the Wasserstein distance [20] (a.k.a. the Kantorovich or the earth mover's distance [14]) between measures. This choice was considered in theory by [2], who proved its statistical consistency, but it has, to the best of our knowledge, never been used in practice. This paper describes a practical derivation for a minimum Kantorovich distance estimator [2] for Boltzmann machines, which can scale up to tens of thousands of observations. As will be described in this paper, recent advances in the fast approximation of Wasserstein distances [5] and their derivatives [6] play an important role in the practical implementation of these computations.

Before describing this approach in detail, we would like to stress that measuring goodness-of-fit with the Wasserstein distance results in a considerably different perspective than that provided by a Kullback-Leibler/MLE approach. This difference is illustrated in Figure 1, where a probability $p_\theta$ can be close from a KL perspective to a given empirical measure $\hat p$, but far from the same measure $\hat p$ in the Wasserstein sense. Conversely, a different probability $p_\theta$ can miss the mark from a KL viewpoint but achieve a low Wasserstein distance to $\hat p$. Before proceeding to the rest of this paper, let us mention that Wasserstein distances have a broad appeal for machine learning. That distance was for instance introduced in the context of supervised inference by [8], who used it to compute a predictive loss between the output of a multilabel classifier and its ground truth, and in the context of domain adaptation by [4].

Figure 1: Empirical distribution $\hat p(x)$ (gray) defined on the set of states $\{0,1\}^d$ with $d = 3$, superposed to two possible models of it defined on the same set of states. The size of the circles indicates the probability mass for each state. For each model, we show its KL and Wasserstein divergences from $\hat p(x)$, and an explanation of why the divergences are low or high: a large/small overlap with $\hat p(x)$, or a large/small distance from $\hat p(x)$.

2 Minimum Wasserstein Distance Estimation

Consider two probabilities $p, q$ in $\mathcal{P}(X)$, the set of probabilities on $X = \{0,1\}^d$; namely, two maps $p, q \in \mathbb{R}^X_+$ such that $\sum_x p(x) = \sum_x q(x) = 1$, where we omit $x \in X$ under the summation sign. Consider a cost function defined on $X \times X$, typically a distance $D : X \times X \to \mathbb{R}_+$. Given a constant $\gamma \ge 0$, the $\gamma$-smoothed Wasserstein distance [5] is equal to

$$W_\gamma(p, q) = \min_{\pi \in \Pi(p,q)} \langle D(x, x') \rangle_\pi - \gamma H(\pi), \qquad (2)$$

where $\Pi(p,q)$ is the set of joint probabilities $\pi$ on $X \times X$ such that $\sum_{x'} \pi(x, x') = p(x)$, $\sum_{x} \pi(x, x') = q(x')$, and $H(\pi) = -\sum_{x,x'} \pi(x, x') \log \pi(x, x')$ is the Shannon entropy of $\pi$. This optimization problem, a strictly convex program, has an equivalent dual formulation [6] which involves instead two real-valued functions $\alpha, \beta \in \mathbb{R}^X$ and which plays an important role in this paper:

$$W_\gamma(p, q) = \max_{\alpha, \beta \in \mathbb{R}^X} \langle \alpha(x) \rangle_p + \langle \beta(x') \rangle_q - \gamma \sum_{x,x'} e^{\frac{1}{\gamma}(\alpha(x) + \beta(x') - D(x, x')) - 1}. \qquad (3)$$

² When using the MLE principle, metric considerations play a key role to define densities $p_\theta$, e.g. the reliance of Gaussian densities on Euclidean distances. This is the kind of metric we take for granted in this work.
Smooth Wasserstein Distances
The true Wasserstein distance corresponds to the case $\gamma = 0$, that is, when Equation (2) is stripped of the entropic term. One can easily verify that this definition matches the usual linear program used to describe Wasserstein/EMD distances [14]. When $\gamma \to 0$ in Equation (3), one also recovers the Kantorovich dual formulation, because the rightmost regularizer converges to the indicator function of the feasible set of the dual optimal transport problem, $\alpha(x) + \beta(x') \le D(x, x')$. We consider in this paper the case $\gamma > 0$ because it was shown in [5] to considerably facilitate computations, and in [6] to result in a divergence $W_\gamma(p, q)$ which, unlike the case $\gamma = 0$, is differentiable w.r.t. its first variable. Looking at the dual formulation in Equation (3), one can see that this gradient is equal to $\alpha^\star$, the centered optimal dual variable (the centering step for $\alpha^\star$ ensures orthogonality with respect to the simplex constraint). Sensitivity analysis gives a clear interpretation to the quantity $\alpha^\star(x)$: it measures the cost for each unit of mass placed by $p$ at $x$ when computing the Wasserstein distance $W_\gamma(p, q)$. To decrease $W_\gamma(p, q)$, it might thus be favorable to transfer mass in $p$ from points where $\alpha^\star(x)$ is high and place it on points where $\alpha^\star(x)$ is low. This idea can be used, by a simple application of the chain rule, to minimize, given a fixed target probability $p$, the quantity $W_\gamma(p_\theta, p)$ with respect to $\theta$.

Proposition 1. Let $p_\theta(x) = \frac{1}{Z} e^{-F_\theta(x)}$ be a parameterized family of probability distributions, where $F_\theta(x)$ is a differentiable function of $\theta \in \Theta$, and write $G_\theta = \langle \nabla_\theta F_\theta(x) \rangle_{p_\theta}$. Let $\alpha^\star$ be the centered optimal dual solution of $W_\gamma(p_\theta, p)$ as described in Equation (3). The gradient of the smoothed Wasserstein distance with respect to $\theta$ is given by

$$\nabla_\theta W_\gamma(p_\theta, p) = \langle \alpha^\star(x) \rangle_{p_\theta} \, G_\theta - \langle \alpha^\star(x) \, \nabla_\theta F_\theta(x) \rangle_{p_\theta}. \qquad (4)$$

Proof. This result is a direct application of the chain rule: we have

$$\nabla_\theta W_\gamma(p_\theta, p) = \Big(\frac{\partial p_\theta}{\partial \theta}\Big)^{\!T} \, \frac{\partial W_\gamma(p_\theta, p)}{\partial p_\theta}.$$

As mentioned in [6], the rightmost term is the optimal dual variable (the Kantorovich potential): $\partial W_\gamma(p_\theta, p) / \partial p_\theta = \alpha^\star$. The Jacobian $(\partial p_\theta / \partial \theta)$ is a linear map between $\Theta$ and $\mathbb{R}^X$. For a given $x'$, $\partial p_\theta(x') / \partial \theta = p_\theta(x') \, G_\theta - \nabla_\theta F_\theta(x') \, p_\theta(x')$. As a consequence, $(\partial p_\theta / \partial \theta)^T \alpha^\star$ is the integral w.r.t. $x'$ of the term above multiplied by $\alpha^\star(x')$, which results in Equation (4).

Comparison with the KL Fitting Error
The target distribution $\hat p$ plays a direct role in the formation of the gradient of $\mathrm{KL}(\hat p \,\|\, p_\theta)$ w.r.t. $\theta$, through the data-dependent term $\langle \nabla_\theta F_\theta(x) \rangle_{\hat p}$ that arises from Equation (1). The Wasserstein gradient incorporates the knowledge of $\hat p$ in a different way, by considering, on the support of $p_\theta$ only, points $x$ that correspond to high potentials (costs) $\alpha^\star(x)$ when computing the distance of $p_\theta$ to $\hat p$. A high potential at $x$ means that the probability $p_\theta(x)$ should be lowered if one were to decrease $W_\gamma(p_\theta, \hat p)$, by varying $\theta$ accordingly.

Sampling Approximation
The gradient in Equation (4) is intractable, since it involves solving an optimal (smoothed) transport problem over probabilities defined on $2^d$ states. In practice, we replace expectations w.r.t. $p_\theta$ by an empirical distribution formed by sampling from the model $p_\theta$ (e.g. the PCD sample [18]). Given a sample $(\tilde x_n)_n$ of size $\tilde N$ generated by the model, we define $\hat p_\theta = \frac{1}{\tilde N} \sum_{n=1}^{\tilde N} \delta_{\tilde x_n}$. The tilde is used to differentiate the sample generated by the model from the empirical observations.
Because the dual potential $\alpha^\star$ is centered and $\hat p_\theta$ is a measure with uniform weights, $\langle \alpha^\star(x) \rangle_{\hat p_\theta} = 0$, which simplifies the approximation of the gradient to

$$\hat\nabla_\theta W_\gamma(p_\theta, \hat p) = \frac{1}{\tilde N} \sum_{n=1}^{\tilde N} \hat\alpha^\star(\tilde x_n) \, \nabla_\theta F_\theta(\tilde x_n), \qquad (5)$$

where $\hat\alpha^\star$ is the solution of the discrete smooth Wasserstein dual between the two empirical distributions $\hat p$ and $\hat p_\theta$, which have supports of size $N$ and $\tilde N$ respectively. In practical terms, $\hat\alpha^\star$ is a vector of size $\tilde N$, one coefficient for each PCD sample, which can be computed by following the algorithm below [6]. To keep notations simple, we describe it in terms of generic probabilities $p$ and $q$, having in mind that these are in practice the simulated and training empirical measures $\hat p_\theta$ and $\hat p$.

Computing α
When $\gamma > 0$, the optimal variable $\alpha^\star$ corresponding to $W_\gamma(p, q)$ can be recovered through the Sinkhorn algorithm, with a cost which grows as the product $|p| \cdot |q|$ of the sizes of the supports of $p$ and $q$, where $|p| = \sum_x 1_{p(x) > 0}$. The algorithm is well known, but we adapt it here to our setting; see [6, Alg. 3] for a more precise description. To ease notations, we consider an arbitrary ordering of $X$, a set of cardinality $2^d$, and identify its elements with indices $1 \le i \le 2^d$. Let $I = (i_1, \dots, i_{|p|})$ be the ordered family of indices in the set $\{i \mid p(i) > 0\}$ and define $J$ accordingly for $q$; $I$ and $J$ have respective lengths $|p|$ and $|q|$. Form the matrix $K = [e^{-D(i,j)/\gamma}]_{i \in I, j \in J}$ of size $|p| \times |q|$. Choose now two positive vectors $u \in \mathbb{R}^{|p|}_{++}$ and $v \in \mathbb{R}^{|q|}_{++}$ at random, and repeat until $u, v$ converge in some metric the operations $u \leftarrow p / (Kv)$, $v \leftarrow q / (K^T u)$. Upon convergence, the optimal variable $\alpha^\star$ is zero everywhere except at $\alpha^\star(i_a) = \gamma \log(u_a / \bar u)$ for $1 \le a \le |p|$, where $\bar u$ is the geometric mean of the vector $u$ (which ensures that $\alpha^\star$ is centered).

3 Application to Restricted Boltzmann Machines

The restricted Boltzmann machine (RBM) is a generative model of binary data that is composed of $d$ binary observed variables and $h$ binary explanatory variables. The vector $x \in \{0,1\}^d$ represents the state of the observed variables, and the vector $y \in \{0,1\}^h$ represents the state of the explanatory variables. The RBM associates to each configuration $x$ of observed variables a probability $p_\theta(x)$ defined as $p_\theta(x) = \sum_y e^{-E_\theta(x, y)} / Z_\theta$, where $E_\theta(x, y) = -a^T x - \sum_{j=1}^h y_j (w_j^T x + b_j)$ is called the energy and $Z_\theta = \sum_{x, y} e^{-E_\theta(x, y)}$ is the partition function that normalizes the probability distribution to one. The parameters $\theta = (a, \{w_j, b_j\}_{j=1}^h)$ of the RBM are learned from the data. Knowing the state $x$ of the observed variables, the explanatory variables are independent Bernoulli-distributed with $\Pr(y_j = 1 \mid x) = \sigma(w_j^T x + b_j)$, where $\sigma$ is the logistic map $z \mapsto (1 + e^{-z})^{-1}$. Conversely, knowing the state $y$ of the explanatory variables, the observed variables on which the probability distribution is defined can also be sampled independently, leading to an efficient alternate Gibbs sampling procedure for $p_\theta$. In this RBM model, the explanatory variables can be analytically marginalized, which yields the following probability model: $p_\theta(x) = e^{-F_\theta(x)} / Z'_\theta$, where $F_\theta(x) = -a^T x - \sum_{j=1}^h \log(1 + \exp(w_j^T x + b_j))$ is the free energy associated to this model and $Z'_\theta = \sum_x e^{-F_\theta(x)}$ is the partition function.

Wasserstein Gradient of the RBM
Having written the RBM in its free energy form, the Wasserstein gradient can be obtained by computing the gradient of $F_\theta(x)$ and injecting it in Equation (5):

$$\hat\nabla_{w_j} W_\gamma(\hat p, p_\theta) = -\big\langle \alpha^\star(x) \, \sigma(z_j) \, x \big\rangle_{\hat p_\theta},$$

where $z_j = w_j^T x + b_j$. Gradients with respect to the parameters $a$ and $\{b_j\}_j$ can be obtained by the same means.
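To make these two computational steps concrete, here is a minimal NumPy sketch (not the authors' code) of the Sinkhorn recovery of the centered dual potential $\hat\alpha^\star$ and of the sampled weight gradient of Equation (5). All function and variable names (sinkhorn_alpha, wasserstein_grad_W, the random data) are our own, and the Hamming cost is only roughly normalized here; Section 4.1 gives the exact choice.

```python
import numpy as np

def sinkhorn_alpha(p, q, D, gamma=0.1, n_iter=500):
    """Sinkhorn iterations for the gamma-smoothed Wasserstein dual.
    p, q : weight vectors of the two empirical measures (each sums to 1)
    D    : cost matrix of shape (len(p), len(q))
    Returns the centered dual potential alpha, one value per support point of p."""
    K = np.exp(-D / gamma)                      # K = [exp(-D(i, j) / gamma)]
    u, v = np.ones(len(p)), np.ones(len(q))
    for _ in range(n_iter):                     # u <- p / (K v),  v <- q / (K^T u)
        u = p / (K @ v)
        v = q / (K.T @ u)
    u_bar = np.exp(np.log(u).mean())            # geometric mean of u
    return gamma * np.log(u / u_bar)            # centered potential alpha*

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wasserstein_grad_W(alpha, X_model, W, b):
    """Sampled Wasserstein gradient of Eq. (5) for the RBM weight matrix.
    Since grad_{w_j} F_theta(x) = -sigma(w_j^T x + b_j) x, Eq. (5) gives
    -1/N_tilde * sum_n alpha_n * sigma(z_{nj}) * x_n for each row w_j."""
    S = sigmoid(X_model @ W.T + b)              # sigma(z_{nj}), shape (N_tilde, h)
    return -(S * alpha[:, None]).T @ X_model / len(alpha)

# Illustrative usage on random binary data (shapes only; not a trained RBM).
rng = np.random.default_rng(0)
d, h, N, N_tilde = 16, 8, 50, 50
X_data = rng.integers(0, 2, size=(N, d)).astype(float)         # training sample
X_model = rng.integers(0, 2, size=(N_tilde, d)).astype(float)   # PCD-like model sample
W, b = 0.01 * rng.standard_normal((h, d)), np.zeros(h)

H = (X_model[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
D = H / H.mean()                                # Hamming cost, roughly normalized (cf. Sec. 4.1)
alpha = sinkhorn_alpha(np.full(N_tilde, 1.0 / N_tilde), np.full(N, 1.0 / N), D)
grad_W = wasserstein_grad_W(alpha, X_model, W, b)   # shape (h, d)
```

In an actual training loop, X_model would be the PCD sample refreshed by alternate Gibbs sampling, and grad_W would enter the descent step on the regularized objective described next.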
In comparison, the gradient of the KL divergence is given by

$$\hat\nabla_{w_j} \mathrm{KL}(\hat p \,\|\, p_\theta) = \big\langle \sigma(z_j) \, x \big\rangle_{\hat p_\theta} - \big\langle \sigma(z_j) \, x \big\rangle_{\hat p}.$$

While the Wasserstein gradient can, like the KL gradient, be expressed in a very simple form, the former is not sum-decomposable. A simple manifestation of this non-decomposability occurs for $\tilde N = 1$ (the smallest possible sample size): in that case, $\hat\alpha^\star(\tilde x_n) = 0$ due to the centering constraint (see Section 2), thus making the gradient zero.

Stability and KL Regularization
Unlike the KL gradient, the Wasserstein gradient depends only indirectly on the data distribution $\hat p$. This is a problem when the sample $\hat p_\theta$ generated by the model strongly differs from the examples coming from $\hat p$, because there is no weighting $(\hat\alpha^\star(\tilde x_n))_n$ of the generated sample that can represent the desired direction in the parameter space $\Theta$. In that case, the Wasserstein gradient will point to a bad local minimum. Closeness between the two empirical samples can, from this optimization perspective, be ensured by adding a regularization term to the objective that incorporates both the usual quadratic containment term and a KL term, which forces proximity to $\hat p$ due to the direct dependence of its gradient on it. The optimization problem becomes

$$\min_{\theta \in \Theta} \; W_\gamma(\hat p, p_\theta) + \lambda \, \Omega(\theta) \quad \text{with} \quad \Omega(\theta) = \mathrm{KL}(\hat p \,\|\, p_\theta) + \eta \, \big( \|a\|^2 + \textstyle\sum_j \|w_j\|^2 \big),$$

starting at the point $\theta_0 = \arg\min_{\theta \in \Theta} \Omega(\theta)$, and where $\lambda, \eta$ are two regularization hyperparameters that must be selected. Determining the starting point $\theta_0$ is analogous to having an initial pretraining phase. Thus, the proposed Wasserstein procedure can also be seen as finetuning a standard RBM, while forcing the finetuning not to deviate too much from the pretrained solution.

4 Experiments

We perform several experiments that demonstrate that Wasserstein-trained RBMs learn distributions that are better from a metric perspective. First, we explore the main characteristics of a distribution learned by optimizing the Wasserstein objective. Then, we investigate the usefulness of the learned models on practical problems, such as data completion and denoising, where the metric between observations occurs in the performance evaluation. We consider three datasets: MNIST-small, a subsampled version of the original MNIST dataset [11] with only the digits "0" retained; a subset of the UCI PLANTS dataset [19], containing the geographical spread of plant species; and MNIST-code, 128-dimensional code vectors associated to each MNIST digit (additional details in the supplement).

4.1 Training, Validation and Evaluation

All RBM models that we investigate are trained in full batch mode, using for $\hat p_\theta$ the PCD approximation [18] of $p_\theta$, where the sample is refreshed at each gradient update by one step of alternate Gibbs sampling, starting from the sample at the previous time step. We choose a PCD sample of the same size as the training set ($\tilde N = N$). The coefficients $\hat\alpha^\star(\tilde x_1), \dots, \hat\alpha^\star(\tilde x_{\tilde N})$ occurring in the Wasserstein gradient are obtained by solving the smoothed Wasserstein dual between $\hat p$ and $\hat p_\theta$, with smoothing parameter $\gamma = 0.1$ and distance $D(x, x') = H(x, x') / \langle H(x, x') \rangle_{\hat p}$, where $H$ denotes the Hamming distance between two binary vectors. We use the centered parameterization of the RBM for gradient descent [13, 3]. We perform holdout validation on the quadratic containment coefficient $\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ and on the KL weighting coefficient $\lambda \in \{0, 10^{-1}, 10^0, 10^1, \infty\}$. The number of hidden units of the RBM is set heuristically to 400 for all datasets. The learning rate is set heuristically to $0.01\,\lambda^{-1}$ during the pretraining phase and modified to $0.01 \cdot \min(1, \lambda^{-1})$ when training on the final objective.
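As a companion to the previous sketch, the following hypothetical snippet spells out two ingredients of the setup just described: the normalized Hamming cost matrix and one descent step on the regularized objective $W_\gamma + \lambda\,\Omega$. Reading the normalization $\langle H \rangle_{\hat p}$ as the mean pairwise Hamming distance within the data sample is our assumption, as are the function names.

```python
import numpy as np

def hamming_cost(X_model, X_data):
    """Cost matrix D(x, x') = H(x, x') / <H(x, x')>_p_hat of Section 4.1.
    The normalization is taken here as the mean pairwise Hamming distance
    within the data sample (our interpretation)."""
    H = (X_model[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
    H_data = (X_data[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
    return H / H_data.mean()

def regularized_step(W, grad_W_wass, grad_W_kl, lam, eta, lr):
    """One descent step on W_gamma(p_hat, p_theta) + lam * Omega(theta), with
    Omega = KL(p_hat || p_theta) + eta * ||W||^2 (biases omitted for brevity).
    grad_W_wass and grad_W_kl are assumed to be computed as in the previous
    sketch and from the PCD statistics, respectively."""
    return W - lr * (grad_W_wass + lam * (grad_W_kl + 2.0 * eta * W))
```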
The Wasserstein distance $W_\gamma(\hat p_\theta, \hat p)$ is computed between the whole test distribution and the PCD sample at the end of the training procedure. This sample is a fast approximation of the true unbiased sample, which would otherwise have to be generated by annealing or enumeration of the states (see the supplement for a comparison of PCD and AIS samples).

4.2 Results and Analysis

The contour plots of Figure 2 show the effect of the hyperparameters $\lambda$ and $\eta$ on the Wasserstein distance. For $\lambda = \infty$, only the KL regularizer is active, which is equivalent to minimizing a standard RBM. As we reduce the amount of regularization, the Wasserstein distance becomes effectively minimized and thus smaller. If $\lambda$ is chosen too small, the Wasserstein distance increases again, for the stability reasons mentioned in Section 3. In all our experiments, we observed that KL pretraining was necessary in order to reach a low Wasserstein distance; not doing so leads to degenerate solutions. The relation between hyperparameters and minimization criteria is consistent across datasets: in all cases, the Wasserstein RBM produces a lower Wasserstein distance than a standard RBM.

Figure 2: Wasserstein distance as a function of the hyperparameters $\lambda$ and $\eta$ (one panel per dataset: MNIST-small, PLANTS, MNIST-code). The best RBMs in the Wasserstein sense (RBM-W) are shown in red. The best RBMs in the standard sense (i.e. with $\lambda$ forced to $+\infty$, and minimum KL) are shown in blue.

Samples generated by the standard RBM and the Wasserstein RBM (more precisely, their PCD approximation) are shown in Figure 3. The RBM-W produces a reduced set of clean prototypical examples, with less noise than those produced by a regular RBM. All zeros generated by the RBM-W have well-defined contours and a round shape, but do not reproduce the variety of shapes present in the data. Similarly, the plant territorial spreads generated by the RBM-W form compact and contiguous regions that are prototypical of real spreads, but are less diverse than the data or the sample generated by the standard RBM. Finally, the RBM-W generates codes that, when decoded, are closer to actual MNIST digits.

Figure 3: Examples generated by the standard and the Wasserstein RBMs on MNIST-small, PLANTS and MNIST-code. (Images for the PLANTS dataset are automatically generated from the Wikimedia Commons template https://commons.wikimedia.org/wiki/File:Blank_Map-USA-states-Canada-provinces.svg created by user Lokal_Profil.) Images for MNIST-code are produced by the decoders shown on the right.

The PCA plots of Figure 4 superimpose on the true data distribution (in gray) the distributions generated by the standard RBM (in blue) and the Wasserstein RBM (in red). In particular, the plots show the projected distributions on the first two PCA components of the true distribution. While the standard RBM distribution uniformly covers the data, the one generated by the RBM-W consists of a finite set of small dense clusters that are scattered across the input distribution. In other words, the Wasserstein model is biased towards these clusters, and systematically ignores other regions.
Figure 4: Top: Two-dimensional PCA comparison of the distributions learned by the RBM and the RBM-W with smoothing parameter $\gamma = 0.1$, for MNIST-small, PLANTS and MNIST-code. Plots are obtained by projecting the learned distributions on the first two components of the true distribution. Bottom: RBM-W distributions obtained by varying the parameter $\gamma$ from small to large.

At the bottom of Figure 4, we analyze the effect of the Wasserstein smoothing parameter $\gamma$ on the learned distribution, with $\gamma = 0.025, 0.05, 0.1, 0.2, 0.4$. We observe on all datasets that the stronger the smoothing, the stronger the shrinkage effect. Although the KL-generated distributions shown in blue may look better (the red distribution strongly departs visually from the data distribution), the red distribution is actually superior if the smooth Wasserstein distance is taken as the performance metric, as shown in Figure 2.

4.3 Validating the Shrinkage Effect

To verify that the shrinkage effect observed in Figure 4 is not a training artefact, but a truly expected property of the modeled distribution, we analyze this effect for a simple distribution for which the parameter space can be enumerated. Figure 5 plots the Wasserstein distance between samples of size 100 of a 10-dimensional Gaussian distribution $p \sim N(0, I)$ and of a parameterized model of that distribution $p_\theta \sim N(0, \theta^2 I)$, where $\theta \in [0, 1]$. The parameter $\theta$ can be interpreted as a shrinkage parameter. The Wasserstein distance is computed using the cityblock or the euclidean metric, both rescaled such that the expected distance between pairs of points is 1.

Figure 5: Wasserstein distance $W_\gamma(\hat p_\theta, \hat p)$ between a sample $\hat p \sim N(0, I)$ and a sample $\hat p_\theta \sim N(0, \theta^2 I)$, for model parameters $\theta \in [0, 1]$, smoothing $\gamma \in \{0, 0.01, 0.03, 0.1, 0.32, 1\}$, and the cityblock or the euclidean metric.

Interestingly, for all choices of the Wasserstein smoothing parameter $\gamma$, and even for the true Wasserstein distance ($\gamma = 0$, computed here using the OpenCV library), the best model $p_\theta$ in the empirical Wasserstein sense is a shrunken version of $p$ (i.e. with $\theta < 1$). When the smoothing is strong enough, the best parameter becomes $\theta = 0$ (i.e. a Dirac distribution located at the origin). Overall, this experiment gives a training-independent validation for our observation that Wasserstein RBMs learn shrunken, cluster-like distributions. Note that the finite sample size prevents the Wasserstein distance from reaching zero, and always favors shrunken models.
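The shrinkage experiment of Figure 5 can be approximately reproduced with a few lines of NumPy/SciPy. The sketch below is our own and simplifies the protocol: it reports only the transport cost of the smoothed coupling (the $-\gamma H(\pi)$ term of Equation (2) is dropped), uses the cityblock metric and a sample size of 100, and rescales distances by the mean pairwise distance of the data sample.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sinkhorn_cost(D, gamma, n_iter=2000):
    """Transport cost <D>_pi of the entropy-smoothed coupling between two
    uniform empirical measures with cost matrix D (Sinkhorn fixed point);
    the -gamma*H(pi) term of Eq. (2) is omitted here for brevity."""
    n, m = D.shape
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-D / gamma)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    pi = u[:, None] * K * v[None, :]             # smoothed optimal coupling
    return (pi * D).sum()

rng = np.random.default_rng(1)
d, n, gamma = 10, 100, 0.1                        # dimension, sample size, smoothing (as in Fig. 5)
X = rng.standard_normal((n, d))                   # sample from N(0, I)
base = rng.standard_normal((n, d))

scale = cdist(X, X, metric="cityblock").mean()    # rough rescaling so typical distances are ~1
thetas = np.linspace(0.0, 1.0, 21)
dists = []
for theta in thetas:
    Y = theta * base                              # sample from N(0, theta^2 I)
    D = cdist(Y, X, metric="cityblock") / scale
    dists.append(sinkhorn_cost(D, gamma))

print("best theta:", thetas[int(np.argmin(dists))])   # typically < 1, i.e. a shrunken model wins
```

Rerunning the loop for larger $\gamma$ should show the minimizer drifting towards $\theta = 0$, consistent with the behaviour reported in Figure 5.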
4.4 Data Completion and Denoising

In order to demonstrate the practical relevance of Wasserstein distance minimization, we apply the learned models to the tasks of data completion and data denoising, for which the use of a metric is crucial: completion and denoising performance is generally measured in terms of a distance between the true data and the completed or denoised data (e.g. the Euclidean distance for real-valued data, or the Hamming distance $H$ for binary data). Remotely located probability mass that may result from simple KL minimization would incur a severe penalty on the completion and denoising performance metric. Both tasks have useful practical applications: data completion can be used as a first step when applying discriminative learning (e.g. neural networks or SVMs) to data with missing features, and data denoising can be used as a dimensionality reduction step before training a supervised model. Let the input $x = [v, h]$ be composed of $d - k$ visible variables $v$ and $k$ hidden variables $h$.

Data Completion
The setting of the data completion experiment is illustrated in Figure 6 (top). The distribution $p_\theta(x \mid v)$ over possible reconstructions can be sampled from using an alternate Gibbs sampler, or by enumeration. The expected Hamming distance between the true state $x$ and the reconstructed state modeled by the distribution $p_\theta(x \mid v)$ is obtained by iterating over the $2^k$ possible reconstructions:

$$E = \sum_{x'} p_\theta(x' \mid v) \, H(x, x'),$$

where the sum runs over the reconstructions $x' = [v, h']$ with $h' \in \{0, 1\}^k$. Since the reconstruction is a probability distribution, we can compute not only the expected Hamming error, but also its bias-variance decomposition. On MNIST-small, we hide randomly located image patches of size $3 \times 3$ (i.e. $k = 9$). On PLANTS and MNIST-code, we hide random subsets of $k = 9$ variables. Results are shown in Figure 7 (left), where we compare three types of models: kernel density estimation (KDE), the standard RBM (RBM) and the Wasserstein RBM (RBM-W). The KDE model uses a Gaussian kernel, with the Gaussian scale parameter chosen such that the KL divergence of the model from the validation data is minimized. The RBM-W is better than or comparable to the other models. Of particular interest is the structure of the expected Hamming error: for the standard RBM, a large part of the error comes from the variance (or entropy), while for the Wasserstein RBM, the bias term contributes the most. This can be related to what is observed in Figure 4: for a data point outside the area covered by the red points, the reconstruction is systematically redirected towards the nearest red cluster, thus incurring a systematic bias.

Figure 6: Illustration of the completion and denoising setups. For each image, we select a known subset of pixels, which we hide (completion) or corrupt with noise (denoising). Each possible reconstruction has a particular Hamming distance to the original example. The expected Hamming error is computed by weighting the Hamming distances by the probability that the model assigns to the reconstructions.

Figure 7: Performance on the completion and denoising tasks of kernel density estimation, the standard RBM and the Wasserstein RBM on MNIST-small, PLANTS and MNIST-code. The total length of the bars is the expected Hamming error. Dark gray and light gray sections of the bars give the bias-variance decomposition.

Data Denoising
Here, we consider a simple noise process in which, for a predefined subset of $k$ variables denoted by $h$, a known number $l$ of bit flips occur randomly; the remaining $d - k$ variables are denoted by $v$. The setting of the experiment is illustrated in Figure 6 (bottom). Calling $x$ the original example and $\tilde x$ its noisy version resulting from flipping $l$ variables of $h$, the expected Hamming error is obtained by iterating over the $\binom{k}{l}$ states $x'$ that have the same visible variables $v$ and that are at distance $l$ from $\tilde x$:

$$E = \sum_{x'} p_\theta\big(x' \mid v, \, H(x', \tilde x) = l\big) \, H(x, x'),$$

where again $x' = [v, h']$ with $h' \in \{0, 1\}^k$. Note that the original example $x$ is necessarily part of this set of states under the noise model assumption.
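Both expectations can be computed by direct enumeration, since $k = 9$ hidden pixels give only $2^9 = 512$ candidate reconstructions. The sketch below is our own illustration of the completion case, with a randomly initialized RBM standing in for a trained one; each reconstruction is weighted by the conditional $p_\theta(x' \mid v) \propto e^{-F_\theta(x')}$, in which the partition function cancels. The denoising case is identical except that the enumeration is restricted to the $\binom{k}{l}$ candidates at Hamming distance $l$ from the noisy input.

```python
import itertools
import numpy as np

def free_energy(x, a, W, b):
    """RBM free energy F_theta(x) = -a^T x - sum_j log(1 + exp(w_j^T x + b_j))."""
    return -x @ a - np.logaddexp(0.0, W @ x + b).sum()

def expected_hamming_completion(x_true, hidden_idx, a, W, b):
    """Expected Hamming error for completion: enumerate the 2^k reconstructions
    x' = [v, h'] that agree with x_true outside hidden_idx, weight each by
    p_theta(x'|v) proportional to exp(-F_theta(x')) (the partition function
    cancels), and average the Hamming distances H(x, x')."""
    k = len(hidden_idx)
    recs, neg_free = [], []
    for bits in itertools.product([0.0, 1.0], repeat=k):
        x_rec = x_true.copy()
        x_rec[hidden_idx] = bits                         # candidate reconstruction
        recs.append(x_rec)
        neg_free.append(-free_energy(x_rec, a, W, b))
    neg_free = np.array(neg_free)
    w = np.exp(neg_free - neg_free.max())                # softmax over -F_theta
    w /= w.sum()                                         # p_theta(x' | v)
    ham = np.array([(r != x_true).sum() for r in recs])  # H(x, x')
    return float((w * ham).sum())                        # E = sum_x' p(x'|v) H(x, x')

# Illustrative usage on a random (untrained) RBM, with k = 9 hidden pixels as in the paper.
rng = np.random.default_rng(0)
d, h, k = 64, 16, 9
a = 0.1 * rng.standard_normal(d)
W = 0.1 * rng.standard_normal((h, d))
b = np.zeros(h)
x_true = rng.integers(0, 2, size=d).astype(float)
hidden_idx = rng.choice(d, size=k, replace=False)
print(expected_hamming_completion(x_true, hidden_idx, a, W, b))
```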
For the MNIST-small data, we choose randomly located image patches of size $4 \times 3$ or $3 \times 4$ (i.e. $k = 12$), and generate $l = 4$ random bit flips within the selected patch. For PLANTS and MNIST-code, we generate $l = 4$ bit flips in $k = 12$ randomly preselected input variables. Figure 7 (right) shows the denoising error in terms of expected Hamming distance on the same datasets. The RBM-W is better than or comparable to the other models. As for the completion task, the main difference between the two RBMs is the bias/variance ratio, where again the Wasserstein RBM tends to have a larger bias. This experiment considered a very simple noise model consisting of a fixed number $l$ of random bit flips over a small predefined subset of variables. Denoising highly corrupted complex data will however require combining Wasserstein models with more flexible noise models, such as the ones proposed by [17].

5 Conclusion

We have introduced a new objective for restricted Boltzmann machines (RBMs) based on the smooth Wasserstein distance. We derived the gradient of the Wasserstein distance from its dual formulation, and used it to effectively train an RBM. Unlike the usual Kullback-Leibler (KL) divergence, our Wasserstein objective takes into account the metric of the data. In all considered scenarios, the Wasserstein RBM produced distributions that strongly departed from those of standard RBMs, and outperformed them on practical tasks such as completion or denoising. More generally, we demonstrated empirically that, when learning probability densities, the reliance on distributions that indirectly incorporate the desired metric can be replaced by training procedures that make the desired metric directly part of the learning objective. Thus, Wasserstein training can be seen as a more direct approach to density estimation than regularized KL training. Future work will aim to further explore the interface between Boltzmann learning and Wasserstein minimization, with the aim of scaling the newly proposed learning technique to larger and more complex data distributions.

Acknowledgements
This work was supported by the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education. This work was also supported by the DFG grant MU 987/17-1. M. Cuturi gratefully acknowledges the support of JSPS young researcher A grant 26700002. Correspondence to GM, KRM and MC.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[2] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
[3] K. Cho, T. Raiko, and A. Ilin. Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3):805–831, 2013.
[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[6] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 685–693, 2014.
[7] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton.
Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 469–477, 2010.
[8] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pages 2044–2052, 2015.
[9] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[10] P. J. Huber. Robust Statistics. Springer, 2011.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[12] B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 509–516, 2010.
[13] G. Montavon and K.-R. Müller. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, Second Edition, LNCS, pages 621–637. Springer, 2012.
[14] Y. Rubner, L. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA Image Understanding Workshop, pages 661–668, 1997.
[15] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 448–455, 2009.
[16] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.
[17] Y. Tang, R. Salakhutdinov, and G. E. Hinton. Robust Boltzmann machines for recognition and denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2271, 2012.
[18] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 1064–1071, 2008.
[19] United States Department of Agriculture. The PLANTS Database, 2012.
[20] C. Villani. Optimal Transport: Old and New, volume 338. Springer-Verlag, 2009.