Wasserstein Training of Restricted Boltzmann Machines

Grégoire Montavon (Technische Universität Berlin, gregoire.montavon@tu-berlin.de)
Klaus-Robert Müller (Technische Universität Berlin, klaus-robert.mueller@tu-berlin.de; also with the Department of Brain and Cognitive Engineering, Korea University)
Marco Cuturi (CREST, ENSAE, Université Paris-Saclay, marco.cuturi@ensae.fr)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Boltzmann machines are able to learn highly complex, multimodal, structured and multiscale real-world data distributions. Parameters of the model are usually learned by minimizing the Kullback-Leibler (KL) divergence from training samples to the learned model. We propose in this work a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known. This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.

1 Introduction

Boltzmann machines [1] are powerful generative models that can be used to approximate a large class of real-world data distributions, such as handwritten characters [9], speech segments [7], or multimodal data [16]. Boltzmann machines share similarities with neural networks in their capability to extract features at multiple scales, and to build well-generalizing hierarchical data representations [15, 13]. The restricted Boltzmann machine (RBM) is a special type of Boltzmann machine composed of one layer of latent variables, and defining a probability distribution $p_\theta(x)$ over a set of $d$ binary observed variables whose state is represented by the binary vector $x \in \{0,1\}^d$, and with a parameter vector $\theta$ to be learned.

Given an empirical probability distribution $\hat p(x) = \frac{1}{N}\sum_{n=1}^N \delta_{x_n}(x)$, where $(x_n)_n$ is a list of $N$ observations in $\{0,1\}^d$, an RBM can be trained using information-theoretic divergences (see for example [12]) by minimizing with respect to $\theta$ a divergence $\Delta(\hat p, p_\theta)$ between the sample empirical measure $\hat p$ and the modeled distribution $p_\theta$. When $\Delta$ is for instance the KL divergence, this approach results in the well-known Maximum Likelihood Estimator (MLE), which yields gradients for $\theta$ of the form

$$\nabla_\theta \mathrm{KL}(\hat p \,\|\, p_\theta) = -\frac{1}{N}\sum_{n=1}^N \nabla_\theta \log p_\theta(x_n) = -\big\langle \nabla_\theta \log p_\theta(x) \big\rangle_{\hat p}, \qquad (1)$$

where the bracket notation $\langle \cdot \rangle_p$ indicates an expectation with respect to $p$. Alternative choices for $\Delta$ are the Bhattacharyya/Hellinger and Euclidean distances between distributions, or more generally F-divergences or M-estimators [10]. They all result in comparable gradient terms that try to adjust $\theta$ so that the fitting terms $p_\theta(x_n)$ grow as large as possible.

We explore in this work a different scenario: what if $\theta$ is chosen so that $p_\theta(x)$ is large, on average, when $x$ is close to a data point $x_n$ in some sense, but not necessarily when $x$ coincides exactly with $x_n$? To adopt such a geometric criterion, we must first define what closeness between observations means.
In almost all applications of Boltzmann machines, such a metric between observations is readily available: one can for example consider the Hamming distance between binary vectors, or any other metric motivated by practical considerations.² This being done, the geometric criterion we have drawn can be materialized by considering for $\Delta$ the Wasserstein distance [20] (a.k.a. the Kantorovich or the earth mover's distance [14]) between measures. This choice was considered in theory by [2], who proved its statistical consistency, but it has, to the best of our knowledge, never been used in practice. This paper describes a practical derivation for a minimum Kantorovich distance estimator [2] for Boltzmann machines, which can scale up to tens of thousands of observations. As will be described in this paper, recent advances in the fast approximation of Wasserstein distances [5] and their derivatives [6] play an important role in the practical implementation of these computations.

Before describing this approach in detail, we would like to stress that measuring goodness-of-fit with the Wasserstein distance results in a considerably different perspective than that provided by a Kullback-Leibler/MLE approach. This difference is illustrated in Figure 1, where a probability $p_\theta$ can be close from a KL perspective to a given empirical measure $\hat p$, but far from the same measure $\hat p$ in the Wasserstein sense. Conversely, a different probability $p_\theta$ can miss the mark from a KL viewpoint but achieve a low Wasserstein distance to $\hat p$. Before proceeding to the rest of this paper, let us mention that Wasserstein distances have a broad appeal for machine learning. That distance was for instance introduced in the context of supervised inference by [8], who used it to compute a predictive loss between the output of a multilabel classifier and its ground truth, and in the context of domain adaptation by [4].

Figure 1: Empirical distribution $\hat p(x)$ (gray) defined on the set of states $\{0,1\}^d$ with $d = 3$, superposed to two possible models of it defined on the same set of states. The size of the circles indicates the probability mass for each state. For each model, we show its KL and Wasserstein divergences from $\hat p(x)$, and an explanation of why the divergences are low or high: a large/small overlap with $\hat p(x)$, or a large/small distance from $\hat p(x)$.

2 Minimum Wasserstein Distance Estimation

Consider two probabilities $p, q$ in $\mathcal{P}(X)$, the set of probabilities on $X = \{0,1\}^d$; namely, two maps $p, q \in \mathbb{R}^X_+$ such that $\sum_x p(x) = \sum_x q(x) = 1$, where we omit $x \in X$ under the summation sign. Consider a cost function defined on $X \times X$, typically a distance $D : X \times X \to \mathbb{R}_+$. Given a constant $\gamma \ge 0$, the $\gamma$-smoothed Wasserstein distance [5] is equal to

$$W_\gamma(p, q) = \min_{\pi \in \Pi(p,q)} \langle D(x, x') \rangle_\pi - \gamma H(\pi), \qquad (2)$$

where $\Pi(p,q)$ is the set of joint probabilities $\pi$ on $X \times X$ such that $\sum_{x'} \pi(x, x') = p(x)$, $\sum_{x} \pi(x, x') = q(x')$, and $H(\pi) = -\sum_{x,x'} \pi(x, x') \log \pi(x, x')$ is the Shannon entropy of $\pi$. This optimization problem, a strictly convex program, has an equivalent dual formulation [6] which involves instead two real-valued functions $\alpha, \beta \in \mathbb{R}^X$ and which plays an important role in this paper:

$$W_\gamma(p, q) = \max_{\alpha, \beta \in \mathbb{R}^X} \langle \alpha(x) \rangle_p + \langle \beta(x') \rangle_q - \gamma \sum_{x,x'} e^{\frac{1}{\gamma}(\alpha(x) + \beta(x') - D(x, x')) - 1}. \qquad (3)$$

² When using the MLE principle, metric considerations play a key role to define densities $p_\theta$, e.g. the reliance of Gaussian densities on Euclidean distances. This is the kind of metric we take for granted in this work.
Smooth Wasserstein Distances
The true Wasserstein distance corresponds to the case $\gamma = 0$, that is, when Equation (2) is stripped of the entropic term. One can easily verify that this definition matches the usual linear program used to describe Wasserstein/EMD distances [14]. When $\gamma \to 0$ in Equation (3), one also recovers the Kantorovich dual formulation, because the rightmost regularizer converges to the indicator function of the feasible set of the dual optimal transport problem, $\alpha(x) + \beta(x') \le D(x, x')$. We consider in this paper the case $\gamma > 0$ because it was shown in [5] to considerably facilitate computations, and in [6] to result in a divergence $W_\gamma(p, q)$ which, unlike the case $\gamma = 0$, is differentiable w.r.t. its first variable. Looking at the dual formulation in Equation (3), one can see that this gradient is equal to $\alpha^\star$, the centered optimal dual variable (the centering step for $\alpha^\star$ ensures orthogonality with respect to the simplex constraint). Sensitivity analysis gives a clear interpretation to the quantity $\alpha^\star(x)$: it measures the cost for each unit of mass placed by $p$ at $x$ when computing the Wasserstein distance $W_\gamma(p, q)$. To decrease $W_\gamma(p, q)$, it might thus be favorable to transfer mass in $p$ from points where $\alpha^\star(x)$ is high and place it on points where $\alpha^\star(x)$ is low. This idea can be used, by a simple application of the chain rule, to minimize, given a fixed target probability $p$, the quantity $W_\gamma(p_\theta, p)$ with respect to $\theta$.

Proposition 1. Let $p_\theta(x) = \frac{1}{Z} e^{-F_\theta(x)}$ be a parameterized family of probability distributions, where $F_\theta(x)$ is a differentiable function of $\theta \in \Theta$, and write $G_\theta = \langle \nabla_\theta F_\theta(x) \rangle_{p_\theta}$. Let $\alpha^\star$ be the centered optimal dual solution of $W_\gamma(p_\theta, p)$ as described in Equation (3). The gradient of the smoothed Wasserstein distance with respect to $\theta$ is given by

$$\nabla_\theta W_\gamma(p_\theta, p) = \langle \alpha^\star(x) \rangle_{p_\theta} \, G_\theta - \langle \alpha^\star(x) \, \nabla_\theta F_\theta(x) \rangle_{p_\theta}. \qquad (4)$$

Proof. This result is a direct application of the chain rule: we have

$$\nabla_\theta W_\gamma(p_\theta, p) = \Big(\frac{\partial p_\theta}{\partial \theta}\Big)^{\!T} \, \frac{\partial W_\gamma(p_\theta, p)}{\partial p_\theta}.$$

As mentioned in [6], the rightmost term is the optimal dual variable (the Kantorovich potential): $\partial W_\gamma(p_\theta, p) / \partial p_\theta = \alpha^\star$. The Jacobian $(\partial p_\theta / \partial \theta)$ is a linear map between $\Theta$ and $\mathbb{R}^X$. For a given $x'$, $\partial p_\theta(x') / \partial \theta = p_\theta(x') \, G_\theta - \nabla_\theta F_\theta(x') \, p_\theta(x')$. As a consequence, $(\partial p_\theta / \partial \theta)^T \alpha^\star$ is the integral w.r.t. $x'$ of the term above multiplied by $\alpha^\star(x')$, which results in Equation (4).

Comparison with the KL Fitting Error
The target distribution $\hat p$ plays a direct role in the formation of the gradient of $\mathrm{KL}(\hat p \,\|\, p_\theta)$ w.r.t. $\theta$, through the data-dependent term $\langle \nabla_\theta F_\theta(x) \rangle_{\hat p}$ that arises from Equation (1). The Wasserstein gradient incorporates the knowledge of $\hat p$ in a different way, by considering, on the support of $p_\theta$ only, points $x$ that correspond to high potentials (costs) $\alpha^\star(x)$ when computing the distance of $p_\theta$ to $\hat p$. A high potential at $x$ means that the probability $p_\theta(x)$ should be lowered if one were to decrease $W_\gamma(p_\theta, \hat p)$, by varying $\theta$ accordingly.

Sampling Approximation
The gradient in Equation (4) is intractable, since it involves solving an optimal (smoothed) transport problem over probabilities defined on $2^d$ states. In practice, we replace expectations w.r.t. $p_\theta$ by an empirical distribution formed by sampling from the model $p_\theta$ (e.g. the PCD sample [18]). Given a sample $(\tilde x_n)_n$ of size $\tilde N$ generated by the model, we define $\hat p_\theta = \frac{1}{\tilde N} \sum_{n=1}^{\tilde N} \delta_{\tilde x_n}$. The tilde is used to differentiate the sample generated by the model from the empirical observations.
Because the dual potential $\alpha^\star$ is centered and $\hat p_\theta$ is a measure with uniform weights, $\langle \alpha^\star(x) \rangle_{\hat p_\theta} = 0$, which simplifies the approximation of the gradient to

$$\hat\nabla_\theta W_\gamma(p_\theta, \hat p) = \frac{1}{\tilde N} \sum_{n=1}^{\tilde N} \hat\alpha^\star(\tilde x_n) \, \nabla_\theta F_\theta(\tilde x_n), \qquad (5)$$

where $\hat\alpha^\star$ is the solution of the discrete smooth Wasserstein dual between the two empirical distributions $\hat p$ and $\hat p_\theta$, which have supports of size $N$ and $\tilde N$ respectively. In practical terms, $\hat\alpha^\star$ is a vector of size $\tilde N$, one coefficient for each PCD sample, which can be computed by following the algorithm below [6]. To keep notations simple, we describe it in terms of generic probabilities $p$ and $q$, having in mind that these are in practice the simulated and training empirical measures $\hat p_\theta$ and $\hat p$.

Computing α
When $\gamma > 0$, the optimal variable $\alpha^\star$ corresponding to $W_\gamma(p, q)$ can be recovered through the Sinkhorn algorithm, with a cost which grows as the product $|p| \cdot |q|$ of the sizes of the supports of $p$ and $q$, where $|p| = \sum_x 1_{p(x) > 0}$. The algorithm is well known, but we adapt it here to our setting; see [6, Alg. 3] for a more precise description. To ease notations, we consider an arbitrary ordering of $X$, a set of cardinality $2^d$, and identify its elements with indices $1 \le i \le 2^d$. Let $I = (i_1, \dots, i_{|p|})$ be the ordered family of indices in the set $\{i \mid p(i) > 0\}$ and define $J$ accordingly for $q$; $I$ and $J$ have respective lengths $|p|$ and $|q|$. Form the matrix $K = [e^{-D(i,j)/\gamma}]_{i \in I, j \in J}$ of size $|p| \times |q|$. Choose now two positive vectors $u \in \mathbb{R}^{|p|}_{++}$ and $v \in \mathbb{R}^{|q|}_{++}$ at random, and repeat until $u, v$ converge in some metric the operations $u \leftarrow p / (Kv)$, $v \leftarrow q / (K^T u)$. Upon convergence, the optimal variable $\alpha^\star$ is zero everywhere except at $\alpha^\star(i_a) = \gamma \log(u_a / \bar u)$ for $1 \le a \le |p|$, where $\bar u$ is the geometric mean of the vector $u$ (which ensures that $\alpha^\star$ is centered).

3 Application to Restricted Boltzmann Machines

The restricted Boltzmann machine (RBM) is a generative model of binary data that is composed of $d$ binary observed variables and $h$ binary explanatory variables. The vector $x \in \{0,1\}^d$ represents the state of the observed variables, and the vector $y \in \{0,1\}^h$ represents the state of the explanatory variables. The RBM associates to each configuration $x$ of observed variables a probability $p_\theta(x)$ defined as $p_\theta(x) = \sum_y e^{-E_\theta(x, y)} / Z_\theta$, where $E_\theta(x, y) = -a^T x - \sum_{j=1}^h y_j (w_j^T x + b_j)$ is called the energy and $Z_\theta = \sum_{x, y} e^{-E_\theta(x, y)}$ is the partition function that normalizes the probability distribution to one. The parameters $\theta = (a, \{w_j, b_j\}_{j=1}^h)$ of the RBM are learned from the data. Knowing the state $x$ of the observed variables, the explanatory variables are independent Bernoulli-distributed with $\Pr(y_j = 1 \mid x) = \sigma(w_j^T x + b_j)$, where $\sigma$ is the logistic map $z \mapsto (1 + e^{-z})^{-1}$. Conversely, knowing the state $y$ of the explanatory variables, the observed variables on which the probability distribution is defined can also be sampled independently, leading to an efficient alternate Gibbs sampling procedure for $p_\theta$. In this RBM model, the explanatory variables can be analytically marginalized, which yields the following probability model: $p_\theta(x) = e^{-F_\theta(x)} / Z'_\theta$, where $F_\theta(x) = -a^T x - \sum_{j=1}^h \log(1 + \exp(w_j^T x + b_j))$ is the free energy associated to this model and $Z'_\theta = \sum_x e^{-F_\theta(x)}$ is the partition function.

Wasserstein Gradient of the RBM
Having written the RBM in its free energy form, the Wasserstein gradient can be obtained by computing the gradient of $F_\theta(x)$ and injecting it in Equation (5):

$$\hat\nabla_{w_j} W_\gamma(\hat p, p_\theta) = -\big\langle \alpha^\star(x) \, \sigma(z_j) \, x \big\rangle_{\hat p_\theta},$$

where $z_j = w_j^T x + b_j$. Gradients with respect to the parameters $a$ and $\{b_j\}_j$ can be obtained by the same means.
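To make these two computational steps concrete, here is a minimal NumPy sketch (not the authors' code) of the Sinkhorn recovery of the centered dual potential $\hat\alpha^\star$ and of the sampled weight gradient of Equation (5). All function and variable names (sinkhorn_alpha, wasserstein_grad_W, the random data) are our own, and the Hamming cost is only roughly normalized here; Section 4.1 gives the exact choice.

```python
import numpy as np

def sinkhorn_alpha(p, q, D, gamma=0.1, n_iter=500):
    """Sinkhorn iterations for the gamma-smoothed Wasserstein dual.
    p, q : weight vectors of the two empirical measures (each sums to 1)
    D    : cost matrix of shape (len(p), len(q))
    Returns the centered dual potential alpha, one value per support point of p."""
    K = np.exp(-D / gamma)                      # K = [exp(-D(i, j) / gamma)]
    u, v = np.ones(len(p)), np.ones(len(q))
    for _ in range(n_iter):                     # u <- p / (K v),  v <- q / (K^T u)
        u = p / (K @ v)
        v = q / (K.T @ u)
    u_bar = np.exp(np.log(u).mean())            # geometric mean of u
    return gamma * np.log(u / u_bar)            # centered potential alpha*

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wasserstein_grad_W(alpha, X_model, W, b):
    """Sampled Wasserstein gradient of Eq. (5) for the RBM weight matrix.
    Since grad_{w_j} F_theta(x) = -sigma(w_j^T x + b_j) x, Eq. (5) gives
    -1/N_tilde * sum_n alpha_n * sigma(z_{nj}) * x_n for each row w_j."""
    S = sigmoid(X_model @ W.T + b)              # sigma(z_{nj}), shape (N_tilde, h)
    return -(S * alpha[:, None]).T @ X_model / len(alpha)

# Illustrative usage on random binary data (shapes only; not a trained RBM).
rng = np.random.default_rng(0)
d, h, N, N_tilde = 16, 8, 50, 50
X_data = rng.integers(0, 2, size=(N, d)).astype(float)         # training sample
X_model = rng.integers(0, 2, size=(N_tilde, d)).astype(float)   # PCD-like model sample
W, b = 0.01 * rng.standard_normal((h, d)), np.zeros(h)

H = (X_model[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
D = H / H.mean()                                # Hamming cost, roughly normalized (cf. Sec. 4.1)
alpha = sinkhorn_alpha(np.full(N_tilde, 1.0 / N_tilde), np.full(N, 1.0 / N), D)
grad_W = wasserstein_grad_W(alpha, X_model, W, b)   # shape (h, d)
```

In an actual training loop, X_model would be the PCD sample refreshed by alternate Gibbs sampling, and grad_W would enter the descent step on the regularized objective described next.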
In comparison, the gradient of the KL divergence is given by

$$\hat\nabla_{w_j} \mathrm{KL}(\hat p \,\|\, p_\theta) = \big\langle \sigma(z_j) \, x \big\rangle_{\hat p_\theta} - \big\langle \sigma(z_j) \, x \big\rangle_{\hat p}.$$

While the Wasserstein gradient can, like the KL gradient, be expressed in a very simple form, the former is not sum-decomposable. A simple manifestation of this non-decomposability occurs for $\tilde N = 1$ (the smallest possible sample size): in that case, $\hat\alpha^\star(\tilde x_n) = 0$ due to the centering constraint (see Section 2), thus making the gradient zero.

Stability and KL Regularization
Unlike the KL gradient, the Wasserstein gradient depends only indirectly on the data distribution $\hat p$. This is a problem when the sample $\hat p_\theta$ generated by the model strongly differs from the examples coming from $\hat p$, because there is no weighting $(\hat\alpha^\star(\tilde x_n))_n$ of the generated sample that can represent the desired direction in the parameter space $\Theta$. In that case, the Wasserstein gradient will point to a bad local minimum. Closeness between the two empirical samples can, from this optimization perspective, be ensured by adding a regularization term to the objective that incorporates both the usual quadratic containment term and a KL term, which forces proximity to $\hat p$ due to the direct dependence of its gradient on it. The optimization problem becomes

$$\min_{\theta \in \Theta} \; W_\gamma(\hat p, p_\theta) + \lambda \, \Omega(\theta) \quad \text{with} \quad \Omega(\theta) = \mathrm{KL}(\hat p \,\|\, p_\theta) + \eta \, \big( \|a\|^2 + \textstyle\sum_j \|w_j\|^2 \big),$$

starting at the point $\theta_0 = \arg\min_{\theta \in \Theta} \Omega(\theta)$, and where $\lambda, \eta$ are two regularization hyperparameters that must be selected. Determining the starting point $\theta_0$ is analogous to having an initial pretraining phase. Thus, the proposed Wasserstein procedure can also be seen as finetuning a standard RBM, while forcing the finetuning not to deviate too much from the pretrained solution.

4 Experiments

We perform several experiments that demonstrate that Wasserstein-trained RBMs learn distributions that are better from a metric perspective. First, we explore the main characteristics of a distribution learned by optimizing the Wasserstein objective. Then, we investigate the usefulness of the learned models on practical problems, such as data completion and denoising, where the metric between observations occurs in the performance evaluation. We consider three datasets: MNIST-small, a subsampled version of the original MNIST dataset [11] with only the digits "0" retained; a subset of the UCI PLANTS dataset [19], containing the geographical spread of plant species; and MNIST-code, 128-dimensional code vectors associated to each MNIST digit (additional details in the supplement).

4.1 Training, Validation and Evaluation

All RBM models that we investigate are trained in full batch mode, using for $\hat p_\theta$ the PCD approximation [18] of $p_\theta$, where the sample is refreshed at each gradient update by one step of alternate Gibbs sampling, starting from the sample at the previous time step. We choose a PCD sample of the same size as the training set ($\tilde N = N$). The coefficients $\hat\alpha^\star(\tilde x_1), \dots, \hat\alpha^\star(\tilde x_{\tilde N})$ occurring in the Wasserstein gradient are obtained by solving the smoothed Wasserstein dual between $\hat p$ and $\hat p_\theta$, with smoothing parameter $\gamma = 0.1$ and distance $D(x, x') = H(x, x') / \langle H(x, x') \rangle_{\hat p}$, where $H$ denotes the Hamming distance between two binary vectors. We use the centered parameterization of the RBM for gradient descent [13, 3]. We perform holdout validation on the quadratic containment coefficient $\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ and on the KL weighting coefficient $\lambda \in \{0, 10^{-1}, 10^0, 10^1, \infty\}$. The number of hidden units of the RBM is set heuristically to 400 for all datasets. The learning rate is set heuristically to $0.01\,\lambda^{-1}$ during the pretraining phase and modified to $0.01 \cdot \min(1, \lambda^{-1})$ when training on the final objective.
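As a companion to the previous sketch, the following hypothetical snippet spells out two ingredients of the setup just described: the normalized Hamming cost matrix and one descent step on the regularized objective $W_\gamma + \lambda\,\Omega$. Reading the normalization $\langle H \rangle_{\hat p}$ as the mean pairwise Hamming distance within the data sample is our assumption, as are the function names.

```python
import numpy as np

def hamming_cost(X_model, X_data):
    """Cost matrix D(x, x') = H(x, x') / <H(x, x')>_p_hat of Section 4.1.
    The normalization is taken here as the mean pairwise Hamming distance
    within the data sample (our interpretation)."""
    H = (X_model[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
    H_data = (X_data[:, None, :] != X_data[None, :, :]).sum(-1).astype(float)
    return H / H_data.mean()

def regularized_step(W, grad_W_wass, grad_W_kl, lam, eta, lr):
    """One descent step on W_gamma(p_hat, p_theta) + lam * Omega(theta), with
    Omega = KL(p_hat || p_theta) + eta * ||W||^2 (biases omitted for brevity).
    grad_W_wass and grad_W_kl are assumed to be computed as in the previous
    sketch and from the PCD statistics, respectively."""
    return W - lr * (grad_W_wass + lam * (grad_W_kl + 2.0 * eta * W))
```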
The Wasserstein distance $W_\gamma(\hat p_\theta, \hat p)$ is computed between the whole test distribution and the PCD sample at the end of the training procedure. This sample is a fast approximation of the true unbiased sample, which would otherwise have to be generated by annealing or enumeration of the states (see the supplement for a comparison of PCD and AIS samples).

4.2 Results and Analysis

The contour plots of Figure 2 show the effect of the hyperparameters $\lambda$ and $\eta$ on the Wasserstein distance. For $\lambda = \infty$, only the KL regularizer is active, which is equivalent to minimizing a standard RBM. As we reduce the amount of regularization, the Wasserstein distance becomes effectively minimized and thus smaller. If $\lambda$ is chosen too small, the Wasserstein distance increases again, for the stability reasons mentioned in Section 3. In all our experiments, we observed that KL pretraining was necessary in order to reach a low Wasserstein distance; not doing so leads to degenerate solutions. The relation between hyperparameters and minimization criteria is consistent across datasets: in all cases, the Wasserstein RBM produces a lower Wasserstein distance than a standard RBM.

Figure 2: Wasserstein distance as a function of the hyperparameters $\lambda$ and $\eta$ (one panel per dataset: MNIST-small, PLANTS, MNIST-code). The best RBMs in the Wasserstein sense (RBM-W) are shown in red. The best RBMs in the standard sense (i.e. with $\lambda$ forced to $+\infty$, and minimum KL) are shown in blue.

Samples generated by the standard RBM and the Wasserstein RBM (more precisely, their PCD approximation) are shown in Figure 3. The RBM-W produces a reduced set of clean prototypical examples, with less noise than those produced by a regular RBM. All zeros generated by the RBM-W have well-defined contours and a round shape, but do not reproduce the variety of shapes present in the data. Similarly, the plant territorial spreads generated by the RBM-W form compact and contiguous regions that are prototypical of real spreads, but are less diverse than the data or the sample generated by the standard RBM. Finally, the RBM-W generates codes that, when decoded, are closer to actual MNIST digits.

Figure 3: Examples generated by the standard and the Wasserstein RBMs on MNIST-small, PLANTS and MNIST-code. (Images for the PLANTS dataset are automatically generated from the Wikimedia Commons template https://commons.wikimedia.org/wiki/File:Blank_Map-USA-states-Canada-provinces.svg created by user Lokal_Profil.) Images for MNIST-code are produced by the decoders shown on the right.

The PCA plots of Figure 4 superimpose on the true data distribution (in gray) the distributions generated by the standard RBM (in blue) and the Wasserstein RBM (in red). In particular, the plots show the projected distributions on the first two PCA components of the true distribution. While the standard RBM distribution uniformly covers the data, the one generated by the RBM-W consists of a finite set of small dense clusters that are scattered across the input distribution. In other words, the Wasserstein model is biased towards these clusters, and systematically ignores other regions.
Figure 4: Top: Two-dimensional PCA comparison of the distributions learned by the RBM and the RBM-W with smoothing parameter $\gamma = 0.1$, for MNIST-small, PLANTS and MNIST-code. Plots are obtained by projecting the learned distributions on the first two components of the true distribution. Bottom: RBM-W distributions obtained by varying the parameter $\gamma$ from small to large.

At the bottom of Figure 4, we analyze the effect of the Wasserstein smoothing parameter $\gamma$ on the learned distribution, with $\gamma = 0.025, 0.05, 0.1, 0.2, 0.4$. We observe on all datasets that the stronger the smoothing, the stronger the shrinkage effect. Although the KL-generated distributions shown in blue may look better (the red distribution strongly departs visually from the data distribution), the red distribution is actually superior if the smooth Wasserstein distance is taken as the performance metric, as shown in Figure 2.

4.3 Validating the Shrinkage Effect

To verify that the shrinkage effect observed in Figure 4 is not a training artefact, but a truly expected property of the modeled distribution, we analyze this effect for a simple distribution for which the parameter space can be enumerated. Figure 5 plots the Wasserstein distance between samples of size 100 of a 10-dimensional Gaussian distribution $p \sim N(0, I)$ and of a parameterized model of that distribution $p_\theta \sim N(0, \theta^2 I)$, where $\theta \in [0, 1]$. The parameter $\theta$ can be interpreted as a shrinkage parameter. The Wasserstein distance is computed using the cityblock or the euclidean metric, both rescaled such that the expected distance between pairs of points is 1.

Figure 5: Wasserstein distance $W_\gamma(\hat p_\theta, \hat p)$ between a sample $\hat p \sim N(0, I)$ and a sample $\hat p_\theta \sim N(0, \theta^2 I)$, for model parameters $\theta \in [0, 1]$, smoothing $\gamma \in \{0, 0.01, 0.03, 0.1, 0.32, 1\}$, and the cityblock or the euclidean metric.

Interestingly, for all choices of the Wasserstein smoothing parameter $\gamma$, and even for the true Wasserstein distance ($\gamma = 0$, computed here using the OpenCV library), the best model $p_\theta$ in the empirical Wasserstein sense is a shrunken version of $p$ (i.e. with $\theta < 1$). When the smoothing is strong enough, the best parameter becomes $\theta = 0$ (i.e. a Dirac distribution located at the origin). Overall, this experiment gives a training-independent validation for our observation that Wasserstein RBMs learn shrunken, cluster-like distributions. Note that the finite sample size prevents the Wasserstein distance from reaching zero, and always favors shrunken models.
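The shrinkage experiment of Figure 5 can be approximately reproduced with a few lines of NumPy/SciPy. The sketch below is our own and simplifies the protocol: it reports only the transport cost of the smoothed coupling (the $-\gamma H(\pi)$ term of Equation (2) is dropped), uses the cityblock metric and a sample size of 100, and rescales distances by the mean pairwise distance of the data sample.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sinkhorn_cost(D, gamma, n_iter=2000):
    """Transport cost <D>_pi of the entropy-smoothed coupling between two
    uniform empirical measures with cost matrix D (Sinkhorn fixed point);
    the -gamma*H(pi) term of Eq. (2) is omitted here for brevity."""
    n, m = D.shape
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-D / gamma)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = p / (K @ v)
        v = q / (K.T @ u)
    pi = u[:, None] * K * v[None, :]             # smoothed optimal coupling
    return (pi * D).sum()

rng = np.random.default_rng(1)
d, n, gamma = 10, 100, 0.1                        # dimension, sample size, smoothing (as in Fig. 5)
X = rng.standard_normal((n, d))                   # sample from N(0, I)
base = rng.standard_normal((n, d))

scale = cdist(X, X, metric="cityblock").mean()    # rough rescaling so typical distances are ~1
thetas = np.linspace(0.0, 1.0, 21)
dists = []
for theta in thetas:
    Y = theta * base                              # sample from N(0, theta^2 I)
    D = cdist(Y, X, metric="cityblock") / scale
    dists.append(sinkhorn_cost(D, gamma))

print("best theta:", thetas[int(np.argmin(dists))])   # typically < 1, i.e. a shrunken model wins
```

Rerunning the loop for larger $\gamma$ should show the minimizer drifting towards $\theta = 0$, consistent with the behaviour reported in Figure 5.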
4.4 Data Completion and Denoising

In order to demonstrate the practical relevance of Wasserstein distance minimization, we apply the learned models to the tasks of data completion and data denoising, for which the use of a metric is crucial: completion and denoising performance is generally measured in terms of a distance between the true data and the completed or denoised data (e.g. the Euclidean distance for real-valued data, or the Hamming distance $H$ for binary data). Remotely located probability mass that may result from simple KL minimization would incur a severe penalty on the completion and denoising performance metric. Both tasks have useful practical applications: data completion can be used as a first step when applying discriminative learning (e.g. neural networks or SVMs) to data with missing features, and data denoising can be used as a dimensionality reduction step before training a supervised model. Let the input $x = [v, h]$ be composed of $d - k$ visible variables $v$ and $k$ hidden variables $h$.

Data Completion
The setting of the data completion experiment is illustrated in Figure 6 (top). The distribution $p_\theta(x \mid v)$ over possible reconstructions can be sampled from using an alternate Gibbs sampler, or by enumeration. The expected Hamming distance between the true state $x$ and the reconstructed state modeled by the distribution $p_\theta(x \mid v)$ is obtained by iterating over the $2^k$ possible reconstructions:

$$E = \sum_{x'} p_\theta(x' \mid v) \, H(x, x'),$$

where the sum runs over the reconstructions $x' = [v, h']$ with $h' \in \{0, 1\}^k$. Since the reconstruction is a probability distribution, we can compute not only the expected Hamming error, but also its bias-variance decomposition. On MNIST-small, we hide randomly located image patches of size $3 \times 3$ (i.e. $k = 9$). On PLANTS and MNIST-code, we hide random subsets of $k = 9$ variables. Results are shown in Figure 7 (left), where we compare three types of models: kernel density estimation (KDE), the standard RBM (RBM) and the Wasserstein RBM (RBM-W). The KDE model uses a Gaussian kernel, with the Gaussian scale parameter chosen such that the KL divergence of the model from the validation data is minimized. The RBM-W is better than or comparable to the other models. Of particular interest is the structure of the expected Hamming error: for the standard RBM, a large part of the error comes from the variance (or entropy), while for the Wasserstein RBM, the bias term contributes the most. This can be related to what is observed in Figure 4: for a data point outside the area covered by the red points, the reconstruction is systematically redirected towards the nearest red cluster, thus incurring a systematic bias.

Figure 6: Illustration of the completion and denoising setups. For each image, we select a known subset of pixels, which we hide (completion) or corrupt with noise (denoising). Each possible reconstruction has a particular Hamming distance to the original example. The expected Hamming error is computed by weighting the Hamming distances by the probability that the model assigns to the reconstructions.

Figure 7: Performance on the completion and denoising tasks of kernel density estimation, the standard RBM and the Wasserstein RBM on MNIST-small, PLANTS and MNIST-code. The total length of the bars is the expected Hamming error. Dark gray and light gray sections of the bars give the bias-variance decomposition.

Data Denoising
Here, we consider a simple noise process in which, for a predefined subset of $k$ variables denoted by $h$, a known number $l$ of bit flips occur randomly; the remaining $d - k$ variables are denoted by $v$. The setting of the experiment is illustrated in Figure 6 (bottom). Calling $x$ the original example and $\tilde x$ its noisy version resulting from flipping $l$ variables of $h$, the expected Hamming error is obtained by iterating over the $\binom{k}{l}$ states $x'$ that have the same visible variables $v$ and that are at distance $l$ from $\tilde x$:

$$E = \sum_{x'} p_\theta\big(x' \mid v, \, H(x', \tilde x) = l\big) \, H(x, x'),$$

where again $x' = [v, h']$ with $h' \in \{0, 1\}^k$. Note that the original example $x$ is necessarily part of this set of states under the noise model assumption.
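Both expectations can be computed by direct enumeration, since $k = 9$ hidden pixels give only $2^9 = 512$ candidate reconstructions. The sketch below is our own illustration of the completion case, with a randomly initialized RBM standing in for a trained one; each reconstruction is weighted by the conditional $p_\theta(x' \mid v) \propto e^{-F_\theta(x')}$, in which the partition function cancels. The denoising case is identical except that the enumeration is restricted to the $\binom{k}{l}$ candidates at Hamming distance $l$ from the noisy input.

```python
import itertools
import numpy as np

def free_energy(x, a, W, b):
    """RBM free energy F_theta(x) = -a^T x - sum_j log(1 + exp(w_j^T x + b_j))."""
    return -x @ a - np.logaddexp(0.0, W @ x + b).sum()

def expected_hamming_completion(x_true, hidden_idx, a, W, b):
    """Expected Hamming error for completion: enumerate the 2^k reconstructions
    x' = [v, h'] that agree with x_true outside hidden_idx, weight each by
    p_theta(x'|v) proportional to exp(-F_theta(x')) (the partition function
    cancels), and average the Hamming distances H(x, x')."""
    k = len(hidden_idx)
    recs, neg_free = [], []
    for bits in itertools.product([0.0, 1.0], repeat=k):
        x_rec = x_true.copy()
        x_rec[hidden_idx] = bits                         # candidate reconstruction
        recs.append(x_rec)
        neg_free.append(-free_energy(x_rec, a, W, b))
    neg_free = np.array(neg_free)
    w = np.exp(neg_free - neg_free.max())                # softmax over -F_theta
    w /= w.sum()                                         # p_theta(x' | v)
    ham = np.array([(r != x_true).sum() for r in recs])  # H(x, x')
    return float((w * ham).sum())                        # E = sum_x' p(x'|v) H(x, x')

# Illustrative usage on a random (untrained) RBM, with k = 9 hidden pixels as in the paper.
rng = np.random.default_rng(0)
d, h, k = 64, 16, 9
a = 0.1 * rng.standard_normal(d)
W = 0.1 * rng.standard_normal((h, d))
b = np.zeros(h)
x_true = rng.integers(0, 2, size=d).astype(float)
hidden_idx = rng.choice(d, size=k, replace=False)
print(expected_hamming_completion(x_true, hidden_idx, a, W, b))
```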
For the MNIST-small data, we choose randomly located image patches of size $4 \times 3$ or $3 \times 4$ (i.e. $k = 12$), and generate $l = 4$ random bit flips within the selected patch. For PLANTS and MNIST-code, we generate $l = 4$ bit flips in $k = 12$ randomly preselected input variables. Figure 7 (right) shows the denoising error in terms of expected Hamming distance on the same datasets. The RBM-W is better than or comparable to the other models. As for the completion task, the main difference between the two RBMs is the bias/variance ratio, where again the Wasserstein RBM tends to have a larger bias. This experiment considered a very simple noise model consisting of a fixed number $l$ of random bit flips over a small predefined subset of variables. Denoising highly corrupted complex data will however require combining Wasserstein models with more flexible noise models, such as the ones proposed by [17].

5 Conclusion

We have introduced a new objective for restricted Boltzmann machines (RBMs) based on the smooth Wasserstein distance. We derived the gradient of the Wasserstein distance from its dual formulation, and used it to effectively train an RBM. Unlike the usual Kullback-Leibler (KL) divergence, our Wasserstein objective takes into account the metric of the data. In all considered scenarios, the Wasserstein RBM produced distributions that strongly departed from those of standard RBMs, and outperformed them on practical tasks such as completion or denoising. More generally, we demonstrated empirically that, when learning probability densities, the reliance on distributions that indirectly incorporate the desired metric can be replaced by training procedures that make the desired metric directly part of the learning objective. Thus, Wasserstein training can be seen as a more direct approach to density estimation than regularized KL training. Future work will aim to further explore the interface between Boltzmann learning and Wasserstein minimization, with the aim of scaling the newly proposed learning technique to larger and more complex data distributions.

Acknowledgements
This work was supported by the Brain Korea 21 Plus Program through the National Research Foundation of Korea funded by the Ministry of Education. This work was also supported by the DFG grant MU 987/17-1. M. Cuturi gratefully acknowledges the support of JSPS young researcher A grant 26700002. Correspondence to GM, KRM and MC.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[2] F. Bassetti, A. Bodini, and E. Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
[3] K. Cho, T. Raiko, and A. Ilin. Enhanced gradient for training restricted Boltzmann machines. Neural Computation, 25(3):805–831, 2013.
[4] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[6] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 685–693, 2014.
[7] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton.
Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 469–477, 2010.
[8] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pages 2044–2052, 2015.
[9] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[10] P. J. Huber. Robust Statistics. Springer, 2011.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[12] B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 509–516, 2010.
[13] G. Montavon and K.-R. Müller. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, Second Edition, LNCS, pages 621–637. Springer, 2012.
[14] Y. Rubner, L. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA Image Understanding Workshop, pages 661–668, 1997.
[15] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 448–455, 2009.
[16] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.
[17] Y. Tang, R. Salakhutdinov, and G. E. Hinton. Robust Boltzmann machines for recognition and denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2271, 2012.
[18] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 1064–1071, 2008.
[19] United States Department of Agriculture. The PLANTS Database, 2012.
[20] C. Villani. Optimal Transport: Old and New, volume 338. Springer-Verlag, 2009.