# Source Separation with Deep Generative Priors

Vivek Jayaram* and John Thickstun*

*Equal contribution. Paul G. Allen School of Computer Science and Engineering, University of Washington. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.

Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and noise-annealed Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.

## 1. Introduction

The single-channel source separation problem (Davies & James, 2007) asks us to decompose a mixed signal $m \in X$ into a linear combination of $k$ components $x_1, \dots, x_k \in X$ with scalar mixing coefficients $\alpha_i \in \mathbb{R}$:

$$m = \sum_{i=1}^{k} \alpha_i x_i. \tag{1}$$

This is motivated by, for example, the cocktail party problem of isolating the utterances of individual speakers $x_i$ from an audio mixture $m$ captured at a busy party, where multiple speakers are talking simultaneously.

With no further constraints or regularization, solving Equation (1) for $x$ is highly underdetermined. Classical blind approaches to single-channel source separation resolve this ambiguity by privileging solutions to (1) that satisfy mathematical constraints on the components $x$, such as statistical independence (Davies & James, 2007), sparsity (Lee et al., 1999), or non-negativity (Lee & Seung, 1999). These constraints can be viewed as weak priors on the structure of sources, but the approaches are blind in the sense that they do not require adaptation to a particular dataset.

Recently, most work has taken a data-driven approach. To separate a mixture of sources, it is natural to suppose that we have access to samples $x$ of individual sources, which can be used as a reference for what the source components of a mixture are supposed to look like. This data can be used to regularize solutions of Equation (1) towards structurally plausible solutions. The prevailing way to do this is to construct a supervised regression model that maps an input mixture $m$ to components $x_i$ (Huang et al., 2014; Halperin et al., 2019). Paired training data $(m, x)$ can be constructed by summing randomly chosen samples from the component distributions $x_i$ and labeling these mixtures with the ground truth components.

Instead of regressing against components $x$, we use samples to train a generative prior $p(x)$; we separate a mixed signal $m$ by sampling from the posterior distribution $p(x|m)$. For some mixtures this posterior is quite peaked, and sampling from $p(x|m)$ recovers the only plausible separation of $m$ into likely components.
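For concreteness, here is a minimal sketch of the mixing model (1); the tensor shapes and the `mix` helper are our own illustrative choices, not part of the paper's released code:

```python
import torch

def mix(components: torch.Tensor, alphas: torch.Tensor) -> torch.Tensor:
    """Form the mixture m = sum_i alpha_i * x_i of Equation (1).

    components: shape (k, C, H, W), the sources x_1, ..., x_k
    alphas:     shape (k,), the scalar mixing coefficients
    """
    return torch.einsum('k,kchw->chw', alphas, components)

# Paired data (m, x) for supervised baselines, or test mixtures for our
# sampling approach, come from summing randomly chosen dataset samples:
x = torch.rand(2, 3, 32, 32)            # two hypothetical CIFAR-sized sources
m = mix(x, torch.tensor([1.0, 1.0]))    # an equally weighted sum
```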
But in many cases, mixtures are highly ambiguous: see, for example, the orange-highlighted MNIST images in Figure 1. This motivates our interest in sampling, which explores the space of plausible separations. In Section 3 we introduce a procedure for sampling from the posterior, an extension of the noise-annealed Langevin dynamics introduced in Song & Ermon (2019), which we call Bayesian Annealed SIgnal Source (BASIS) separation.

Ambiguous mixtures pose a challenge for traditional source separation metrics, which presume that the original mixture components are identifiable and compare the separated components to ground truth. For ambiguous mixtures of rich data, we argue that recovery of the original mixture components is not a well-posed problem. Instead, the problem we aim to solve is finding components of a mixture that are consistent with a particular data distribution. Motivated by this perspective, we discuss evaluation metrics in Section 4.

Figure 1. Separation results for mixtures of four images from the MNIST dataset (left) and two images from the CIFAR-10 dataset (right), using BASIS with the NCSN (Song & Ermon, 2019) generative model as a prior over images. We draw attention to the central panel of the MNIST results (highlighted in orange), which shows how a mixture can be separated in multiple ways.

Formulating the source separation problem in a Bayesian framework decouples the problem of source generation from source separation. This allows us to leverage pre-trained, state-of-the-art, likelihood-based generative models as prior distributions, without requiring architectural modifications to adapt these models for source separation. Examples of source separation using noise-conditioned score networks (NCSN) (Song & Ermon, 2019) as a prior are presented in Figure 1. Further separation results using NCSN and Glow (Kingma & Dhariwal, 2018) are presented in Section 5.

## 2. Related Work

**Blind separation.** Work on blind source separation is data-agnostic, relying on generic mathematical properties to privilege particular solutions to (1) (Comon, 1994; Bell & Sejnowski, 1995; Davies & James, 2007; Huang et al., 2012). Because blind methods have no access to sample components, they face the challenging task of modeling the distribution over unobserved components while simultaneously decomposing mixtures into likely components. It is difficult to fit a rich model to latent components, so blind methods often rely on simple models such as dictionaries to capture the structure of these components.

One promising recent work in the blind setting is Double-DIP (Gandelsman et al., 2019). This work leverages the unsupervised Deep Image Prior (Ulyanov et al., 2018) as a prior over signal components, similar to our use of a trained generative model. But the authors of this work document fundamental obstructions to applying their method to single-channel source separation; they propose using multiple image frames from a video, or multiple mixtures of the same components with different mixing coefficients α. This multiple-mixture approach is common to much of the work on blind separation. In contrast, our approach is able to separate components from a single mixture.

**Supervised regression.** Regression models for source separation learn to predict components for a mixture using a dataset of mixed signals labeled with ground truth components.
This approach has been extensively studied for separation of images (Halperin et al., 2019), audio spectrograms (Huang et al., 2014; 2015; Nugraha et al., 2016; Jansson et al., 2017), and raw audio (Lluis et al., 2019; Stoller et al., 2018b; Défossez et al., 2019), as well as more exotic data domains, e.g. medical imaging (Nishida et al., 1999). By learning to predict components (or, equivalently, masks on a mixture) this approach implicitly builds a generative model of the signal components. This connection is made more explicit in recent work that uses GANs to force components emitted by a regression model to match the distribution of a given dataset (Zhang et al., 2018; Stoller et al., 2018a).

The supervised approach takes advantage of expressive deep models to capture a strong prior over signal components. But it requires specialized model architectures trained specifically for the source separation task. In contrast, our approach leverages standard, pre-trained generative models for source separation. Furthermore, our approach can directly exploit ongoing advances in likelihood-based generative modeling to improve separation results.

**Signal dictionaries.** Much work on source separation is based on the concept of a signal dictionary, most notably the line of work based on non-negative matrix factorization (NMF) (Lee & Seung, 2001). These approaches model signals as combinations of elements in a latent dictionary. Decomposing a mixture into dictionary elements can be used for source separation by (1) clustering the elements of the dictionary and (2) reconstituting a source using elements of the decomposition associated with a particular cluster (see the sketch at the end of this section). Dictionaries are typically learned from data of each source type and combined into a joint dictionary, clustered by source type (Schmidt & Olsson, 2006; Virtanen, 2007). The blind setting has also been explored, where the clustering is obtained without labels by e.g. k-means (Spiertz & Gnann, 2009). Recent work explores more expressive decomposition models, replacing the linear decompositions used in NMF with expressive neural autoencoders (Smaragdis & Venkataramani, 2017; Venkataramani et al., 2017).

When the dictionary is learned with supervision from labeled sources, dictionary clusters can be interpreted as implicit priors on the distributions over components. Our approach makes these priors explicit, and works with generic priors that are not tied to the dictionary model. Furthermore, our method can separate mixed sources of the same type, whereas mixtures of sources with similar structure present a conceptual difficulty for dictionary-based methods.

**Generative adversarial separation.** Recent work by Subakan & Smaragdis (2018) and Kong et al. (2019) explores the intriguing possibility of optimizing $x$ given a mixture $m$ to satisfy (1), where components $x_i$ are constrained to the manifold learned by a GAN. The GAN is pre-trained to model a distribution over components. Like our method, this approach leverages modern deep generative models in a way that decouples generation from source separation. We view this work as a natural analog to our likelihood-based approach in the GAN setting.

**Likelihood-based approaches.** Our approach is similar in spirit to older ideas based on maximum a posteriori estimation (Geman & Geman, 1984), likelihood maximization (Pearlmutter & Parra, 1997; Roweis, 2001), and Bayesian source separation (Benaroya et al., 2005). We build upon their insights, with the advantage of increased computational resources and modern expressive generative models.
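To make the dictionary-based baseline concrete, the sketch below illustrates supervised NMF separation, assuming per-source dictionaries `W1` and `W2` have already been learned from unmixed training data (e.g. by NMF on each source's samples); this is our illustration of the classical method, not code from any of the cited works:

```python
import numpy as np

def fit_activations(V, W, n_iter=200, eps=1e-9):
    """Fit non-negative activations H so that V ~= W @ H, holding the
    dictionary W fixed (Lee & Seung multiplicative updates, squared loss)."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def nmf_separate(V, W1, W2):
    """Separate mixture V by decomposing it over the joint dictionary
    [W1 | W2] and reconstituting each source from its own atoms."""
    W = np.hstack([W1, W2])
    H = fit_activations(V, W)
    k1 = W1.shape[1]
    return W1 @ H[:k1], W2 @ H[k1:]   # reconstructions for sources 1 and 2
```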
## 3. BASIS Separation

We consider the following generative model of a mixed signal $m$, relaxing the mixture constraint $g(x) = m$, where $g(x) = \sum_{i=1}^{k} \alpha_i x_i$ is the mixing function of Equation (1), to a soft Gaussian approximation:

$$m \sim \mathcal{N}\!\left(g(x),\, \gamma^2 I\right). \tag{3}$$

This defines a joint distribution $p_\gamma(x, m) = p(x)\,p_\gamma(m|x)$ over signal components $x$ and mixtures $m$, and a corresponding posterior distribution

$$p_\gamma(x|m) = p(x)\,p_\gamma(m|x)\,/\,p_\gamma(m). \tag{4}$$

In the limit as $\gamma^2 \to 0$, we recover the hard constraint on the mixture $m$ given by Equation (1). BASIS separation (Algorithm 1) presents an approach to sampling from (4) based on the discussion in Sections 3.1 and 3.2. In Section 3.3 we discuss the behavior of the gradients $\nabla_x \log p(x)$, which motivates some of the hyper-parameter choices in Section 3.4. We describe a procedure to construct the noisy models $p_{\sigma_i}$ required for BASIS in Section 3.5.

Algorithm 1: BASIS Separation

    Input: m ∈ X, {σ_i}_{i=1}^L, δ, T
    Sample x_1, ..., x_k ~ Uniform(X)
    for i = 1 to L do
        η_i ← δ · σ_i² / σ_L²
        for t = 1 to T do
            Sample ε_t ~ N(0, I)
            u⁽ᵗ⁾ ← x⁽ᵗ⁾ + η_i ∇_x log p_{σ_i}(x⁽ᵗ⁾) + √(2η_i) ε_t
            x⁽ᵗ⁺¹⁾ ← u⁽ᵗ⁾ + (η_i / σ_i²) · Diag(α) (m − g(x⁽ᵗ⁾))
        end for
    end for

### 3.1. Langevin dynamics

Sampling from the posterior distribution $p_\gamma(x|m)$ looks formidable; just computing Equation (4) requires evaluation of the partition function $p_\gamma(m)$. But using Langevin dynamics (Neal, 2011; Welling & Teh, 2011) we can sample $x \sim p_\gamma(\cdot\,|m)$ while avoiding explicit computation of $p_\gamma(x|m)$. Let $x^{(0)} \sim \mathrm{Uniform}(X)$, $\varepsilon_t \sim \mathcal{N}(0, I)$, and define a sequence

$$x^{(t+1)} = x^{(t)} + \eta\,\nabla_x \log p_\gamma(x^{(t)}|m) + \sqrt{2\eta}\,\varepsilon_t = x^{(t)} + \eta\,\nabla_x\!\left[\log p(x^{(t)}) - \frac{1}{2\gamma^2}\left\|m - g(x^{(t)})\right\|^2\right] + \sqrt{2\eta}\,\varepsilon_t. \tag{5}$$

Observe that $\nabla_x \log p_\gamma(m) = 0$, so this term is not required to compute (5). By standard analysis of Langevin dynamics, as the step size $\eta \to 0$, $\lim_{t \to \infty} D_{\mathrm{KL}}\!\left(x^{(t)} \,\|\, x|m\right) = 0$, under regularity conditions on the distribution $p_\gamma(x|m)$.

If the prior $p(x)$ is parameterized by a neural model, then gradients $\nabla_x \log p(x)$ can be computed by automatic differentiation with respect to the inputs of the generator network. This family of likelihood-based models includes autoregressive models (Salimans et al., 2017; Parmar et al., 2018), the variational autoencoder (Kingma & Welling, 2014; van den Oord et al., 2017), and flow-based models (Dinh et al., 2017; Kingma & Dhariwal, 2018). Alternatively, if gradients of the distribution are modeled directly (Song & Ermon, 2019), then $\nabla_x \log p(x)$ can be used as-is.

### 3.2. Accelerated mixing

To accelerate mixing of (5) we adopt a simulated annealing schedule over noisy approximations to the model $p(x)$, extending the unconditional sampling algorithm proposed in Song & Ermon (2019) to accelerate sampling from the posterior distribution $p_\gamma(x|m)$. Let $p_\sigma(x)$ denote the distribution of $x + \epsilon_\sigma$ for $x \sim p$ and $\epsilon_\sigma \sim \mathcal{N}(0, \sigma^2 I)$. We define the noisy joint likelihood $p_{\sigma,\gamma}(x, m) \equiv p_\sigma(x)\,p_\gamma(m|x)$, which induces a noisy posterior approximation $p_{\sigma,\gamma}(x|m)$. At high noise levels σ, $p_\sigma(x)$ is approximately Gaussian and irreducible, so the Langevin dynamics (5) will mix quickly. And as $\sigma \to 0$, $D_{\mathrm{KL}}(p_\sigma \,\|\, p) \to 0$. This motivates defining the modified Langevin dynamics

$$x^{(t+1)} = x^{(t)} + \eta\,\nabla_x \log p_{\sigma,\gamma}(x^{(t)}|m) + \sqrt{2\eta}\,\varepsilon_t. \tag{6}$$

The dynamics (6) approximate samples from $p(x \mid g(x) = m)$ as $\eta \to 0$, $\gamma^2 \to 0$, $\sigma^2 \to 0$, and $t \to \infty$. An implementation of these dynamics, annealing η, γ², and σ² as $t \to \infty$ according to the hyper-parameter settings presented in Section 3.4, is given in Algorithm 1.
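To make the procedure concrete, here is a minimal PyTorch sketch of Algorithm 1 (ours); the `score` callable is an assumed interface returning $\nabla_x \log p_\sigma(x)$ (for NCSN, the score network itself; for Glow, gradients of the fine-tuned log-density of Section 3.5), and we take γ² = σᵢ² as in Section 3.4:

```python
import torch

def basis_separate(m, score, alphas, sigmas, delta=2e-5, T=100):
    """Sketch of BASIS separation (Algorithm 1).

    m:      mixture, shape (C, H, W)
    score:  callable approximating grad_x log p_sigma(x) at noise level sigma
    alphas: mixing coefficients, tensor of shape (k,)
    sigmas: decreasing noise levels sigma_1 > ... > sigma_L
    """
    k = len(alphas)
    x = torch.rand(k, *m.shape)                     # x_i ~ Uniform(X)
    for sigma in sigmas:
        eta = delta * sigma**2 / sigmas[-1]**2      # eta_i = delta * sigma_i^2 / sigma_L^2
        for _ in range(T):
            eps = torch.randn_like(x)
            # Langevin step under the noisy prior p_sigma:
            u = x + eta * score(x, sigma) + (2 * eta) ** 0.5 * eps
            # Gradient of the Gaussian mixture likelihood, with gamma^2 = sigma^2
            # (Section 3.4); the residual m - g(x) is shared across components:
            residual = m - torch.einsum('k,kchw->chw', alphas, x)
            x = u + (eta / sigma**2) * alphas.view(-1, 1, 1, 1) * residual
    return x
```

Note that the components are updated jointly: the prior term pulls each component toward the data manifold, while the likelihood term redistributes the residual $m - g(x)$ across components in proportion to $\alpha_i$.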
We anneal η, γ², and σ² using a heuristic introduced in Song & Ermon (2019): the idea is to maintain a constant signal-to-noise ratio (SNR) between the expected size of the posterior log-likelihood gradient term $\eta\,\nabla_x \log p_{\sigma,\gamma}(x|m)$ and the expected size of the Langevin noise $\sqrt{2\eta}\,\varepsilon$:

$$\mathrm{SNR} = \frac{\mathbb{E}\left[\left\|\eta\,\nabla_x \log p_{\sigma,\gamma}(x|m)\right\|^2\right]}{\mathbb{E}\left[\left\|\sqrt{2\eta}\,\varepsilon\right\|^2\right]} \propto \eta\,\mathbb{E}\left[\left\|\nabla_x \log p_\gamma(m|x) + \nabla_x \log p_\sigma(x)\right\|^2\right]. \tag{7}$$

Assuming that gradients w.r.t. the likelihood and the prior are uncorrelated, the SNR is approximately proportional to

$$\eta\,\mathbb{E}\left[\left\|\nabla_x \log p_\gamma(m|x)\right\|^2\right] + \eta\,\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right]. \tag{8}$$

Observe that $\log p_\gamma(m|x)$ is a concave quadratic with smoothness proportional to $1/\gamma^2$; it follows analytically that $\mathbb{E}\left[\left\|\nabla_x \log p_\gamma(m|x)\right\|^2\right] \propto 1/\gamma^2$. Song & Ermon (2019) found empirically that $\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right] \propto 1/\sigma^2$ for the NCSN model; we observe similar behavior for the flow-based Glow model (Kingma & Dhariwal, 2018), and in Section 3.3 we propose a possible explanation for this behavior. Therefore, to maintain a constant SNR, it suffices to set both γ² and σ² proportional to η.

Figure 2. The behavior of $\sigma\left\|\nabla_x \log p_\sigma(x)\right\|$ in expectation for the NCSN (orange) and Glow (blue) models trained on CIFAR-10, at each of 10 noise levels as σ decays geometrically from 1.0 to 0.01. For large σ, $\left\|\nabla_x \log p_\sigma(x)\right\| \approx 50/\sigma$. This proportional relationship breaks down for smaller σ. Because the expected gradient of the noiseless density $\log p(x)$ is finite, its product with σ must asymptotically approach zero as $\sigma \to 0$.

### 3.3. The gradients of the noisy prior

We remark that the empirical finding $\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right] \propto 1/\sigma^2$ discussed in Section 3.2, and the consistency of this observation across models and datasets, could be surprising. Gradients of the noisy densities $p_\sigma$ can be described by convolution of $p$ with a Gaussian kernel:

$$\nabla_x \log p_\sigma(x) = \nabla_x \log \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[p(x - \sigma\epsilon)\right]. \tag{9}$$

From this expression, assuming $p$ is continuous, we clearly see that the gradients are asymptotically independent of σ:

$$\lim_{\sigma \to 0} \nabla_x \log p_\sigma(x) = \nabla_x \log p(x). \tag{10}$$

Maintaining the proportionality $\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right] \propto 1/\sigma^2$ requires the gradients to grow unbounded as $\sigma \to 0$, but the gradients of the noiseless distribution $\log p(x)$ are finite. Therefore, the proportionality must break down asymptotically, and we conclude that even though we turn the noise σ² down to visually imperceptible levels, we have not reached the asymptotic regime.

We conjecture that the proportionality between the gradients and the noise is a consequence of severe non-smoothness in the noiseless model $p(x)$. The probability mass of this distribution is peaked around plausible images $x$, and decays rapidly away from these points in most directions. Consider the extreme case where the prior has a Dirac delta point mass. The convolution of a Dirac delta with a Gaussian is itself Gaussian, so, near the point mass, the noisy distribution $p_\sigma$ will be proportional to a Gaussian density with variance σ². If $p_\sigma$ were exactly Gaussian, then analytically

$$\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right] = \frac{d}{\sigma^2}, \tag{11}$$

where $d$ is the dimensionality of $x$. Because the distribution $p(x)$ does not contain actual delta spikes, only approximations thereof, we would expect this proportionality to eventually break down as $\sigma \to 0$. Indeed, Figure 2 shows that for both NCSN and Glow models of CIFAR-10, after maintaining a very consistent proportionality $\mathbb{E}\left[\left\|\nabla_x \log p_\sigma(x)\right\|^2\right] \propto 1/\sigma^2$ at the higher noise levels, the decay of σ² to zero eventually outpaces the growth of the gradients.
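The measurement behind Figure 2 is straightforward to reproduce for any score model; a sketch (ours), assuming the same hypothetical `score` interface as above:

```python
import torch

# Geometric schedule of Section 3.4: sigma_1 = 1.0 -> sigma_L = 0.01, L = 10.
sigmas = torch.logspace(0, -2, steps=10)

def gradient_noise_profile(score, x, sigmas):
    """Track sigma * E||grad_x log p_sigma(x)||: roughly constant while the
    1/sigma proportionality of Figure 2 holds, decaying once it breaks down."""
    profile = []
    for sigma in sigmas:
        norms = score(x, sigma.item()).flatten(1).norm(dim=1)  # per-example norms
        profile.append((sigma.item(), (sigma * norms).mean().item()))
    return profile
```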
Figure 3. Non-stochastic gradient ascent produces sub-par results (panels: original images; simple gradient; gradient ascent with noise conditioning; Langevin dynamics with noise conditioning). Annealing over smoothed-out distributions (noise conditioning) guides the optimization towards likely regions of pixel space, but gets stuck at sub-optimal solutions. Adding Gaussian noise to the gradients (Langevin dynamics) shakes the optimization trajectory out of bad local optima.

### 3.4. Hyper-parameter settings

We adopt the hyper-parameters proposed by Song & Ermon (2019) for annealing σ², the proportionality constant δ, and the iteration count T. The noise σ is geometrically annealed from $\sigma_1 = 1.0$ to $\sigma_L = 0.01$ with $L = 10$. We set $\delta = 2 \times 10^{-5}$ and $T = 100$. We find that the same proportionality constant between σ² and η also works well for γ² and η, allowing us to set γ² = σ². We use these hyper-parameters for both the NCSN and Glow models, applied to each of the three datasets: MNIST, CIFAR-10, and LSUN.

### 3.5. Constructing noise-conditioned models

For noise-conditioned score networks, we can directly compute $\nabla_x \log p_\sigma(x)$ by evaluating the score network at the desired noise level. For generative flow models like Glow, these noisy distributions are not directly accessible. We could estimate the distributions $p_\sigma(x)$ by training Glow from scratch on datasets perturbed by each of the required noise levels σ². But this is not practical; Glow is expensive to train, requiring thousands of epochs to converge and consuming hundreds of GPU-hours to obtain good models even for small, low-resolution datasets.

Instead of training models $p_\sigma(x)$ from scratch, we apply the concept of fine-tuning from transfer learning (Yosinski et al., 2014). Using pre-trained models of $p(x)$ published by the Glow authors, we fine-tune these models on noise-perturbed data $x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Empirically, this procedure quickly converges to an estimate of $p_\sigma(x)$, within about 10 epochs (a sketch of the procedure appears at the end of this section).

### 3.6. The importance of stochasticity

We remark that adding Gaussian noise to the gradients in the BASIS algorithm is essential. If we set aside the Bayesian perspective, it is tempting to simply run gradient ascent on the pixels of the components to maximize the likelihood of these components under the prior, with a Lagrangian term to enforce the mixture constraint $g(x) = m$:

$$x \leftarrow x + \eta\,\nabla_x\!\left[\log p(x) - \lambda\left\|g(x) - m\right\|^2\right]. \tag{12}$$

But this does not work. As demonstrated in Figure 3, there are many local optima in the loss surface of $p(x)$, and a greedy ascent procedure simply gets stuck. Pragmatically, the noise term in Langevin dynamics can be seen as a way to knock the greedy optimization (12) out of local maxima.

In the recent literature, pixel-space optimizations that follow gradients $\nabla_x$ of some objective are perhaps associated more with adversarial examples than with desirable results (Goodfellow et al., 2015; Nguyen et al., 2015). We note that there have been some successes of pixel-wise optimization in texture synthesis (Gatys et al., 2015) and style transfer (Gatys et al., 2016). But broadly speaking, pixel-space optimization procedures often seem to go wrong. We speculate that noisy optimizations (6) on smoothed-out objectives like $p_\sigma$ could be a widely applicable method for making pixel-space optimizations more robust.
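The fine-tuning procedure of Section 3.5 amounts to a standard maximum-likelihood loop on noise-perturbed batches; a sketch (ours), where `model.log_prob` is an assumed interface for the flow's exact log-density rather than Glow's actual API:

```python
import torch

def finetune_noisy_prior(model, optimizer, loader, sigma, epochs=10):
    """Fine-tune a pre-trained flow toward the smoothed density p_sigma by
    maximum likelihood on noise-perturbed batches (Section 3.5)."""
    for _ in range(epochs):
        for x, _ in loader:                            # loader yields (image, label)
            x_noisy = x + sigma * torch.randn_like(x)  # x + eps, eps ~ N(0, sigma^2 I)
            loss = -model.log_prob(x_noisy).mean()     # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```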
## 4. Evaluation Methodology

Many previous works on source separation evaluate their results using peak signal-to-noise ratio (PSNR) or the structural similarity index (SSIM) (Wang et al., 2004). These metrics assume that the original sources are identifiable; in probabilistic terms, the true posterior distribution $p(x|m)$ is presumed to have a unique global maximum achieved by the ground truth sources (up to permutation of the sources). Under the identifiability assumption, it is reasonable to measure the quality of a separation algorithm by comparing separated sources to ground truth mixture components. PSNR, for example, evaluates separations by computing the mean-squared distance between pixel values of the ground truth and separated sources on a logarithmic scale.

For CIFAR-10 source separation, the ground truth source components of a mixture are not identifiable. As evidence for this claim, we call the reader's attention to Figure 4. For each mixture depicted in Figure 4, we present separation results that sum to the mixture and (to our eyes) look plausibly like CIFAR-10 images. However, in each case the separated images exhibit high deviation from the ground truth. This phenomenon is not unusual; Figure 5 shows an un-curated collection of samples from $p(x|m)$ using BASIS, illustrating a variety of plausible separation results for each given mixture. We will later see evidence of non-identifiability again in Figure 7.

If we accept that the separations presented in Figures 4, 5, and 7 are reasonable, then source separation on this dataset is fundamentally underdetermined; we cannot measure success using metrics like PSNR that compare separation results to ground truth. Instead, we propose to quantify the extent to which the results of a source separation algorithm look like samples from the data distribution. If a pair of images sum to the given mixture and look like samples from the data distribution, we deem the separation a success. This shift in perspective, from identifiability of the latent components to the quality of the separated components, is analogous to the classical distinction in the statistical literature between estimation and prediction (Shmueli et al., 2010; Bellec et al., 2018).

To this end, we borrow the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) metrics from the generative modeling literature to evaluate CIFAR-10 separation results. These metrics attempt to quantify the similarity between two distributions given samples. We use them to compare the distribution of components produced by a separation algorithm to the distribution of ground truth images.

In contrast to CIFAR-10, the posterior distribution $p(x|m)$ for an MNIST model is demonstrably peaked. Moreover, BASIS is able to consistently identify these peaks. This constitutes a constructive proof that components of MNIST mixtures are identifiable, and therefore comparisons to the ground-truth components make sense. We report PSNR results for MNIST, which allows us to compare the results of BASIS to other recent work on MNIST image separation (Halperin et al., 2019; Kong et al., 2019).

Figure 4. A curated collection of examples demonstrating color and structural ambiguities in CIFAR-10 mixtures (columns: original, mixture, separated). In each case, the original components differ substantially from the components separated by BASIS using NCSN as a prior. But in each case, the separation results also look like plausible CIFAR-10 images.
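For reference, FID reduces to a closed-form Fréchet distance between Gaussian fits of Inception-v3 activations. A minimal sketch of that final computation (activation extraction omitted; our illustration, not the evaluation code behind the tables reported below):

```python
import numpy as np
from scipy import linalg

def fid(acts_real: np.ndarray, acts_fake: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of Inception activations
    (rows are examples): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = acts_real.mean(axis=0), acts_fake.mean(axis=0)
    s1 = np.cov(acts_real, rowvar=False)
    s2 = np.cov(acts_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))
```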
## 5. Experiments

We evaluate results of BASIS on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and LSUN (Yu et al., 2015). For MNIST and CIFAR-10, we consider both NCSN (Song & Ermon, 2019) and Glow (Kingma & Dhariwal, 2018) models as priors, using pre-trained weights published by the authors of these models. For LSUN there is no pre-trained NCSN model, so we consider results only with Glow. For Glow, we fine-tune the weights of the pre-trained models to construct noisy models $p_\sigma$ using the procedure described in Section 3.5. Code and instructions for reproducing these experiments are available online at https://github.com/jthickstun/basis-separation.

**Baselines.** On MNIST we compare to results reported for the GAN-based S-D method (Kong et al., 2019) and the fully supervised version of Neural Egg separation (NES) (Halperin et al., 2019). Results for MNIST are presented in Section 5.1. To the best of our knowledge there are no previously reported quantitative metrics for CIFAR-10 separation, so as a baseline we ran Neural Egg separation on CIFAR-10 using the authors' published code. CIFAR-10 results are presented in Section 5.2. We present additional qualitative results for 64×64 LSUN in Section 5.3, which demonstrate that BASIS scales to larger images.

We also consider results for a simple baseline, Average, that separates a mixture $m$ into two 50% masks $x_1 = x_2 = m/2$ (sketched in code below). This is a surprisingly competitive baseline. Observe that if we had no prior information about the distribution of components, and we measure separation quality by PSNR, then by a symmetry argument setting $x_1 = x_2$ is the optimal separation strategy in expectation. In principle we would expect Average to perform very poorly under IS/FID, because these metrics purport to measure similarity of distributions, and mixtures should have little or no support under the data distribution. But we find that IS and FID both assign reasonably good scores to Average, presumably because mixtures exhibit many features that are well supported by the data distribution. This speaks to well-known difficulties in evaluating generative models (Theis et al., 2016) and could explain the strength of Average as a baseline.

Figure 5. Repeated sampling using BASIS with NCSN as a prior for several mixtures of CIFAR-10 images (columns: original, mixture, resampled separations). While most separations look reasonable, variation in color and lighting makes comparative metrics like PSNR unreliable. This challenges the notion that the ground truth components are identifiable.

We remark that we cannot compare our algorithm to the separation-like task reported for Capsule Networks (Sabour et al., 2017). The segmentation task discussed in that work is similar to source separation, but the mixtures used for the segmentation task are constructed using a non-linear threshold function $h(x) = \min(x_1 + x_2, 1)$, in contrast to our linear function $g$. While extending the techniques of this paper to non-linear relationships between $x$ and $m$ is intriguing, we leave this to future work.
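The Average baseline mentioned above, and the PSNR metric it competes under, are each a few lines; a sketch (ours), with the [0, 1] pixel range as an assumption:

```python
import torch

def average_baseline(m: torch.Tensor):
    """The Average baseline: split a two-source mixture m = x1 + x2
    into the two 50% masks x1 = x2 = m / 2."""
    return m / 2, m / 2

def psnr(x: torch.Tensor, x_hat: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB between a ground truth component x
    and a separated component x_hat (pixel range [0, max_val] assumed)."""
    mse = torch.mean((x - x_hat) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```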
**Class-conditional separation.** The Neural Egg separation algorithm is designed with the assumption that the components $x_i$ are drawn from different distributions. For quantitative results on MNIST and CIFAR-10, we therefore consider two slightly different tasks. The first is class-agnostic, where we construct mixtures by summing randomly selected images from the test set. The second is class-conditional, where we partition the test set into two groupings: digits 0–4 and 5–9 for MNIST, and animals and machines for CIFAR-10. The former task allows us to compare to S-D results on MNIST, and the latter task allows us to compare to Neural Egg separation on MNIST and CIFAR-10.

There are two different ways to apply a prior for class-conditional separation. First observe that, because $x_1$ and $x_2$ are chosen independently,

$$p(x) = p(x_1, x_2) = p_1(x_1)\,p_2(x_2). \tag{13}$$

In the class-agnostic setting, $x_1$ and $x_2$ are drawn from the same distribution (the empirical distribution of the test set), so it makes sense to use a single prior $p = p_1 = p_2$. In the class-conditional setting, we could potentially use separate priors over components $x_1$ and $x_2$. For the MNIST and CIFAR-10 experiments in this paper, we use pre-trained models trained on the unconditional distribution of the training data for both the class-agnostic and class-conditional settings. It is possible that better results could be achieved in the class-conditional setting by re-training the models on class-conditional training data. For LSUN, the authors of Glow provide separate pre-trained models for the Church and Bedroom categories, so we are able to demonstrate class-conditional LSUN separations using distinct priors in Section 5.3.

**Sample likelihoods.** Although we do not directly model the posterior likelihood $p(x|m)$, we can compute the log-likelihood of the output samples $x$. The log-likelihood is a function of the artificial variance hyper-parameter γ, so it is more informative to look at the unweighted squared error $\|m - g(x)\|^2$; this quantity can be interpreted as a reconstruction error, and measures how well we approximate the hard mixture constraint. Because we geometrically anneal the variance γ, by the end of optimization the mixture constraint is rigorously enforced; per-pixel reconstruction error is smaller than the quantization level of 8-bit color, resulting in pixel-perfect visual reconstructions.

For Glow, we can also compute the log-probability of samples under the prior. How do the probabilities of sources $x_{\mathrm{BASIS}}$ constructed by BASIS separation compare to the probabilities of data $x_{\mathrm{test}}$ taken directly from a dataset's test set? Because we anneal the noise to a fixed level $\sigma_L > 0$, we find it most informative to ask this question using the minimal-noise, fine-tuned prior $p_{\sigma_L}(x)$. As seen in Table 1, the outputs of BASIS separation are generally comparable in log-likelihood to test set images; BASIS separation recovers sources deemed typical by the prior.

Table 1. The mean log-likelihood under the minimal-noise Glow prior $p_{\sigma_L}(x)$ for the test set $x_{\mathrm{test}}$, and for samples of 100 BASIS separations $x_{\mathrm{BASIS}}$. The log-likelihood of each test set under the noiseless prior $p(x_{\mathrm{test}})$ is reported for reference.

| Dataset | $p(x_{\mathrm{test}})$ | $p_{\sigma_L}(x_{\mathrm{test}})$ | $p_{\sigma_L}(x_{\mathrm{BASIS}})$ |
|---|---|---|---|
| MNIST | 0.5 | 3.6 | 3.6 |
| LSUN (bedroom) | 2.4 | 4.2 | 4.4 |
| LSUN (church) | 2.7 | 4.4 | 4.4 |
| CIFAR-10 | 3.4 | 4.5 | 4.7 |

Figure 6. The empirical distribution of PSNR for 5,000 class-agnostic MNIST digit separations using BASIS with the NCSN prior (see Table 2 for a comparison of the central tendencies of this and other separation methods).

### 5.1. MNIST separation

Quantitative results for MNIST image separation are reported in Table 2, and a panel of visual separation results is presented in Figure 1. For quantitative results, we report mean PSNR over 12,000 separated components.
The distribution of PSNR for class-agnostic MNIST separation is visualized in Figure 6. We observe that approximately 2/3 of results exceed the mean PSNR of 29.5, which to our eyes is visually indistinguishable from ground truth.

A natural approach to improving separation performance is to sample multiple $x \sim p(\cdot\,|m)$ for a given mixture $m$. A major advantage of models like Glow, which explicitly parameterize the prior $p(x)$, is that we can approximate the maximum of the posterior distribution with the maximum over multiple samples. By construction, samples from BASIS approximately satisfy $g(x) = m$, so for the noiseless model we simply declare $p(m|x) = 1$ and therefore $p(x|m) \propto p(x)$. We demonstrate the effectiveness of resampling in Table 2 (Glow, 10x) by comparing the expected PSNR of $x \sim p(\cdot\,|m)$ to the expected PSNR of $\arg\max_i p(x_i)$ over 10 samples $x_1, \dots, x_{10} \sim p(\cdot\,|m)$. Even moderate resampling dramatically improves separation performance. Unfortunately, this approach cannot be applied to the otherwise superior NCSN model, which does not model explicit likelihoods $p(x)$.

Table 2. PSNR results for separating 6,000 pairs of equally mixed MNIST images. For class split results, one image comes from labels 0–4 and the other from labels 5–9. We compare to S-D (Kong et al., 2019), NES (Halperin et al., 2019), convolutional NMF (class split) (Halperin et al., 2019), and standard NMF (class agnostic) (Kong et al., 2019).

| Algorithm | Class Split | Class Agnostic |
|---|---|---|
| Average | 14.8 | 14.9 |
| NMF | 16.0 | 9.4 |
| S-D | - | 18.5 |
| BASIS (Glow) | 22.9 | 22.7 |
| NES | 24.3 | - |
| BASIS (Glow, 10x) | 27.7 | 27.1 |
| BASIS (NCSN) | 29.5 | 29.3 |

Without any modification, we can apply BASIS to separate mixtures of $k > 2$ images. We contrast this with regression-based methods, which require re-training to target varying numbers of components. Figure 1 shows the results of BASIS using the NCSN prior applied to mixtures of four randomly selected images. For more mixture components, we observe that identifiability of ground truth sources begins to break down. This is illustrated by the central item in each panel of Figure 1 (highlighted in orange).

### 5.2. CIFAR-10

Quantitative results for CIFAR-10 image separation are presented in Table 3, and visual separation results are presented in Figure 1.

We can also view image colorization (Levin et al., 2004; Zhang et al., 2016) as a source separation problem, by interpreting a grayscale image as a mixture of the three color channels of an image $x = (x_r, x_g, x_b)$ with

$$g(x) = (x_r + x_g + x_b)/3. \tag{14}$$

Unlike our previous separation problems, the channels of an image are clearly not independent, and the factorization of $p$ given by Equation (13) is unwarranted. But conveniently, a generative model trained on color CIFAR-10 images itself models the joint distribution $p(x) = p(x_r, x_g, x_b)$. Therefore, the same pre-trained generative model that we use to separate images can also be used to color them.

Qualitative colorization results are visualized in Figure 7. The non-identifiability of ground truth is profound for this task (see Section 4 for discussion of identifiability). We draw attention to the two cars in the middle of the panel: the white car that is colored yellow by the algorithm, and the blue car that is colored red. The colors of these specific cars cannot be inferred from a grayscale image; the best an algorithm can do is choose a reasonable color, based on prior information about the colors of cars.
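In code, the colorization "mixture" is simply a channel average; treating the channels as $k = 3$ single-channel components with $\alpha_i = 1/3$, the BASIS sketch from Section 3 applies unchanged (our illustration):

```python
import torch

def grayscale_mixture(x: torch.Tensor) -> torch.Tensor:
    """Equation (14): a grayscale image as the mixture of the three
    color channels, g(x) = (x_r + x_g + x_b) / 3."""
    return x.mean(dim=0, keepdim=True)   # (3, H, W) -> (1, H, W)

# Sampling the posterior over the three channels with the BASIS sketch,
# under the joint prior p(x_r, x_g, x_b) of the same pre-trained CIFAR-10
# model, imputes plausible colors for the grayscale input.
```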
Table 3. Inception Score / FID of 25,000 separations (50,000 separated images) of two overlapping CIFAR-10 images using NCSN as a prior. In the class split setting, one image comes from the category of animals and the other from the category of vehicles. NES results were obtained using published code from Halperin et al. (2019).

| Algorithm | Inception Score | FID |
|---|---|---|
| *Class Split* | | |
| NES | 5.29 ± 0.08 | 51.39 |
| BASIS (Glow) | 5.74 ± 0.05 | 40.21 |
| Average | 6.14 ± 0.11 | 39.49 |
| BASIS (NCSN) | 7.83 ± 0.15 | 29.92 |
| *Class Agnostic* | | |
| BASIS (Glow) | 6.10 ± 0.07 | 37.09 |
| Average | 7.18 ± 0.08 | 28.02 |
| BASIS (NCSN) | 8.29 ± 0.16 | 22.12 |

Figure 7. Colorizing CIFAR-10 images. Left: original CIFAR-10 images. Middle: grayscale conversions of the images on the left. Right: imputed colors for the grayscale images, found by BASIS using NCSN as a prior.

Quantitative coloring results for CIFAR-10 are presented in Table 4. We remark that the IS and FID scores for coloring are substantially better than the IS and FID scores of 8.87 and 25.32, respectively, reported for unconditional samples from the NCSN model; conditioning on a grayscale image is enormously informative. Indeed, the Inception Score of NCSN-colorized CIFAR-10 is close to the Inception Score of the CIFAR-10 dataset itself.

Table 4. Inception Score / FID of 50,000 colorized CIFAR-10 images. As measured by IS/FID, the quality of NCSN colorizations nearly matches CIFAR-10 itself.

| Data Distribution | Inception Score | FID |
|---|---|---|
| Input Grayscale | 8.01 ± 0.10 | 68.52 |
| BASIS (Glow) | 8.69 ± 0.15 | 28.70 |
| BASIS (NCSN) | 10.53 ± 0.17 | 11.58 |
| CIFAR-10 Original | 11.24 ± 0.12 | 0.00 |

### 5.3. LSUN separation

Qualitative results for LSUN separations are visualized in Figure 8. While these separation results are imperfect, Table 1 shows that the mean log-likelihood of the separated components is comparable to the mean log-likelihood that the model assigns to images in the test set. This suggests that the model is incapable of distinguishing these separations from better results, and that the imperfections are attributable to the quality of the Glow model rather than to the BASIS separation algorithm. This is encouraging, because it suggests that better separation results will be achievable with improved generative models.

Figure 8. 64×64 LSUN separation results using Glow as a prior. One mixture component is sampled from the LSUN churches category, and the other component is sampled from LSUN bedrooms.

## 6. Conclusion

In this paper, we introduced a new approach to source separation that makes use of a likelihood-based generative model as a prior. We demonstrated the ability to swap in different generative models for this purpose, presenting results of our algorithm using both NCSN and Glow. We proposed new methodology for evaluating source separation on richer datasets, demonstrating strong performance on MNIST and CIFAR-10. Finally, we presented qualitative results on LSUN that point the way towards scaling this method to practical tasks such as speech separation, using generative audio models like WaveNet (Oord et al., 2016).

## Acknowledgements

We thank Zaid Harchaoui, Sham M. Kakade, Steven Seitz, and Ira Kemelmacher-Shlizerman for valuable discussion and computing resources. This work was supported by the National Science Foundation Grant DGE-1256082.

## References
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Bellec, P. C., Lecué, G., and Tsybakov, A. B. Slope meets Lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.

Benaroya, L., Bimbot, F., and Gribonval, R. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):191–199, 2005.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Davies, M. E. and James, C. J. Source separation using single channel ICA. Signal Processing, 87(8):1819–1832, 2007.

Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, 2017.

Gandelsman, Y., Shocher, A., and Irani, M. Double-DIP: Unsupervised image decomposition via coupled deep-image-priors. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Gatys, L., Ecker, A. S., and Bethge, M. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270, 2015.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

Halperin, T., Ephrat, A., and Hoshen, Y. Neural separation of observed and unobserved distributions. Advances in Neural Information Processing Systems, 2019.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 57–60, 2012.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In International Symposium on Music Information Retrieval, pp. 477–482, 2014.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12):2136–2147, 2015.

Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep U-Net convolutional networks. 2017.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
Kong, Q., Xu, Y., Jackson, P. J. B., Wang, W., and Plumbley, M. D. Single-channel signal separation and deconvolution with generative adversarial networks. In International Joint Conference on Artificial Intelligence, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556–562, 2001.

Lee, T.-W., Lewicki, M. S., Girolami, M., and Sejnowski, T. J. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4):87–90, 1999.

Levin, A., Lischinski, D., and Weiss, Y. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694, 2004.

Lluis, F., Pons, J., and Serra, X. End-to-end music source separation: is it possible in the waveform domain? Interspeech, 2019.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Nishida, S., Nakamura, M., Ikeda, A., and Shibasaki, H. Signal separation of background EEG and spike by using morphological filter. Medical Engineering & Physics, 21(9):601–608, 1999.

Nugraha, A. A., Liutkus, A., and Vincent, E. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652–1664, 2016.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image Transformer. International Conference on Machine Learning, 2018.

Pearlmutter, B. A. and Parra, L. C. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems, pp. 613–619, 1997.

Roweis, S. T. One microphone source separation. In Advances in Neural Information Processing Systems, pp. 793–799, 2001.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.

Schmidt, M. N. and Olsson, R. K. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing, 2006.

Shmueli, G. To explain or to predict? Statistical Science, 25(3):289–310, 2010.

Smaragdis, P. and Venkataramani, S. A neural network alternative to non-negative audio models. In International Conference on Acoustics, Speech and Signal Processing, pp. 86–90, 2017.
Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.

Spiertz, M. and Gnann, V. Source-filter based clustering for monaural blind source separation. In International Conference on Digital Audio Effects, 2009.

Stoller, D., Ewert, S., and Dixon, S. Adversarial semi-supervised audio source separation applied to singing voice extraction. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2391–2395, 2018a.

Stoller, D., Ewert, S., and Dixon, S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. International Symposium on Music Information Retrieval, 2018b.

Subakan, Y. C. and Smaragdis, P. Generative adversarial source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 26–30, 2018.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.

Venkataramani, S., Subakan, C., and Smaragdis, P. Neural network alternatives to convolutive audio models for source separation. In International Workshop on Machine Learning for Signal Processing, pp. 1–6, 2017.

Virtanen, T. Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666, 2016.

Zhang, X., Ng, R., and Chen, Q. Single image reflection separation with perceptual losses. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4786–4794, 2018.