# Neural Separation of Observed and Unobserved Distributions

Tavi Halperin¹, Ariel Ephrat², Yedid Hoshen¹,³

¹Department of Computer Science, The Hebrew University of Jerusalem, Jerusalem, Israel. ²Google Research. ³Facebook AI Research. Correspondence to: Tavi Halperin.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

**Abstract.** Separating mixed distributions is a long-standing challenge for machine learning and signal processing. Most current methods either rely on making strong assumptions on the source distributions, or rely on having training samples of each source in the mixture. In this work, we introduce a new method, Neural Egg Separation, to tackle the scenario of extracting a signal from an unobserved distribution additively mixed with a signal from an observed distribution. Our method iteratively learns to separate the known distribution from progressively finer estimates of the unknown distribution. In some settings Neural Egg Separation is sensitive to initialization; we therefore introduce Latent Mixture Masking, which ensures a good initialization. Extensive experiments on audio and image separation tasks show that our method outperforms current methods that use the same level of supervision, and often achieves similar performance to full supervision.

## 1. Introduction

Humans are remarkably good at separating data coming from a mixture of distributions, e.g. hearing a person speaking at a crowded cocktail party. Artificial intelligence, on the other hand, is far less adept at separating mixed signals. This is an important ability, as signals in nature are typically mixed: speakers are often mixed with other speakers or environmental sounds, and objects in images typically appear alongside other objects as well as the background. Understanding mixed signals is harder than understanding pure sources, making source separation an important research topic. Most previous work focused on the following settings:

**Full supervision:** The learner has access to a training set including samples of mixed signals $\{y\} \subset Y$ as well as the ground-truth sources of the same signals, $\{b\} \subset B$ and $\{x\} \subset X$ (such that $y = x + b$). Having such strong supervision is very potent, allowing the learner to directly learn a mapping from the mixed signal $y$ to its sources $(x, b)$. Obtaining such strong supervision is nearly never possible, as it requires knowing, for each input mixture $y$, its exact separate signals $(x, b)$. This information is rarely available in a single-microphone setting.

**Synthetic full supervision:** The learner has access to a training set containing samples of the mixed signal $\{y\} \subset Y$ as well as samples from all source distributions, $\{b\} \subset B$ and $\{x\} \subset X$. The learner, however, does not have access to paired sets of the mixed and unmixed ground truth (that is, for any given $y$ in the training set, its separate $b$ and $x$ are unknown). This supervision setting is more realistic than the fully supervised case, and occurs when each of the source distributions can be sampled in isolation (e.g. we can record a violin and a piano separately in a studio, and can thus obtain unmixed samples from each of their distributions). It is typically solved by learning to separate synthetic mixtures $b + x$ of independently sampled $b$ and $x$.

**No supervision:** The learner only has access to training samples of the mixed signal $Y$, but not to the sources $B$ and $X$.
Although this setting places the fewest requirements on the training dataset, it is a hard problem and can be poorly specified in the absence of strong assumptions and priors. It is generally necessary to make strong assumptions on the properties of the component signals (e.g. smoothness, low rank, periodicity) in order to make progress in separation, which limits the applicability of such methods.

In this work we concentrate on the semi-supervised setting: unmixing of signals in the case where the mixture $Y$ consists of a signal coming from an unobserved distribution $X$ and another signal from an observed distribution $B$ (i.e. the learner has access to a training set of clean samples $\{b\} \subset B$, along with different mixed samples $\{y\} \subset Y$). One possible way of obtaining such supervision is to label every element of a signal (e.g. a short waveform segment), indicating whether it comes only from the observed distribution $B$ or is a mixture of both distributions, $B + X$. The task is to learn a parametric function able to separate the mixed signal $y \in Y$ into sources $x \in X$ and $b \in B$ such that $y = b + x$. Such supervision is more generally available than full supervision, while the separation problem becomes much simpler than in the fully unsupervised case.

We introduce a novel method, Neural Egg Separation (NES), which iterates over: i) estimation of samples from the unobserved distribution $X$; ii) synthesis of mixed signals from known samples of $B$ and estimated samples of $X$; iii) training of a separation function for the mixed signal. Iterative refinement of the estimated samples of $X$ significantly increases the accuracy of the learned separation function. The method is named Neural Egg Separation, as it is akin to the iterative technique commonly used for separating egg whites and yolks.

As an iterative technique, NES can be sensitive to initialization. We therefore introduce another method, Latent Mixture Masking (LMM), to provide NES with a strong initialization. LMM trains two deep generators end-to-end using Latent Mixtures to model the observed and unobserved sources ($B$ and $X$). We found that a simple initialization is sufficient when $X$ and $B$ are uncorrelated, whereas LMM initialization is most important when $X$ and $B$ are strongly correlated, e.g. when separating music into instruments and vocals. Initialization by LMM was found to be much more effective than initialization by adversarial methods.

Experiments are conducted across multiple domains (image, music, voice), validating the effectiveness of our method and its superiority over current methods that use the same level of supervision. Our semi-supervised method is often competitive with the fully supervised baseline, while making few assumptions on the nature of the component signals and requiring lightweight supervision. An analysis of the assumptions made by the method is detailed in Sec. 5.

## 2. Previous Work

**Source separation:** Separation of mixed signals has been extensively researched. In this work, we focus on single-channel separation. Unsupervised (blind) single-channel methods include Robust Principal Component Analysis (RPCA) (Huang et al., 2012) and single-channel Independent Component Analysis (ICA) (Davies & James, 2007). These methods attempt to use coarse priors about the signals, such as low rank, sparsity or non-Gaussianity. Hidden Markov Models (HMMs) can be used as a temporal prior for longer clips (Roweis, 2001); however, here we do not assume long clips.
Supervised source separation has also been extensively researched. Classic techniques often used learned dictionaries for each source, e.g. Non-negative Matrix Factorization (NMF) (Wilson et al., 2008). Recently, neural-network-based separation has gained popularity, usually learning a regression between the mixed and unmixed signals, either directly (Huang et al., 2014) or by regressing a multiplicative mask (Wang et al., 2014; Yu et al., 2017). Some methods exploit the temporal nature of long audio signals using Recurrent Neural Networks (RNNs) (Mimilakis et al., 2017); in this work we concentrate on the separation of short audio clips and consider this line of work orthogonal to ours.

One related direction is generative adversarial source separation (Stoller et al., 2017; Subakan & Smaragdis, 2017), which uses adversarial training to match the unmixed source distributions. This is needed to deal with correlated sources, for which learning a regressor on synthetic mixtures is less effective. We present an Adversarial Masking (AM) method that tackles the semi-supervised rather than the fully supervised scenario and overcomes mixture-collapse issues not present in the fully supervised case. We found that non-adversarial methods perform better for the initialization task.

The most closely related line of work is semi-supervised audio source separation (Smaragdis et al., 2007; Barker & Virtanen, 2014), which, like ours, attempts to separate mixtures $Y$ given only samples from the distribution of one source $B$. Typically NMF or Probabilistic Latent Component Analysis (PLCA), a similar algorithm with a probabilistic formulation, is used. We show experimentally that our method significantly outperforms NMF. A much earlier related technique is spectral subtraction (Boll, 1979); however, it can only handle very simple unknown sources.

**Disentanglement:** Similarly to source separation, disentanglement also deals with separation, in the sense of creating a disentangled representation of a source signal; however, its aim is to uncover latent factors of variation in the signal, such as style and content or shape and color, e.g. (Denton et al., 2017; Higgins et al., 2016). Differently from disentanglement, our task is to separate the signals themselves rather than their latent representation.

**Generative models:** Generative models learn the distribution of a signal directly. Classical approaches include singular value decomposition (SVD) for general signals and NMF (Lee & Seung, 2001) for non-negative signals. Recently, several deep learning approaches have dominated generative modeling, including the Generative Adversarial Network (GAN) (Goodfellow et al., 2016), the Variational Autoencoder (VAE) (Kingma & Welling, 2013) and Generative Latent Optimization (GLO) (Bojanowski et al., 2018). Adversarial training (for GANs) is rather tricky and often leads to mode collapse. GLO is non-adversarial and allows direct latent optimization for each source, making it more suitable for our purposes than VAE and GAN.

## 3. Neural Egg Separation (NES)

In this section we present our method for separating a mixture of sources of known and unknown distributions. We denote the mixture samples $y$, the corresponding samples from the observed distribution $b$, and the samples from the unobserved distribution $x$. Our objective is to learn a parametric function $T(y)$ such that $\hat{b} = T(y)$.
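The separation function $T$ is implemented as a masking network (see the masking-function paragraph later in this section), with the exact architecture deferred to the paper's appendix. As a rough illustration only, the following sketch assumes a small convolutional mask predictor operating on single-channel images or spectrograms; the class name `MaskSeparator` and all layer sizes are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    """Hypothetical mask-based separator: T(y) = y * m(y), with m(y) in [0, 1].

    The paper's actual architecture is described in its appendix; the layer
    sizes here are illustrative assumptions only.
    """

    def __init__(self, channels=1, width=64):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # keeps the predicted mask in [0, 1]
        )

    def forward(self, y):
        m = self.mask_net(y)  # predicted fraction of y belonging to B
        return y * m          # estimate of the observed source b
```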
**Full Supervision:** In the fully supervised setting, where matching pairs $(y, b)$ are available, this task reduces to a standard supervised regression problem, in which a parametric function $T(y)$ (typically a deep neural network) is used to directly optimize:

$$T = \arg\min_T \sum_{(y, b)} \mathcal{L}_1(T(y), b) \tag{1}$$

Mixed-unmixed pairs are usually unavailable, but in some cases it is possible to obtain a training set which includes independent samples from $X$ and $B$, e.g. (Wang et al., 2014; Yu et al., 2017). Methods typically sample sources $x \in X$ and $b \in B$ at random and synthetically create mixtures $y' = x + b$. The synthetic pairs $(y', b)$ can then be used to optimize Eq. 1. Note that in cases where $X$ and $B$ are correlated (e.g. vocals and instrumental accompaniment, which are temporally dependent), random synthetic mixtures of $x$ and $b$ might not be representative of $Y$ and cause difficulty generalizing to real mixtures.

**Semi-Supervision:** In many scenarios, clean samples of both mixture components are not available. Consider for example a street musical performance. Although crowd noises without street performers can easily be observed, street music without crowd noise is much harder to come by. In this case, therefore, samples from the distribution of crowd noise $B$ are available, whereas samples from the distribution of the music $X$ are unobserved. Samples from the distribution of the mixed signal $Y$, i.e. the crowd noise mixed with the musical performance, are also available. The example above illustrates a class of problems for which the distributions of the mixture and of a single source are available, but the distribution of the other source is unknown. In such cases, it is not possible to optimize Eq. 1 directly, due to the unavailability of paired $(y, b)$.

**Neural Egg Separation:** Fully supervised optimization (as in Eq. 1) is very effective when pairs of $(y, b)$ are available. We present a novel algorithm which iteratively solves the semi-supervised task as a sequence of supervised problems, without any clean training examples of $X$. The core idea of our method is that although no clean samples from $X$ are given, it is still possible to learn to separate mixtures of observed samples $b$ from distribution $B$ combined with estimates $x_t$ of the unobserved distribution samples (where $t$ denotes the NES iteration). Synthetic mixtures are created by randomly sampling an approximate sample $x_t$ from the unobserved distribution and combining it with a training sample $b$, thereby creating pairs $(y_t, b)$ for supervised training:

$$y_t = b + x_t \tag{2}$$

Note that the empirical distribution of synthetic mixtures $Y_t$ might differ from the real mixture sample distribution $Y$. We show empirically that there are interesting cases for which it converges towards the correct distribution: $Y_t \rightarrow Y$.

**Algorithm 1: Neural Egg Separation (NES)**

- Input: samples of the mixture $\{y\}$ and of the observed source $\{b\}$
- Initialize the synthetic unobserved samples with $x_0 = c \cdot y$, or using AM or LMM
- while $t < N$ do
  - Initialize $T(\cdot)$ with random weights
  - Synthesize mixtures $y_t = b + x_t$ for all $b$ in $B$ with randomly sampled $x_t$
  - Optimize the separation function for $P$ epochs: $T_{t+1} = \arg\min_T \sum_{(y_t, b)} \mathcal{L}_1(T(y_t), b)$
  - Update the estimates of the unobserved distribution samples: $x_{t+1} = y - T_{t+1}(y)$
- end while
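Below is a minimal PyTorch sketch of the NES loop of Algorithm 1, assuming the hypothetical `MaskSeparator` above, in-memory tensors for the mixture and observed sets, and the hyperparameters reported in the text ($N = 10$ outer iterations, $P = 25$ epochs, ADAM with learning rate 0.001); mini-batching and the audio/image-specific pre-processing are omitted.

```python
import torch
import torch.nn.functional as F

def neural_egg_separation(y_mix, b_obs, make_model, n_iters=10, epochs=25, c=0.5, lr=1e-3):
    """Sketch of Algorithm 1 (NES).

    y_mix: tensor of training mixtures; b_obs: tensor of observed-source samples;
    make_model: callable returning a fresh separator (e.g. the hypothetical
    MaskSeparator above). Mini-batching and data loading are omitted for brevity.
    """
    # Naive initialization of the unobserved-source estimates: x_0 = c * y.
    x_est = c * y_mix.clone()
    model = make_model()

    for _ in range(n_iters):
        # Re-initialize the separator T with random weights at every NES iteration.
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)

        for _ in range(epochs):
            # Synthesize mixtures y_t = b + x_t by pairing each observed sample
            # with a randomly drawn current estimate of the unobserved source.
            idx = torch.randint(0, x_est.shape[0], (b_obs.shape[0],))
            y_synth = b_obs + x_est[idx]

            # Supervised step (Eq. 3): regress the observed component b from y_t.
            loss = F.l1_loss(model(y_synth), b_obs)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Refinement step (Eq. 4): update the unobserved-source estimates
        # on the real mixtures, x_{t+1} = y - T_{t+1}(y).
        with torch.no_grad():
            x_est = y_mix - model(y_mix)

    return model, x_est
```

In this sketch, `make_model` would typically be `lambda: MaskSeparator()`; at test time, separation applies the final model directly, mirroring Eq. 4.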
During each iteration of NES, a neural separation function $T_{t+1}(y_t)$ is trained on the created pairs by optimizing the following term:

$$T_{t+1} = \arg\min_T \sum_{(y_t, b)} \mathcal{L}_1(T(y_t), b) \tag{3}$$

At the end of each iteration, the separation function $T_t(\cdot)$ can be used to approximately separate the training mixture samples $y$ into their sources:

$$x_t = y - T_t(y) \tag{4}$$

The refined samples $x_t \in X_t$ are used for creating synthetic pairs for training $T_{t+1}(y_t)$ in the next iteration (as in Eq. 3).

The above method relies on having an estimate of the unobserved distribution samples as input to the first iteration ($X_0$). One simple scheme is to initialize the estimates of the unobserved distribution samples in the first iteration as $x_0 = c \cdot y$, where $c$ is a constant fraction (typically 0.5). Although this initialization is very naive, we show that it performs well when the sources are independent. More advanced initializations are discussed below.

At test time, separation is carried out by applying the trained separation function $T(\cdot)$, exactly as in Eq. 4. Our full algorithm is described in Alg. 1. For optimization, we use SGD with the ADAM update rule and a learning rate of 0.001. In total we perform $N = 10$ iterations, each consisting of an optimization of $T$ and an estimation of $x_t$; $P = 25$ epochs are used for each optimization of Eq. 3.

**Latent Mixtures:** NES is an iterative method and relies on having a good initialization. It also does not take into account correlation between $X$ and $B$: vocals and instrumental tracks, for example, are highly related, whereas randomly sampling pairs of vocals and instrumental tracks is likely to synthesize mixtures quite different from $Y$. We present our method Latent Mixtures (LM), which separates mixtures via a distributional constraint enforced through latent generative modeling of the source signals. The method uses latent-optimization ideas from GLO (Bojanowski et al., 2018). The novelty of LM is using mixtures of GLO models for separation, which has not been done before.

LM training consists of two stages. We first learn a generator $G_B(\cdot)$, such that for every observed training sample $b$ from $B$, a latent code $z_b$ can be found for which the generator reconstructs $b$: $b = G_B(z_b)$. We learn end-to-end both the parameters of the generator $G_B(\cdot)$ and a latent code $z_b$ for every training sample $b$. The per-sample latent codes are found by direct gradient descent over the values of $z_b$ (similar to word embeddings), rather than by a feedforward encoder. This stage is equivalent to GLO. The optimization is given by:

$$\arg\min_{z_b, G_B} \sum_{b \in B} \ell(G_B(z_b), b) \tag{5}$$

Given the learned generator $G_B(z)$ for the $B$ distribution, we learn a generator $G_X(z)$ for the unobserved distribution $X$. The idea is that every $Y$-domain training sample $y$ is described by mixing a $B$-domain signal generated by $G_B(z_y^B)$ and an $X$-domain signal generated by $G_X(z_y^X)$. Note that we do not know the actual source $b$; we only have a generative model prior for the $B$ distribution. As we have already learned $G_B(\cdot)$ in the previous stage, we only need to learn $G_X(\cdot)$ as well as the per-sample latent codes $z_y^X$ and $z_y^B$. The optimization is therefore:

$$\arg\min_{z_y^X, z_y^B, G_X} \sum_{y \in Y} \ell(G_B(z_y^B) + G_X(z_y^X), y) \tag{6}$$

Similarly to Bojanowski et al. (2018), we found that forcing the latent codes to lie in the unit ball provides important regularization. We use $\ell = \mathcal{L}_1$, except for color images, where we found it advantageous to use a VGG perceptual loss (implementation taken from Hoshen & Wolf (2018)).
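The two LM training stages (Eqs. 5-6) can be sketched as joint gradient descent over generator weights and per-sample latent codes. The sketch below assumes simple fully connected decoders and a projection of the codes onto the unit ball, following the GLO recipe; the generator architecture, latent dimensionality and step counts are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def project_unit_ball(z):
    # Keep latent codes inside the unit ball, as in GLO-style latent optimization.
    norms = z.norm(dim=1, keepdim=True).clamp(min=1.0)
    return z / norms

def make_generator(z_dim=128, out_dim=64 * 64):
    # Illustrative decoder producing non-negative outputs; architecture is assumed.
    return nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU(),
                         nn.Linear(512, out_dim), nn.ReLU())

def train_lm(b_obs, y_mix, z_dim=128, steps=1000, lr=1e-3):
    """Sketch of Latent Mixtures (LM) training.
    b_obs: (Nb, D) observed-source samples, y_mix: (Ny, D) mixtures, flattened."""
    # Stage 1 (Eq. 5): fit G_B and a per-sample code z_b for the observed source.
    g_b = make_generator(z_dim, b_obs.shape[1])
    z_b = torch.randn(b_obs.shape[0], z_dim, requires_grad=True)
    opt = torch.optim.Adam(list(g_b.parameters()) + [z_b], lr=lr)
    for _ in range(steps):
        loss = F.l1_loss(g_b(project_unit_ball(z_b)), b_obs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2 (Eq. 6): freeze G_B, fit G_X plus per-mixture codes so that
    # G_B(z_y_b) + G_X(z_y_x) reconstructs each mixture y.
    for p in g_b.parameters():
        p.requires_grad_(False)  # Eq. 6 optimizes only G_X and the latent codes
    g_x = make_generator(z_dim, y_mix.shape[1])
    z_y_b = torch.randn(y_mix.shape[0], z_dim, requires_grad=True)
    z_y_x = torch.randn(y_mix.shape[0], z_dim, requires_grad=True)
    opt2 = torch.optim.Adam(list(g_x.parameters()) + [z_y_b, z_y_x], lr=lr)
    for _ in range(steps):
        recon = g_b(project_unit_ball(z_y_b)) + g_x(project_unit_ball(z_y_x))
        loss = F.l1_loss(recon, y_mix)
        opt2.zero_grad()
        loss.backward()
        opt2.step()

    return g_b, g_x, z_y_b, z_y_x
```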
Once $G_B(\cdot)$ and $G_X(\cdot)$ are trained, we infer the latent codes for a test mixture by:

$$\arg\min_{z_y^X, z_y^B} \ell(G_B(z_y^B) + G_X(z_y^X), y) \tag{7}$$

Our estimates for the sources are then:

$$\hat{b} = G_B(z_y^B), \qquad \hat{x} = G_X(z_y^X) \tag{8}$$

**Masking Function:** In additive separation tasks, the mixed signal $y$ is the sum of two positive signals $x$ and $b$. Instead of synthesizing the new sample, we can learn a neural-network separation mask $m(y)$ that specifies the fraction of the signal which comes from $B$ at each pixel. An attractive feature of the mask is that it always lies in the range $[0, 1]$ (in the case of positive additive mixtures of signals). Even a constant mask preserves all signal gradients (at the cost of introducing spurious gradients). Mathematically this can be written as:

$$T(y) = y \odot m(y) \tag{9}$$

For NES (and the baseline AM described below), we implement the mapping function $T(y)$ as the element-wise product of the masking function and the mixture signal, $y \odot m(y)$. In practice, we find that learning a masking function yields much better results than synthesizing the signal directly, in line with other works, e.g. (Wang et al., 2014; Gabbay et al., 2017). LM does not provide a way of learning the mask directly. We refine its estimate by computing an effective mask from the element-wise ratio of the estimated sources. We name the combination of LM and this post-processing masking operation Latent Mixture Masking (LMM):

$$m_{LMM}(y) = \frac{G_B(z_y^B)}{G_B(z_y^B) + G_X(z_y^X)} \tag{10}$$

**Initializing Neural Egg Separation by LMM:** We devise the following procedure: i) train LMM on the training set and infer the mask for each mixture; this is done on images or mel-scale spectrograms at 64×64 resolution; ii) for audio, upsample the mask to the resolution of the high-resolution linear spectrogram and compute an estimate of the $X$-source linear spectrogram on the training set; iii) run NES on the observed $B$ and estimated $X$. We find experimentally that this initialization scheme improves NES to the point of being competitive with fully supervised training.

## 4. Experiments

To evaluate the performance of our method, we conducted experiments on distributions taken from multiple real-world domains: images, speech and music, in cases where the two signals are correlated and where they are uncorrelated. We evaluated our method against three baseline methods:

**Constant Mask (Const):** This baseline uses the original mixture as the estimate.

*Figure 1. A qualitative separation comparison on mixed bag and shoe images. Columns: Const, NMF, AM, LMM, NES, LMM+NES, Sup, GT.*

**Semi-supervised Non-negative Matrix Factorization (SSNMF):** This baseline method, proposed by Smaragdis et al. (2007), first trains a set of $l$ bases on the observed distribution samples $B$ by Sparse NMF (Hoyer, 2004; Kim & Park, 2007). It factorizes $B = H_b W_b$, with activations $H_b$ and bases $W_b$, where all matrices are non-negative. The optimization is solved using the non-negative least squares solver of Kim & Park (2011). It then proceeds to train another factorization of the mixture $Y$ training samples with $2l$ bases, where the first $l$ bases ($W_b$) are fixed to those computed in the previous stage: $Y = H_y^b W_b + H_y^x W_x$. The separated sources are then $\hat{x} = H_y^x W_x$ and $\hat{b} = H_y^b W_b$.

**Adversarial Masking (AM):** As an additional contribution, we introduce a new adversarial semi-supervised method, to improve over the shallow NMF baseline.
AM trains a masking function $m(\cdot)$ so that, after masking, the training mixtures are indistinguishable from the distribution of source $B$ under an adversarial discriminator $D(\cdot)$. The loss functions (using LS-GAN (Mao et al., 2017)) are given by:

$$D = \arg\min_D \sum_{y \in Y} D(y \odot m(y))^2 + \sum_{b \in B} (D(b) - 1)^2 \tag{11}$$

$$m = \arg\min_m \sum_{y \in Y} (D(y \odot m(y)) - 1)^2 \tag{12}$$

Differently from CycleGAN (Zhu et al., 2017) and DiscoGAN (Kim et al., 2017), AM is not bidirectional and cannot use cycle constraints. We have found that adding a magnitude prior $\mathcal{L}_1(m(y), 1)$ improves performance and helps prevent collapse. To partially alleviate mode collapse, we use spectral normalization (Miyato et al., 2018) on the discriminator.

We evaluated our proposed methods:

**Latent Mixture Masking (LMM):** LMM on mel-spectrograms or images at 64×64 resolution.

**Neural Egg Separation (NES):** The NES method detailed in Sec. 3, initializing the $X$ estimates using a constant (0.5) mask over the $Y$ training samples.

**Initialization by Another Method (AM+NES and LMM+NES):** Initializing NES with the $X$ estimates obtained by Adversarial Masking or by Latent Mixture Masking.

To upper-bound the performance of our method, we also compute a fully supervised baseline, for which paired data $(y = x + b, b)$ with $b \in B$, $x \in X$ and $y \in Y$ is available. We train a masking function with the same architecture as used by all other regression methods to directly regress synthetic mixtures to unmixed sources. This method uses more supervision than ours and serves as an upper bound. Please see the appendix for implementation details.

### 4.1. Separating Mixed Images

**MNIST:** We evaluate our method on image separation using the following experimental protocol. We split the MNIST dataset (LeCun & Cortes, 2010) into two classes, the first consisting of the digits 0-4 and the second consisting of the digits 5-9. We conduct experiments where one source has an observed distribution $B$ while the other source has an unobserved distribution $X$. We use 12k $B$ training images as the $B$ training set, while for each of the other 12k $B$ training images, we randomly sample an $X$ image and additively combine the two images to create the $Y$ training set. We evaluate the performance of our method on 5000 $Y$ images similarly created from the test sets of $X$ and $B$. The experiment was repeated in both directions, i.e. with 0-4 as $B$ and 5-9 as $X$, as well as with 0-4 as $X$ and 5-9 as $B$.

In Tab. 1 we report our results on this task. For each experiment we report peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the $X$ test set. Due to the simplicity of the dataset, NMF achieved reasonable performance. LMM achieves better SSIM but worse PSNR than NMF, while AM performed 1-2 dB better. NES achieves much stronger performance than all other methods, about 1 dB below the fully supervised performance. Initializing NES with the masks obtained by LMM results in performance similar to the fully supervised upper bound. Initialization by AM achieved similar but slightly inferior performance to initialization by LMM and was omitted from the table for clarity.

**Bags and Shoes:** To evaluate our method on more realistic images, we evaluate it on separating mixtures consisting of pairs of images sampled from the Handbags (Zhu et al., 2016) and Shoes (Yu & Grauman, 2014) datasets, which are commonly used for the evaluation of conditional image generation methods.
To create each $Y$ mixture image, we randomly sample a shoe image from the Shoes dataset and a handbag image from the Handbags dataset and sum them. For the observed distribution, we sample another 5000 different images from a single dataset. We evaluate our method both when the $X$ class is Shoes and when it is Handbags.

From the results in Tab. 1, we observe that NMF failed to preserve fine details, penalizing its performance metrics. LMM (which used a VGG perceptual loss) performed much better, due to its greater expressiveness. AM performance was similar to LMM on this task, as the perceptual loss and the stability of non-adversarial training helped LMM greatly. NES performed much better than all other methods, even when initialized from a constant mask. Initialization by LMM helped NES achieve stronger performance, nearly identical to the fully supervised upper bound. It performed better than initialization by AM (not shown in the table), which achieved 22.5/0.85 and 22.7/0.86. Similar conclusions can be drawn from the qualitative comparison in Fig. 1.

We also tested our method on a standard denoising task, where the observed source is clean images and the noise is unobserved. We use positively clamped Gaussian noise with σ = 0.1. For GLO / ours / supervised, we obtained PSNR 24.4/28.5/28.5 and SSIM 0.76/0.88/0.88 on the Bags dataset, and PSNR 25.2/29.4/29.4 and SSIM 0.83/0.90/0.90 on the Shoes dataset. Our method seems to be well suited for denoising.

### 4.2. Separating Speech and Environmental Noise

Separating environmental noise from speech is a long-standing problem in signal processing. Although supervision for both human speech and natural noise can generally be obtained, we use this task as a benchmark to evaluate our method's performance on audio signals where $X$ and $B$ are not dependent. This benchmark is a proxy for tasks for which a clean training set of sounds from $X$ cannot be obtained, e.g. animal sounds in the wild, where background sounds without animal noises can easily be recorded, but clean sounds made by the animal without background sounds are unlikely to be available.

For our experiments, we use the Oxford-BBC Lip Reading in the Wild (LRW) dataset (Chung & Zisserman, 2016) for speech. For noise we use audio segments from ESC-50 (Piczak, 2015), a dataset of environmental audio recordings organized into 50 semantic classes. A detailed description of our audio pre-processing and mask training can be found in the appendix. Separation quality is measured by the signal-to-distortion ratio (SDR), computed using the BSS Eval toolbox (Vincent et al., 2006; Stöter et al., 2018).

From the speech results in Tab. 2, we observe that LMM performed similarly to semi-supervised NMF, and AM training performed about 3 dB better than LMM. Due to the independence between the sources in this task, NES performed well even when trained from a constant mask initialization, and its performance was very close to the fully supervised result (when speech is unobserved). In this setting, initializing NES with the speech estimates obtained by LMM (or AM) did not yield improved performance.

We present in Fig. 2 the results of the different methods on a mixture from the speech dataset. It can be observed that LM captures the general features of the sources but is not able to capture fine detail exactly. The masking operation in LMM helps it recover more fine-grained detail and results in much cleaner separations.
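The reported numbers are median SDR values computed with the BSS Eval toolbox (Vincent et al., 2006; Stöter et al., 2018). As a stand-in illustration of the metric (not the authors' evaluation script), the snippet below computes the same quantity with the `mir_eval` package.

```python
import numpy as np
import mir_eval

def median_sdr(references, estimates):
    """references, estimates: lists of (x, b) waveform pairs, each a 1-D numpy array.
    Returns the median SDR (dB) of the unobserved source x across the test set."""
    sdrs = []
    for (x_ref, b_ref), (x_est, b_est) in zip(references, estimates):
        ref = np.stack([x_ref, b_ref])   # shape (n_sources, n_samples)
        est = np.stack([x_est, b_est])
        # Keep the given source ordering (no permutation search).
        sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
            ref, est, compute_permutation=False)
        sdrs.append(sdr[0])              # SDR of the x (unobserved) source
    return float(np.median(sdrs))
```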
We observe that NES converges quite quickly, and results improve further with increasing iterations. Quantitative SDR results are in line with this finding; a graph of SDR for different NES(k) is presented in the appendix.

### 4.3. Music Separation

Separating music into singing voice and instrumental music, as well as separating drums from instrumental music, are standard tasks in the signal processing community. Here our objective is to understand the behavior of our method in settings where $X$ and $B$ are dependent. We use the MUSDB18 dataset (Rafii et al., 2017), which provides, for each music track, separate signal streams of the mixture, drums, bass, the rest of the accompaniment, and the vocals. We convert the audio tracks to mono, resample to 20480 Hz, and then follow the same procedure as for speech to obtain the input audio features.

From the music results in Tab. 2, we observe that NMF was the worst performer in this setting (as its simple bases do not generalize well between songs). LMM did much better than NMF and was even competitive with NES on vocal-instrumental separation. Due to the dependence between the two sources and the low SNR, initialization proved important for NES: constant-initialization NES performed similarly to AM and LMM, whereas NES initialized by LMM masks performed much better than all other methods and was competitive with the supervised baseline. LMM initialization was better than AM initialization.

Table 1. Image separation accuracy (PSNR dB / SSIM).

| X | B | Const | NMF | AM | LMM | NES | LMM+NES | Supervised |
|---|---|---|---|---|---|---|---|---|
| 0-4 | 5-9 | 10.6/0.65 | 16.5/0.71 | 17.8/0.83 | 15.1/0.76 | 23.4/0.95 | 23.9/0.95 | 24.1/0.96 |
| 5-9 | 0-4 | 10.8/0.65 | 15.5/0.66 | 18.2/0.84 | 15.3/0.79 | 23.4/0.95 | 23.8/0.95 | 24.4/0.96 |
| Bags | Shoes | 6.9/0.48 | 13.9/0.48 | 15.5/0.67 | 15.1/0.66 | 22.3/0.85 | 22.7/0.86 | 22.9/0.86 |
| Shoes | Bags | 10.8/0.65 | 11.8/0.51 | 16.2/0.65 | 14.8/0.65 | 22.4/0.85 | 22.8/0.86 | 22.8/0.86 |

*Figure 2. A qualitative comparison of mixtures of speech and noise (top and middle rows, respectively) separated by LM and LMM, as well as NES after k iterations; NES(k) denotes NES after k iterations. Note that LM and LMM share the same mask (bottom row), since LMM is generated by the mask computed from LM. Columns: Mix, LM, LMM, NES(1), NES(3), NES(5), GT.*

## 5. Understanding the Limits of NES

This section investigates a few scenarios under which NES is expected to converge.

**Optimal Masking:** At each iteration, we solve the following optimization problem: $\mathcal{L}^{t+1}_{NES} = \mathcal{L}_1(m_{t+1}(y_t) \odot y_t, b)$, where $t$ is the iteration number and $y_t$ is a synthetic mixture consisting of the sum of a random observed sample $b \in B$ and an estimated sample $x_t \in X_t$ ($y_t = b + x_t$). At iteration $t$, the optimal mask $m_{t+1}(\cdot)$ is:

$$m_{t+1}(y_t) = \frac{b}{y_t} \tag{13}$$

There are several requirements for this optimization to work: i) similarly to other learning-based source separation models, we assume that every mixture $y$ has a unique decomposition into separate sources $x \in X$ and $b \in B$; this means that the sources need to have distinct forms. This assumption is not exactly satisfied in practice, but there are cases where it is a good approximation. ii) We assume that the optimization method is able to find the optimal solution despite the non-convexity of the network, and that the network is sufficiently large to fit the data. These requirements are shared by most other deep learning works.

**Generalization from $Y_t$ to $Y$:** The objective of source separation (in our formulation) is to learn the mask yielding $m(y) = \frac{b}{y}$ for every $y \in Y$.
NES, however, is only able to operate on the approximate distribution $Y_t$, where $t$ is the iteration number. To achieve the supervised performance, NES attempts to progressively improve its estimate: $Y_t \rightarrow Y$. At each iteration, the approximation of $y$ is updated using the most recent mask: $y_{t+1} = b + (y - m_{t+1}(y) \odot y)$. (This approximation is not actually known to us, as the particular $b$ component of $y$ is unknown.) Convergence of the distribution can be measured by the difference between the estimate $y_t$ and the correct sample $y$. The absolute error $|e_{t+1}(y)|$ is defined below (in this discussion all operations are element-wise):

$$|e_{t+1}| = |y_{t+1} - y| = |b - m_{t+1}(y) \odot y| \tag{14}$$

As $m_{t+1}$ was trained on $Y_t$ rather than $Y$, we do not know a priori how it generalizes to the true $Y$ distribution. Let us consider several scenarios:

**Perfect Generalization:** If $m_{t+1}$, trained on $Y_t$, generalizes perfectly to all $y \in Y$, then $m_{t+1}(y) = \frac{b}{y}$. In this case, $|e_{t+1}(y)| = |b - \frac{b}{y} \cdot y| = 0$. A single iteration is therefore sufficient for convergence.

**Locally Invariant $m_{t+1}$:** Instead of assuming perfect generalization, we consider the case where $m_{t+1}$ is locally invariant around the $y_t$ values. The assumption is that $m_{t+1}(y) \approx m_{t+1}(y_t) = \frac{b}{y_t}$. The error becomes:

$$|e_{t+1}| = |b - m_{t+1}(y_t) \odot y| = \left|b - \frac{b}{y_t} \cdot y\right| = \left|\frac{b}{y_t}\right| \cdot |y_t - y| \tag{15}$$

We finally obtain:

$$|e_{t+1}| = \left|\frac{b}{y_t}\right| \cdot |e_t(y)| \tag{16}$$

In this case, for $b < y_t$ (or equivalently $x_t > 0$), we obtain $|e_{t+1}| < |e_t(y)|$. Under these conditions the error decreases for non-zero estimates of the unobserved signal.

**Slowly Varying $m_{t+1}$:** In the general case, where $m_{t+1}$ is not locally invariant around $y_t$, let us assume that $m_{t+1}$ changes slowly enough so that there exists a constant $\lambda$ satisfying:

$$|b - m_{t+1}(y) \odot y| \leq \lambda \cdot |b - m_{t+1}(y_t) \odot y| \tag{17}$$

It is possible to view $\lambda$ as a measure of generalization: an $m_{t+1}$ with better generalization properties has a lower value of $\lambda$. Perfect generalization is recovered with $\lambda = 0$, and an invariant $m_{t+1}$ is obtained with $\lambda = 1$. For general $\lambda$, the error is at most larger than Eq. 16 by a factor of $\lambda$:

$$|e_{t+1}| \leq \lambda \cdot \left|\frac{b}{y_t}\right| \cdot |e_t| \tag{18}$$

We can immediately see that convergence will occur for elements satisfying $\left|\frac{b}{y_t}\right| < \frac{1}{\lambda}$. For increasing values of $\lambda$, i.e. decreasing generalization, only larger estimated values of $x_t$ (relative to $b$) will achieve a decreased error. A good initialization of $m_0$ improves its generalization ability, decreasing its value of $\lambda$; lower $\lambda$ values increase the radius of convergence. This may explain the improved performance of better initializations.

Table 2. Audio separation accuracy (median SDR, dB).

| X | B | Const | NMF | AM | LMM | NES | AM+NES | LMM+NES | Supervised |
|---|---|---|---|---|---|---|---|---|---|
| Vocals | Instrumental | 0.0 | 0.0 | 0.0 | 0.6 | 0.3 | 1.2 | 2.0 | 2.8 |
| Drums | Instrumental | 0.1 | -0.6 | 1.2 | 0.8 | 1.3 | 2.9 | 3.4 | 3.7 |
| Speech | Noise | 3.0 | 2.7 | 6.0 | 2.3 | 7.8 | 7.2 | 7.6 | 8.0 |
| Noise | Speech | 3.0 | 2.8 | 5.2 | 2.7 | 6.3 | 6.4 | 6.1 | 8.1 |

## 6. Discussion

**LMM vs. Adversarial Masking:** LMM as a stand-alone technique usually performed worse than Adversarial Masking, but served as a better initialization. We speculate that mode collapse, inherent in adversarial training, makes the adversarial mask a lower bound on the $X$ source distribution. LMM can result in models that are too loose (i.e. that also encode samples outside of $X$). But as an initialization for NES, it is better to have a model that is too loose than a model that is too tight.
**Automatic Label Extraction:** To improve sample efficiency, we hypothesize that it would be possible to label only a limited set of examples as containing the target sound or not, and to use this seed dataset to train a deep sound classifier that extracts more examples from an unlabeled dataset. We leave this investigation to future work.

**Signal-Specific Losses:** To showcase the generality of our method, we chose not to encode task-specific constraints. In practical applications of our method, however, we believe that using signal-specific constraints can increase performance. Examples of such constraints include the repetitiveness of music (Rafii & Pardo, 2011), the sparsity of singing voice, and the smoothness of natural images.

**Additive and Convolutional Mixtures:** In line with most of the literature, our approach separates additive mixtures. In some settings, the mixtures are convolutional. We leave the extension of NES to this setting for future work.

**Non-Adversarial Alternatives:** The good performance of LMM vs. AM on the vocals separation task suggests that non-adversarial generative methods may be superior to adversarial methods for separation. This has also been observed in other mapping tasks, e.g. the improved performance of NAM (Hoshen & Wolf, 2018) over DCGAN (Radford et al., 2015).

## 7. Conclusions

In this paper we proposed a novel method, Neural Egg Separation, for separating mixtures of observed and unobserved distributions. We showed that careful initialization using LMM improves results in challenging cases. Our method achieved much better performance than other methods and was usually competitive with full supervision. Analytical results were presented to motivate the success of our method.

## Acknowledgements

We thank Lior Wolf for fruitful discussions and for coining the name "Egg Separation".

## References

Barker, T. and Virtanen, T. Semi-supervised non-negative tensor factorisation of modulation spectrograms for monaural speech separation. In 2014 International Joint Conference on Neural Networks (IJCNN), pp. 3556-3561. IEEE, 2014.

Bojanowski, P., Joulin, A., Lopez-Paz, D., and Szlam, A. Optimizing the latent space of generative networks. In ICML, 2018.

Boll, S. Suppression of acoustic noise in speech using spectral subtraction. TASSP, 1979.

Chung, J. S. and Zisserman, A. Lip reading in the wild. In Asian Conference on Computer Vision, 2016.

Davies, M. E. and James, C. J. Source separation using single channel ICA. Signal Processing, 87(8):1819-1832, 2007.

Denton, E. L. et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414-4423, 2017.

Gabbay, A., Ephrat, A., Halperin, T., and Peleg, S. Seeing through noise: Visually driven speaker separation and enhancement. arXiv preprint arXiv:1708.06767, 2017.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, 2016.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Hoshen, Y. and Wolf, L. NAM: Non-adversarial unsupervised domain mapping. In ECCV, 2018.

Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5(Nov):1457-1469, 2004.

Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 57-60. IEEE, 2012.
Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Deep learning for monaural speech separation. In ICASSP, 2014.

Kim, H. and Park, H. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495-1502, 2007.

Kim, J. and Park, H. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261-3281, 2011.

Kim, T., Cha, M., Kim, H., Lee, J., and Kim, J. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556-562, 2001.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2813-2821. IEEE, 2017.

Mimilakis, S. I., Drossos, K., Santos, J. F., Schuller, G., Virtanen, T., and Bengio, Y. Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask. arXiv preprint arXiv:1711.01437, 2017.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Piczak, K. J. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015-1018. ACM Press, 2015.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rafii, Z. and Pardo, B. A simple music/voice separation method based on the extraction of the repeating musical structure. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 221-224. IEEE, 2011.

Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. The MUSDB18 corpus for music separation, December 2017. URL https://doi.org/10.5281/zenodo.1117372.

Roweis, S. T. One microphone source separation. In Advances in Neural Information Processing Systems, pp. 793-799, 2001.

Smaragdis, P., Raj, B., and Shashanka, M. Supervised and semi-supervised separation of sounds from single-channel mixtures. In International Conference on Independent Component Analysis and Signal Separation, pp. 414-421. Springer, 2007.

Stoller, D., Ewert, S., and Dixon, S. Adversarial semi-supervised audio source separation applied to singing voice extraction. arXiv preprint arXiv:1711.00048, 2017.

Stöter, F.-R., Liutkus, A., and Ito, N. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pp. 293-305. Springer, 2018.

Subakan, C. and Smaragdis, P. Generative adversarial source separation. arXiv preprint arXiv:1710.10779, 2017.

Vincent, E., Gribonval, R., and Fevotte, C. Performance measurement in blind audio source separation. Trans. Audio, Speech and Lang. Proc., 14(4):1462-1469, 2006. ISSN 1558-7916.
Wang, Y., Narayanan, A., and Wang, D. On training targets for supervised speech separation. TASLP, 2014.

Wilson, K. W., Raj, B., Smaragdis, P., and Divakaran, A. Speech denoising using nonnegative matrix factorization with priors. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pp. 4029-4032. IEEE, 2008.

Yu, A. and Grauman, K. Fine-grained visual comparisons with local learning. In CVPR, 2014.

Yu, D., Kolbæk, M., Tan, Z.-H., and Jensen, J. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In ICASSP, 2017.

Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pp. 597-613. Springer, 2016.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.