DeepGEM: Generalized Expectation-Maximization for Blind Inversion

Angela F. Gao¹, Jorge C. Castellanos², Yisong Yue¹, Zachary E. Ross², Katherine L. Bouman¹
¹Computing and Mathematical Sciences, California Institute of Technology
²Caltech Seismological Laboratory, California Institute of Technology
{afgao, jcastellanos, yyue, zross, klbouman}@caltech.edu

Abstract

Typically, inversion algorithms assume that a forward model, which relates a source to its resulting measurements, is known and fixed. Using collected indirect measurements and the forward model, the goal becomes to recover the source. When the forward model is unknown or imperfect, artifacts due to model mismatch occur in the recovery of the source. In this paper, we study the problem of blind inversion: solving an inverse problem with unknown or imperfect knowledge of the forward model parameters. We propose DeepGEM, a variational Expectation-Maximization (EM) framework that can be used to solve for the unknown parameters of the forward model in an unsupervised manner. DeepGEM makes use of a normalizing flow generative network to efficiently capture complex posterior distributions, which leads to more accurate evaluation of the source's posterior distribution used in EM. We showcase the effectiveness of our DeepGEM approach by achieving strong performance on the challenging problem of blind seismic tomography, where we significantly outperform the standard method used in seismology. We also demonstrate the generality of DeepGEM by applying it to a simple case of blind deconvolution.

1 Introduction

Physics-based inversion methods typically recover an unknown source from indirect measurements by assuming that the source and measurements are related via a known forward model [15, 7, 30]. For example, non-blind deconvolution algorithms often assume that a measured blurry image is related to its true sharp image via a known spatially-invariant blur kernel [9], and traditional seismic inversion methods assume that the spatially-varying velocity of the Earth's interior is known a priori when solving for an earthquake's hypocenter [30]. However, these known forward models are generally idealized and ignore intricacies of the systems that are either hard to capture or simply unknown. Inversion with forward model mismatch results in biased reconstructions. For instance, bias is regularly seen in non-blind deconvolution results, where reconstruction artifacts are often present due to the use of an incorrect blur kernel [9]. In order to reduce the effects of model mismatch, in this paper we tackle the problem of blind inversion: solving an inverse problem without knowledge of the underlying forward model parameters.

When considering how learning can help, a natural first idea might be to learn a direct map from measurements to the desired source via supervised learning. However, such a model-free approach is generally not practical due to the lack of ground truth training data. For example, the blur kernel caused by handheld camera shake cannot be reproduced to collect a training set of sharp-blurry image pairs for a deconvolution network. In seismic tomography, synthetic earthquakes cannot be placed densely throughout the interior of the Earth to measure the ground response as a function of hypocenter location.
An alternative approach, which we adopt, is to develop unsupervised methods that treat the true forward model as something that is unobserved and must be inferred. An additional consideration is that we must solve for the true forward model from a single dataset, without knowledge of the true source that produced the measurements.

In this paper, we propose Deep Generalized Expectation-Maximization (DeepGEM) for solving blind inverse problems. Using the indirect measurements as input, DeepGEM jointly estimates the source and the forward model that together produce the observed measurements. DeepGEM is a variational-inference-based framework that makes use of deep learning machinery to easily capture and optimize complex probability distributions that cannot be easily integrated in analytic Expectation-Maximization (EM) solutions. Our proposed framework is generic and can be applied to blind inversion problems described by differentiable forward models. In Section 4 we showcase the effectiveness of our DeepGEM approach on the challenging problem of blind seismic tomography, where we significantly outperform methods used in seismology. We also demonstrate the generality of DeepGEM by applying it to blind deconvolution in Section 5.

Figure 1: DeepGEM applied to blind seismic tomography. (a) A simulated cross section of the Earth's interior (velocity structure), along with the locations of receivers on the surface (red triangles) that collect measurements. (b) The time it takes for a wave traveling from a source below the surface to reach the specified receiver is visualized for each location in the region of interest. The overlaid dots represent the true locations of simulated earthquakes and indicate the measured travel times that constrain optimization. (c) The subsurface velocity reconstruction obtained using a baseline approach optimized with the help of a seismologist. Note that the bright anomaly is missing from this reconstruction. Overlaid dots represent the inferred earthquake locations. (d) DeepGEM reconstructed subsurface velocity and inferred earthquake locations. Note that DeepGEM is able to accurately recover the gradient of the velocity field as well as partially recover the central anomaly.

2 Background and related work

The joint optimization of a forward model with source recovery is a very challenging ill-posed problem, admitting many possible solutions that are hard to disambiguate. For example, in blind deconvolution the observed blurry image can be equivalently explained by a sharp source image convolved with an extended kernel or a blurry source image convolved with an impulse kernel; to prefer one solution over another, additional information, such as image priors, must be considered. Previous work on inversion in poorly characterized systems has focused on limited contexts, such as spatially-invariant blind deconvolution [21, 22, 14, 10] and CT with a simple rotational error [35]. These methods tend to be highly specialized to each application domain and cannot be easily generalized. In contrast, we propose a flexible model-based Bayesian framework that can be applied across multiple differentiable blind inversion problems.
2.1 Model-based Bayesian inversion

In model-based inversion, unobserved sources x and observed measurements y are related through a forward model: y = f(x). When the parameters of the underlying model are unknown, one can parameterize the forward model as f_θ(x) and then solve for the true model parameters θ*. A common approach is to solve a maximum a posteriori (MAP) objective: either MAP_{θ,x} or MAP_θ. MAP_{θ,x} solves for the optimal point estimate of the pair {θ̂, x̂} that maximizes a joint objective:

$$\{\hat{\theta}, \hat{x}\} = \arg\max_{\theta, x} \left[ \log p(\theta, x \mid y) \right] = \arg\max_{\theta, x} \left[ \log p(y \mid \theta, x) + \log p(\theta) + \log p(x) \right]. \tag{1}$$

In practice, formal probabilistic definitions of p(θ) and p(x) are often unknown and are replaced with regularization terms (e.g., total variation, sparsity) [22, 21]. Although MAP_{θ,x} provides a straightforward approach to solve for θ, the energy landscape of Eq. 1 is typically poorly behaved due to the ill-posed nature of the problem [22]. As a result, optimization is likely to get stuck in a (bad) local minimum.

MAP_θ attempts to smooth the energy landscape by reducing the number of parameters that must be optimized. This is done by solving for the parameters of the forward model, θ, that perform best under the full volume of possible x interpretations:

$$\hat{\theta} = \arg\max_{\theta} \left[ \log p(\theta \mid y) \right] = \arg\max_{\theta} \left[ \log \int_x p(y \mid \theta, x)\, p(x)\, dx + \log p(\theta) \right]. \tag{2}$$

Since this marginalization integral is often intractable, Expectation-Maximization (EM) algorithms have long been used to solve MAP_θ efficiently [11]. EM is an iterative algorithm that alternates between: (E-step) calculating the posterior of x conditioned on the current estimated forward model parameters θ^(t−1); and (M-step) updating θ to maximize the expected value of the log-likelihood:

$$\theta^{(t)} = \arg\max_{\theta} \; \mathbb{E}_{x \sim p(x \mid y, \theta^{(t-1)})} \left[ \log p(y \mid \theta, x) \right] + \log p(\theta). \tag{3}$$

The advantage of MAP_θ over MAP_{θ,x} optimization for the blind deconvolution problem was described in [22], and its success was demonstrated via EM optimization in [21]. However, it is important to note that evaluating the expectation in Eq. (3) over complex distributions is often intractable. For instance, the authors of [21] were forced to restrict the posterior distribution to a Gaussian. Alternatively, stochastic EM methods [5, 8] bypass the need to evaluate the expectation directly, approximating it by sampling the distribution. In this paper, we solve MAP_θ using complex distributions parameterized by deep neural networks.
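To make the alternation in Eq. (3) concrete, consider a toy scalar blind inversion y_i = θx_i + ε_i with a standard Gaussian prior on x_i, where the E-step posterior is Gaussian in closed form. This tractability is exactly what breaks down for the complex posteriors considered below; the sketch is illustrative only and is not from the paper:

```python
import numpy as np

# Toy blind inversion: y_i = theta * x_i + eps_i with x_i ~ N(0, 1) and
# eps_i ~ N(0, sigma2). The E-step posterior p(x_i | y_i, theta) is then
# Gaussian in closed form, so classic EM (Eq. 3, with a flat prior on
# theta) can be run exactly.
rng = np.random.default_rng(0)
theta_true, sigma2 = 2.5, 0.1
x = rng.standard_normal(500)
y = theta_true * x + np.sqrt(sigma2) * rng.standard_normal(500)

theta = 0.5                                  # deliberately poor initialization
for t in range(50):
    # E-step: p(x_i | y_i, theta) = N(m_i, v), combining the likelihood
    # precision theta^2 / sigma2 with the prior precision 1.
    v = 1.0 / (theta**2 / sigma2 + 1.0)
    m = v * theta * y / sigma2
    # M-step: argmax_theta E[log p(y | theta, x)] has a closed form in
    # the posterior moments E[x_i] = m_i and E[x_i^2] = m_i^2 + v.
    theta = np.sum(y * m) / np.sum(m**2 + v)

print(f"recovered theta = {theta:.3f} (true value {theta_true})")
```

Each M-step only needs posterior statistics of x; when the posterior has no closed form, those statistics must instead be estimated from samples, which is the role the normalizing flow plays below.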
Forward model parameterization: Model-based inversion requires that the parametric form of f_θ(x) is well matched to the true forward model, which is not always known. Alternatively, neural networks can be used to approximate the forward model, where θ represents the network weights. This setup is flexible, in that it can be used to approximate arbitrarily complex forward models, with the downside that it is often not interpretable when the parameters have no physical meaning. In this work, we develop and make use of interpretable, physically-motivated deep neural networks to parameterize f_θ(·) for the problems of blind seismic tomography and blind deconvolution.

2.2 Deep variational distributions

We are interested in solving blind inverse problems using the EM algorithm to optimize the MAP_θ objective. Optimizing in this fashion requires the use of the posterior distribution p(x | y, θ^(t−1)) when evaluating Eq. 3. As inverse problems are often ill-posed, we expect that the posterior distribution of the source x conditioned on the forward model parameters θ is likely to be complex and perhaps even multi-modal. Therefore, it is important to be able to parameterize a flexible family of distributions to best estimate this conditional posterior.

Deep Probabilistic Imaging (DPI) [32] uses a normalizing-flow-based generative model, G_φ(·), to solve for the uncertainty of an unknown source x given a fixed forward model f(x) and measurements y. DPI solves the variational objective:

$$\hat{\phi} = \arg\min_{\phi} \mathrm{KL}\left(q_\phi(x) \,\|\, p(x \mid y)\right) = \arg\min_{\phi} \mathbb{E}_{x \sim q_\phi(x)} \left[ -\log p(y \mid x) - \log p(x) + \log q_\phi(x) \right], \tag{4}$$

where q_φ(x) is the implicit distribution defined by G_φ(z) for z ∈ R^{|x|} ∼ N(0, 1), with |x| the dimension of x, and log p(y | x) ∝ −||y − f(x)||² + c when there is i.i.d. Gaussian noise on the measurements y. After inference, q_φ(x) can be efficiently sampled by evaluating G_φ(z) for a Gaussian sample z.

The DPI variational objective is equivalent to the Variational Autoencoder (VAE) objective [17, 28], except with a fixed decoder, f(x). In practice, vanilla VAEs constrain the posterior to be a Gaussian distribution, relying on the reparameterization trick for tractable optimization. Alternatively, DPI uses flow-based networks to efficiently sample from and directly compute q_φ(x) [12, 13, 18]. DPI's use of a flow-based network allows for complicated and multi-modal posterior distributions, constrained only by the space of possible invertible network architectures. Our proposed DeepGEM approach utilizes similar tools to model flexible distributions over x, while simultaneously learning the forward model parameters θ.
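To illustrate the machinery, the sketch below builds a small affine-coupling flow in PyTorch and forms the Monte Carlo estimate of the DPI objective in Eq. (4). It is a minimal stand-in for the Real-NVP architecture used in this work; the layer sizes, the forward model f, and the noise level sigma_y are placeholder assumptions:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP-style coupling layer: half the variables are shifted and
    scaled by functions of the other half, keeping the map invertible."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z1).chunk(2, dim=1)
        s = torch.tanh(s)                      # bounded log-scales for stability
        x2 = z2 * torch.exp(s) + t
        return torch.cat([z1, x2], dim=1), s.sum(dim=1)  # output, log|det J|

class Flow(nn.Module):
    """G_phi: stacked couplings; flipping coordinates between layers ensures
    every coordinate is eventually transformed (a |det|=1 permutation)."""
    def __init__(self, dim, n_layers=8):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])

    def forward(self, z):
        logdet = z.new_zeros(z.shape[0])
        for layer in self.layers:
            z, ld = layer(z)
            z, logdet = z.flip(1), logdet + ld
        return z, logdet

def dpi_loss(flow, f, y, sigma_y, n_samples=32):
    """Monte Carlo estimate of Eq. (4), up to additive constants."""
    z = torch.randn(n_samples, flow.dim)
    x, logdet = flow(z)
    log_q = -0.5 * z.pow(2).sum(1) - logdet    # change of variables: log q_phi(x)
    log_lik = -(y - f(x)).pow(2).sum(1) / (2 * sigma_y**2)  # log p(y|x) + const
    return (log_q - log_lik).mean()            # subtract log p(x) here if a prior exists
```

Minimizing dpi_loss with respect to the flow's parameters drives q_φ(x) toward the posterior; the same pieces reappear inside the E-step below.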
3 DeepGEM

We propose a deep variational EM approach (DeepGEM) that optimizes the MAP_θ objective in Eq. 2 to recover the parameters of a forward model f_θ(x) using only a single static set of measurements y. Once learned, the updated forward model can then be used to estimate the posterior distribution of the unknown source, x. DeepGEM iterates between two stages that are inspired by the standard EM algorithm: (1) an E-step that learns a variational distribution, q_{φ^(t)}(x), to approximate the posterior distribution of x given the current forward model parameters θ^(t−1), and (2) an M-step (refer to Eq. 3) that solves for the θ^(t) that maximizes the expected value of the log-likelihood of θ with respect to the posterior distribution q_{φ^(t)}(x) estimated in the prior E-step. The steps are alternated, each solved to convergence. Since this is an EM algorithm (up to the variational approximation), our method inherits the properties of EM, including convergence to a local optimum of Eq. 2.

3.1 Expectation step (E-step)

The goal of DeepGEM's E-step is to solve for the posterior distribution p(x | y, θ^(t−1)) that facilitates optimizing Eq. 3. Because this posterior distribution can be very complex, and even multi-modal, we propose to use a flexible variational approach to learn the parameters φ of an approximate posterior distribution q_φ(x). The variational distribution q_φ(x) can then be used to evaluate Eq. 3 via efficient sampling. Using DPI, we solve for a flexible variational distribution q_φ(x) that well approximates the posterior p(x | y, θ^(t−1)). DPI uses a normalizing flow network, G_φ(z), with input z ∈ R^{|x|}, where x = G_φ(z) ∼ q_φ(x) when z ∼ N(0, 1). Normalizing flow networks allow for exact computation of the log-likelihood log q_φ(x) needed to solve

$$\phi^{(t)} = \arg\min_{\phi} \mathrm{KL}\left(q_\phi(x) \,\|\, p(x \mid y, \theta^{(t-1)})\right) \approx \arg\min_{\phi} \frac{1}{N} \sum_{n=1}^{N} \left[ -\log p(y \mid \theta^{(t-1)}, x_n) - \log p(x_n) + \log q_\phi(x_n) \right] \quad \text{for } x_n = G_\phi(z_n),\; z_n \sim N(0, 1), \tag{5}$$

(as derived from Eq. 4) for a batch size of N, where log p(x) is a prior on the source and log p(y | x, θ^(t)) is the data likelihood. When assuming the measurements y experience i.i.d. additive Gaussian noise with standard deviation σ_y, log p(y | θ^(t), x_n) = −(1/(2σ_y²)) ||y − f_{θ^(t)}(x_n)||² + c.

3.2 Maximization step (M-step)

The goal of DeepGEM's M-step is to use the parameterized approximate posterior distribution, q_{φ^(t)}(x), from the prior E-step to update θ, the parameters of the unknown forward model f_θ(·). This is achieved by sampling from the learned normalizing flow network, G_{φ^(t)}(·), to stochastically solve Eq. 3:

$$\theta^{(t)} = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \left[ \log p(y \mid \theta, x_n) \right] + \log p(\theta) \quad \text{for } x_n = G_{\phi^{(t)}}(z_n),\; z_n \sim N(0, 1), \tag{6}$$

where p(θ) is a prior on the forward model. This prior can be used to encourage the forward model parameters to remain close to an initial model θ̄ by defining log p(θ) ∝ −||θ − θ̄||² + c.
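Schematically, one DeepGEM run then alternates the two steps as below (reusing the Flow sketch above, and assuming f_theta is an nn.Module holding the forward model's trainable parameters; the prior callables log_p_x and log_p_theta, epoch counts, and batch sizes are illustrative assumptions, not the exact training code):

```python
import torch

def deepgem(flow, f_theta, y, sigma_y, log_p_x, log_p_theta,
            em_iters=10, e_epochs=800, m_epochs=2000, batch=32):
    """Alternates Eq. (5) (E-step over phi) and Eq. (6) (M-step over theta)."""
    opt_phi = torch.optim.Adam(flow.parameters(), lr=1e-3)
    opt_theta = torch.optim.Adam(f_theta.parameters(), lr=5e-5)

    def log_lik(x):  # log p(y | theta, x) up to a constant
        return -(y - f_theta(x)).pow(2).sum(1) / (2 * sigma_y**2)

    for t in range(em_iters):
        for _ in range(e_epochs):        # E-step (Eq. 5): update phi only
            z = torch.randn(batch, flow.dim)
            x, logdet = flow(z)
            log_q = -0.5 * z.pow(2).sum(1) - logdet
            loss = (log_q - log_lik(x) - log_p_x(x)).mean()
            opt_phi.zero_grad(); opt_theta.zero_grad()
            loss.backward()
            opt_phi.step()               # theta's gradients are discarded

        for _ in range(m_epochs):        # M-step (Eq. 6): update theta only
            with torch.no_grad():        # samples x_n ~ q_phi are held fixed
                x, _ = flow(torch.randn(batch, flow.dim))
            loss = -(log_lik(x).mean() + log_p_theta())
            opt_theta.zero_grad()
            loss.backward()
            opt_theta.step()
    return flow, f_theta
```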
4 Case study: blind seismic tomography

Two fundamental seismic inverse problems are spatially localizing an earthquake's hypocenter (also referred to as source localization) and tomographic reconstruction of the Earth's subsurface [30]. These problems are interconnected: source localization relies on knowing how fast waves propagate through different regions of the Earth's interior, referred to as the Earth's velocity structure. However, in standard seismological practice, due to difficulties in solving these problems jointly, they are generally treated separately: source localization is performed first using oversimplified velocity models [30], and tomography is then performed by taking those best-fitting hypocenters as ground truth [26]. This approach typically results in the need to over-regularize the inverse problem by smoothing out high-frequency information [2], and can only be improved by carefully incorporating other forms of information, such as waveform-derived quantities [16, 23, 4, 24], or by performing costly experiments such as controlled explosions. In contrast, we demonstrate our DeepGEM approach on blind seismic tomography, solving for the subsurface velocity (parameterized by θ) when the source hypocenters, x, are unknown. The measurements, y, used to constrain the inverse problem are the times it takes for the first wave to propagate from its source to a receiver on the Earth's surface, referred to as travel time measurements (refer to Fig. 1).

4.1 Seismic tomography background

Physics of earthquake source localization: The earthquake source location, also called the hypocenter, is the location where the earthquake nucleates [30]. The source location can be triangulated using travel times from multiple receivers. However, when there are very few receivers (< 3 in 3D, < 2 in 2D), source localization is ill-posed and there exist multiple equally optimal solutions.

Physics of travel time tomography: Travel time tomography is a method for reconstructing the velocity structure using the absolute arrival times of earthquake waves from the earthquake to the receiver [30]. Given perfect knowledge of the earthquake locations x and receivers r at every position in the ground, the exact velocity can be computed by solving the Eikonal equation

$$V(r) = \left\| \nabla_r T(x, r) \right\|_2^{-1},$$

where T(x, r) is the travel time from an earthquake to a receiver. There exists an analytical solution to the Eikonal equation; however, seismologists often use simplifications, such as straight-ray tomography, for efficiency [2].

Deep learning for travel time tomography: EikoNet [31] implicitly solves the Eikonal equation [33] by learning a mapping from a source-receiver pair (x, r) to its associated travel time, learning θ such that f_θ(x, r) ≈ T(x, r), where T(x, r) is the true travel time. The velocity structure can then be extracted from the learned EikoNet by solving the Eikonal equation through automatic differentiation. In particular, the computed velocity V(s) at location s is ||∇_s f_θ(x, s)||₂⁻¹. Note that this computed velocity depends on the source location x. Ideally, the velocity should be invariant to the source location; however, in practice, this is only true when EikoNet is trained with densely-sampled (x, r) pairs. Additionally, the travel time from a source to a receiver, T(x, r), should be identical to the travel time from a receiver to a source, T(r, x), but this symmetry is not constrained by EikoNet.
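The velocity extraction described above amounts to a single automatic differentiation call. A minimal sketch, assuming f_theta accepts batched source and query coordinates and returns one travel time per pair:

```python
import torch

def velocity(f_theta, x_src, s):
    """V(s) = || grad_s f_theta(x_src, s) ||_2^{-1}, the Eikonal relation.

    x_src, s: (N, d) tensors of source and query locations.
    create_graph=True keeps the result differentiable in theta, so these
    velocities can also appear inside prior losses on theta."""
    s = s.detach().requires_grad_(True)
    T = f_theta(x_src, s)                  # (N,) travel times
    (grad_s,) = torch.autograd.grad(T.sum(), s, create_graph=True)
    return 1.0 / grad_s.norm(dim=-1)
```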
Figure 2: (Left) A visualization of the homogeneous, gradient, and layer models used in p(θ)'s L_θ term. (Right) Visualizations used to describe source configurations and earthquake source posteriors. (a) The true and initialized source locations are plotted as stars (true) connected to circles, which indicate the expected source locations according to p(x). Note that the expected source locations deviate significantly from the true locations. (b) The learned posterior distribution, q_φ(x | y, θ), for each source is plotted as a colored histogram and overlaid with a star (of the same color) indicating the true source location.

Figure 3: DeepGEM reconstructions significantly outperform baselines, and improve with more sources and better L_θ prior models. Each row corresponds to the reconstructions obtained from a single noisy observation of travel times, where the true Earth velocity is shown in Fig. 1(a). Results shown are simulated using 20 surface receivers and a varying number of sources (9, 25, and 49) in a uniform grid. Columns 3-6 show DeepGEM results obtained using different L_θ priors. Note that results improve as the L_θ prior becomes closer to the true velocity structure, and as the number of sources increases. As a reference, column 2 shows the velocity reconstruction obtained using DeepGEM under fixed, perfectly known source locations. Columns 7-8 show results obtained by the baseline approaches. The velocity reconstruction MSE is included in the top right of each reconstruction. DeepGEM substantially outperforms both the straight-ray and MAP_{θ,x} baselines. Note that DeepGEM also sometimes outperforms the second column, where the source locations are known; this is because DeepGEM is sometimes able to absorb measurement error due to its overparameterization. See the supplemental material for more examples.

4.2 DeepGEM setup for blind seismic tomography

For blind seismic tomography, we parameterize the forward model using a modification of EikoNet, f_θ(x, r), with unknown source location x and known receiver location r as inputs, and y ≈ T(x, r), the absolute travel time between x and r, as output. In order to solve Eqs. 5 and 6 for an updated forward model, we must define the priors p(x) and p(θ). The prior over source locations, p(x), is often well defined: typically a Gaussian distribution N(x̄, σ_x) with mean x̄ and a standard deviation of σ_x ≈ 2 km. We construct a prior over the forward model, p(θ), that encourages EikoNet to learn a velocity close to a prior model V̄(s). Additionally, as discussed above, there are constraints specific to seismic tomography that are not explicitly enforced through EikoNet's architecture: (1) velocity reconstruction invariance with respect to the source location, and (2) travel time symmetry between sources and receivers. We augment the prior p(θ) to include these constraints, implemented as soft constraints:

$$\log p(\theta) = -\lambda_\theta \underbrace{\sum_{s \in S} \left\| \bar{V}(s) - V_r(s) \right\|^2}_{\mathcal{L}_\theta} \; - \; \lambda_V \underbrace{\sum_{\substack{r_i, r_j \in R \\ s \in S}} \left\| V_{r_i}(s) - V_{r_j}(s) \right\|^2}_{\mathcal{L}_V} \; - \; \lambda_T \underbrace{\sum_{\substack{s \in S \\ r \in R}} \left\| T(s, r) - T(r, s) \right\|^2}_{\mathcal{L}_T}. \tag{7}$$

The velocity constraint is represented through L_V and the travel time constraint through L_T, with corresponding hyperparameters λ_V and λ_T. S is a set of points sampled uniformly from the region of interest, R is the set of all receiver locations, and V_r(s) = ||∇_s f_θ(r, s)||₂⁻¹.
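These soft constraints translate directly into penalty terms built from the velocity function sketched earlier. A rough rendering of Eq. (7) under assumed tensor shapes (S as sampled query points, R as receiver locations, V_prior as the chosen prior velocity model); the exact pairing of the sums is our reading of the equation, not verified against the authors' code:

```python
import torch

def log_p_theta(f_theta, V_prior, S, R, lam_th, lam_V, lam_T):
    """Soft-constraint log-prior of Eq. (7). S: (Ns, d) sampled points,
    R: (Nr, d) receivers, V_prior: callable mapping points to velocities."""
    # Velocity at every point in S as seen from each receiver: (Nr, Ns)
    V = torch.stack([velocity(f_theta, r.expand(len(S), -1), S) for r in R])
    L_th = (V - V_prior(S)).pow(2).sum()                       # stay near prior model
    L_V = (V.unsqueeze(0) - V.unsqueeze(1)).pow(2).sum() / 2   # source-invariance
    # Travel-time symmetry T(s, r) = T(r, s) over all (s, r) pairs
    Sg = S.repeat_interleave(len(R), dim=0)
    Rg = R.repeat(len(S), 1)
    L_T = (f_theta(Sg, Rg) - f_theta(Rg, Sg)).pow(2).sum()
    return -(lam_th * L_th + lam_V * L_V + lam_T * L_T)
```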
Implementation details: In our experiments we define a realistic prior over the unknown source locations as p(x) = N(x̄, σ_x), where the prior mean x̄ is itself drawn from N(x, σ_x) around the true source location x, and σ_x = 2 km. We assume the measurements y are sampled from a Gaussian distribution whose mean is the true travel time, computed using the package eikonalfm [29, 33], with a realistic standard deviation of σ_y = 0.2 seconds. To simulate real-world passive tomography, we assume receivers are located only along the surface (the top edge of the 2D image) of the region of interest, which is 20 km × 20 km, and that sources can be anywhere within the region of interest. Before DeepGEM optimization, we first initialize θ using maximum likelihood estimation assuming x̄ is the true source location.

Table 1: Velocity and source localization error on random source configurations.

    Source Count   Velocity Error (km/s)   Source Loc. Error (m)
    9              0.52 ± 0.063            62.88 ± 21.77
    25             0.42 ± 0.043
    49             0.44 ± 0.181
    100            0.27 ± 0.021

Figure 4: DeepGEM consistently recovers prominent features across various earthquake source configurations. Row 2 shows DeepGEM results obtained using different random source configurations, where the true Earth velocity is shown in Fig. 1(a). As a reference, row 1 shows the velocity reconstruction obtained using DeepGEM under fixed, perfectly known source locations. As can be seen in the table, both velocity and localization error decrease with an increasing number of sources in the region of interest. Each mean and standard deviation is computed using 5 random realizations of true source configurations.

The posterior distribution of the source locations, x, is estimated using a Real-NVP network G_φ(·) with 16 affine coupling layers. An updated EikoNet (described in the supplemental material) has been modified to parameterize f_θ(x). This EikoNet is pretrained with samples from the prior p(x) as input and the simulated travel time measurements as output. We use Adam as the optimizer [36] with a batch size of 32, an E-step learning rate of 1e-3, and an M-step learning rate of 5e-5. The hyperparameters (λ_T, λ_V, λ_θ) were chosen empirically by inspecting the loss over a grid search. Results presented were run with 10 EM iterations, each with 800 E-step epochs and 2000 M-step epochs. Each DeepGEM model takes 6 hours on an NVIDIA Quadro RTX 5000.

4.3 Results

Comparison to Baseline Approaches: We compare results from DeepGEM to results obtained using a baseline run by a seismologist. This iterative baseline alternates between source inversion and straight-ray tomography, and is the standard approach used for blind tomography. Further detail on this baseline is provided in the supplemental material. The gradient model shown in Fig. 2 is used to perform the initial source inversion for this baseline; nonetheless, we find that the solution quickly diverges. We therefore rely on the expertise of the domain expert to decide when to terminate the optimization. As seen in Fig. 3, DeepGEM consistently outperforms this human-in-the-loop baseline across all source configurations. In Fig. 3 we also compare with a MAP_{θ,x} solution. MAP_{θ,x} is consistently outperformed by the DeepGEM MAP_θ approach across all source configurations. Additional results highlighting the sensitivity of MAP_{θ,x} to hyperparameters and initialization are shown in the supplemental material.

Sensitivity to L_θ Prior Choice: We evaluate DeepGEM's recovery of the true velocity structure shown in Fig. 1 using one of the three L_θ priors shown in Fig. 2 (homogeneous, gradient, and layer), as well as L_θ = 0. The homogeneous model takes on a value of 6.419 km/s, the average of the true velocity structure. The gradient model captures the increase in velocity with depth, and the layer model represents the true model without the added anomaly. As shown in Fig. 3, the mean squared error tends to decrease with the gradient and layer L_θ prior models, which are closer to the true velocity structure.

Sensitivity to Source Configuration: We evaluate results with sources that are both uniformly and randomly spaced throughout the region of interest. Improved performance is expected when the number of sources increases and/or when the sources are well distributed. To better understand DeepGEM's performance we introduce an ablation, where the velocity structure is learned by training f_θ(·) with access to perfect source locations x (i.e., p(x) is a delta function). As shown in Fig. 3, even when training with true source locations, the anomaly is not well resolved until 49 sources are used.

Figure 5 [panels compare reconstructions with 100 and 961 sources, with and without measurement noise and with known vs. unknown source locations; the rightmost panel shows a source posterior]: DeepGEM recovery with a single receiver. Velocity reconstructions shown in columns 3 and 5 demonstrate that DeepGEM is able to learn some of the true velocity features (see Fig. 1(a)), even when limited to measurements from a single receiver. However, these reconstructions show clear signs of overfitting to the measurement data. This is demonstrated by observing that the DeepGEM reconstruction with perfectly known source locations is significantly better with a single receiver when no measurement error is included on the measurements (compare columns 1 and 2). Note that the learned posterior in column 4 appears to be non-Gaussian, but not multimodal.
Table 2: Velocity and source localization error on random velocity fields.

    L_θ Prior                      Velocity Error (km/s)   Source Loc. Error (m)
    None (& known source locs.)    0.57 ± 0.25             NA
    None                           0.51 ± 0.30
    Homogeneous                    0.94 ± 0.45
    Gradient                       1.07 ± 0.53

Figure 6 [panels: true field and reconstructions with the gradient L_θ prior and no L_θ prior for three random models, with true and prior source locations overlaid]: Performance of DeepGEM recovery on random velocity fields. Ten random velocity fields were drawn from a GRF-based distribution and used to simulate travel time measurements with 20 receivers and 100 randomly placed sources. (Left) Reconstructions obtained for 3 of these configurations. (Right) A table lists the mean and standard deviation of the velocity and source error obtained across the ten models, each recovered using different L_θ priors.

Fig. 3 shows one realization of DeepGEM results obtained with different counts of uniformly spaced sources. As expected, the MSE between the reconstructed and true velocity structure tends to decrease as the number of sources increases. Fig. 4 shows results obtained using five randomly generated source configurations with 49 sources each. These results demonstrate that, although the reconstructed velocity structure is somewhat sensitive to the underlying source configuration, the primary features of the true velocity can still be recovered in all cases. The velocity and source localization errors obtained for different random configurations are shown in the accompanying table.

Sensitivity to Number of Receivers: In the case of a single receiver, there exists an entire ring of source locations that result in the same travel time measurement. Fig. 5 contains results from this challenging one-receiver setting using DeepGEM with/without measurement noise and with/without known source locations. Since there is only one receiver, the velocity model is able to easily overfit. However, perhaps surprisingly, artifacts are more severe when the source locations are perfectly constrained. These artifacts are caused by the velocity model overfitting to noise in the travel time measurements, and are substantially reduced when noise-free travel time measurements are used. Note that the recovered source location posterior q_φ(x) obtained by the E-step is non-Gaussian.

Sensitivity to Velocity Structure: In Fig. 6, DeepGEM is tested on randomly generated velocity fields, each generated from a Gaussian random field (GRF) described in the supplemental material. These results show that DeepGEM works well at recovering the primary features across a variety of velocity structures. Compared to L_θ = 0, the gradient L_θ prior biases the reconstruction towards smoother velocity structures. The accompanying table contains error statistics for the ten randomly generated velocity fields.

5 Case study: blind deconvolution

We apply DeepGEM to the problem of blind deconvolution to further demonstrate the generality of our approach. Blind deconvolution is a classic ill-posed imaging problem that aims to reconstruct a sharp image from a blurry image with an unknown PSF [19, 14, 25, 21, 22, 20, 6, 1, 27]. Blurry images caused by handheld camera shake can be modeled using a single spatially-invariant blur kernel:

$$y = x \ast k_\theta + \varepsilon \quad \text{for } \varepsilon \sim N(0, \sigma), \tag{8}$$

where y is the blurry image, ∗ represents 2D convolution, x is the true sharp image, k_θ is the spatially-invariant blur kernel, and ε is additive Gaussian noise.

5.1 DeepGEM setup for blind deconvolution

For blind deconvolution, we parameterize the forward model f_θ(·) using a deep network consisting of multiple convolution layers without non-linear activations, as proposed in [3]. Multiple convolutional layers without activations simply overparameterize a linear blur kernel, which has been empirically shown to produce multiple good minima that are easier to converge to. To ensure the blur kernel is non-negative and volume preserving, we use a Softmax layer and normalize the kernel. For an n-layer network with weights θ_i for i = 1, ..., n, the resulting parameterized forward model is:

$$f_\theta(x) = x \ast \hat{k}_\theta = x \ast \frac{\mathrm{Softmax}(\theta_1 \ast \theta_2 \ast \cdots \ast \theta_n)}{\left\| \mathrm{Softmax}(\theta_1 \ast \theta_2 \ast \cdots \ast \theta_n) \right\|_1}. \tag{9}$$
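A minimal sketch of this kernel parameterization in PyTorch; the layer count and filter sizes are illustrative, F.conv2d's cross-correlation is treated as convolution since the filters are learned, and the explicit ℓ1 division in Eq. (9) is dropped because a Softmax output already sums to one:

```python
import torch
import torch.nn.functional as F

class BlindKernel(torch.nn.Module):
    """Overparameterized blur kernel: n small linear filters composed by
    convolution, then a Softmax for non-negativity and unit mass (Eq. 9)."""
    def __init__(self, n_layers=5, size=5):
        super().__init__()
        self.thetas = torch.nn.ParameterList(
            [torch.nn.Parameter(0.01 * torch.randn(1, 1, size, size))
             for _ in range(n_layers)])

    def kernel(self):
        k = self.thetas[0]
        for th in self.thetas[1:]:
            # each composition grows the support by (size - 1);
            # five 5x5 filters yield a 21x21 kernel here
            k = F.conv2d(k, th, padding=th.shape[-1] - 1)
        shape = k.shape
        return torch.softmax(k.flatten(), dim=0).view(shape)

    def forward(self, x):
        # f_theta(x) = x * k_hat for a batch of images x: (B, 1, H, W)
        k = self.kernel()
        return F.conv2d(x, k, padding=k.shape[-1] // 2)
```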
Figure 7: DeepGEM results are comparable to inversion with known blur kernels for simple Fashion-MNIST images. Blurry measured images (column 1) are generated using the true sharp image and blur kernel (shown in column 2). The deconvolved images obtained using the true blur kernel are shown in column 3. Reconstructed deconvolved images and the corresponding inferred blur kernels from DeepGEM are shown in column 4.

Implementation details: We demonstrate DeepGEM using simple Total Variation (TV) regularization in place of log p(x). We assume a Gaussian prior on the noise on the blurry measurements, ε ∼ N(0, 0.01), as well as a sparsity prior on the reconstructed kernel through an ℓ_{0.8} soft constraint. The posterior distribution of the sharp image, x, is estimated using a Real-NVP network G_φ(·) with 16 affine coupling layers. We use 5 convolution layers to parameterize k_θ. We use Adam as the optimizer [36] with a batch size of 64, an E-step learning rate of 5e-4, and an M-step learning rate of 1e-4. The hyperparameters, i.e., the weights on the sparsity and TV priors, were chosen empirically via a grid search. Results presented were run with 10 EM iterations, each with 400 E-step epochs and 400 M-step epochs, which takes 1 hour on an NVIDIA Tesla V100.

5.2 Results

In Fig. 7 we show results from DeepGEM on three different Fashion-MNIST [34] images. The Fashion-MNIST images are 64 × 64 pixels in size and each blur kernel is contained within 15 × 15 pixels. We choose to use images from Fashion-MNIST because they are compatible with the TV prior used in place of log p(x). In contrast, other natural images, such as those from [22] (which were originally curated for deconvolution), are not compatible with this handcrafted prior. While this highlights that handcrafted priors such as TV are often insufficient, we leave addressing this to future work. Please refer to the supplemental material for more details about the performance of a TV prior on different types of images.

The blurry images in the first column of Fig. 7 exhibit artifacts from the blur kernel in the form of repeated features, caused by the bi-modal blur kernel. The images recovered via DeepGEM are much sharper and are comparable to the images reconstructed with the known blur kernel. Additionally, DeepGEM reconstructs blur kernels (i.e., k_θ) that are similar to the true kernel shape, with two lobes along the same diagonal. Note that even with perfect knowledge of the blur kernel, detailed structures in the images are not recovered due to the TV prior.

6 Conclusions

In this paper we present DeepGEM, a deep probabilistic framework for tackling blind inverse problems through estimation of the forward model.
DeepGEM achieves strong performance on the task of joint seismic tomography and earthquake source localization, substantially outperforming standard approaches currently used in seismology on synthetic data. The proposed framework is flexible and can be applied to different applications that require estimation or fine-tuning of forward model parameters. We demonstrate this flexibility by also applying the approach to a simple, but challenging, blind deconvolution problem. Future work includes applying this method to real seismic data, extending to other applications, and incorporating data-driven priors. Our results highlight the benefit of blending physically sound model-based techniques with learning machinery for blind inverse problems.

Broader Impact

DeepGEM can be used to solve for a system's model mismatch, which can then help improve our understanding of complex physical systems. However, this tool is not yet trustworthy enough for safety-critical systems. Nonetheless, this approach can benefit society through a better understanding of fundamental science and advanced earthquake prediction models via seismic imaging.

Acknowledgments and Disclosure of Funding

This research was carried out at the Jet Propulsion Laboratory and the California Institute of Technology under a contract with the National Aeronautics and Space Administration and funded through the President's and Director's Research and Development Fund (PDRDF). This work was sponsored by Beyond Limits, Jet Propulsion Laboratory Award 1669417, NSF Award 2048237, and generous gifts from Luke Wang and Yi Li. Additionally, we would like to thank He Sun for many helpful discussions. We declare no competing interests.

References

[1] Raied Aljadaany, Dipan K. Pal, and Marios Savvides. Douglas-Rachford networks: Learning both the image prior and data fidelity terms for blind image deconvolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10235–10244, 2019.
[2] M. P. Barmin, M. H. Ritzwoller, and A. L. Levshin. A fast and reliable method for surface wave tomography, pages 1351–1375. Birkhäuser Basel, Basel, 2001.
[3] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind super-resolution kernel estimation using an internal-GAN. CoRR, abs/1909.06581, 2019.
[4] Ebru Bozdağ, Jeannot Trampert, and Jeroen Tromp. Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements. Geophysical Journal International, 185(2):845–870, 2011.
[5] Olivier Cappé and Eric Moulines. On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, Jun 2009.
[6] Ayan Chakrabarti. A neural approach to blind motion deblurring. In European Conference on Computer Vision, pages 221–235. Springer, 2016.
[7] Chengqian Che, Fujun Luan, Shuang Zhao, Kavita Bala, and Ioannis Gkioulekas. Inverse transport networks. arXiv preprint arXiv:1809.10820, 2018.
[8] Jianfei Chen, Jun Zhu, Yee Whye Teh, and Tong Zhang. Stochastic expectation maximization with variance reduction. In NeurIPS, pages 7978–7988, 2018.
[9] S. Cho, Jue Wang, and S. Lee. Handling outliers in non-blind image deconvolution. In 2011 International Conference on Computer Vision, pages 495–502, 2011.
[10] Sunghyun Cho and Seungyong Lee. Fast motion deblurring. In ACM SIGGRAPH Asia 2009 Papers, SIGGRAPH Asia '09, New York, NY, USA, 2009. Association for Computing Machinery.
[11] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[12] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation, 2015.
[13] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP, 2017.
[14] Rob Fergus, Barun Singh, Aaron Hertzmann, Sam T. Roweis, and William T. Freeman. Removing camera shake from a single photograph. In ACM SIGGRAPH 2006 Papers, SIGGRAPH '06, pages 787–794, New York, NY, USA, 2006. Association for Computing Machinery.
[15] Jeffrey A. Fessler. Model-based image reconstruction for MRI. IEEE Signal Processing Magazine, 27(4):81–89, 2010.
[16] Andreas Fichtner, Brian L. N. Kennett, Heiner Igel, and Hans-Peter Bunge. Theoretical background for continental- and global-scale full-waveform inversion in the time-frequency domain. Geophysical Journal International, 175(2):665–685, 2008.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[19] D. Kundur and D. Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.
[20] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8183–8192, 2018.
[21] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Efficient marginal likelihood optimization in blind deconvolution. In CVPR 2011, pages 2657–2664, 2011.
[22] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding and evaluating blind deconvolution algorithms. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1964–1971. IEEE, 2009.
[23] Jingrui Luo and Ru-Shan Wu. Seismic envelope inversion: reduction of local minima and noise resistance. Geophysical Prospecting, 63(3):597–614, 2015.
[24] Yi Luo and Gerard T. Schuster. Wave-equation traveltime inversion. Geophysics, 56(5):645–653, 1991.
[25] D. Perrone and P. Favaro. Total variation blind deconvolution: The devil is in the details. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2909–2916, Los Alamitos, CA, USA, Jun 2014. IEEE Computer Society.
[26] Nicholas Rawlinson, S. Pozgay, and S. Fishwick. Seismic tomography: a window into deep Earth. Physics of the Earth and Planetary Interiors, 178(3-4):101–135, 2010.
[27] Dongwei Ren, Kai Zhang, Qilong Wang, Qinghua Hu, and Wangmeng Zuo. Neural blind deconvolution using deep priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3341–3350, 2020.
[28] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR, 2014.
[29] J. A. Sethian. Fast marching methods. SIAM Review, 41(2):199–235, 1999.
[30] Peter M. Shearer. Introduction to Seismology. Cambridge University Press, 2019.
[31] Jonathan D. Smith, Kamyar Azizzadenesheli, and Zachary E. Ross. EikoNet: Solving the Eikonal equation with deep neural networks, 2020.
[32] He Sun and Katherine L. Bouman. Deep probabilistic imaging: Uncertainty quantification and multi-modal solution characterization for computational imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2628–2637, 2021.
[33] Eran Treister and Eldad Haber. A fast marching algorithm for the factored Eikonal equation. Journal of Computational Physics, 324:210–225, Nov 2016.
[34] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[35] Mingyang Xie, Yu Sun, Jiaming Liu, Brendt Wohlberg, and Ulugbek S. Kamilov. Joint reconstruction and calibration using regularization by denoising, 2020.
[36] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Dive into Deep Learning, 2021.