# Robust Compressed Sensing using Generative Models

Ajil Jalal (ECE, UT Austin, ajiljalal@utexas.edu), Liu Liu (ECE, UT Austin, liuliu@utexas.edu), Alexandros G. Dimakis (ECE, UT Austin, dimakis@austin.utexas.edu), Constantine Caramanis (ECE, UT Austin, constantine@utexas.edu)

Link to our code: https://github.com/ajiljalal/csgm-robust-neurips

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract: The goal of compressed sensing is to estimate a high dimensional vector from an underdetermined system of noisy linear equations. In analogy to classical compressed sensing, here we assume a generative model as a prior, that is, we assume the vector is represented by a deep generative model $G : \mathbb{R}^k \to \mathbb{R}^n$. Classical recovery approaches such as empirical risk minimization (ERM) are guaranteed to succeed when the measurement matrix is sub-Gaussian. However, when the measurement matrix and measurements are heavy-tailed or have outliers, recovery may fail dramatically. In this paper we propose an algorithm inspired by the Median-of-Means (MOM). Our algorithm guarantees recovery for heavy-tailed data, even in the presence of outliers. Theoretically, our results show that our novel MOM-based algorithm enjoys the same sample complexity guarantees as ERM under sub-Gaussian assumptions. Our experiments validate both aspects of our claims: other algorithms are indeed fragile and fail under heavy-tailed and/or corrupted data, while our approach exhibits the predicted robustness.

## 1 Introduction

Compressive or compressed sensing is the problem of reconstructing an unknown vector $x^* \in \mathbb{R}^n$ after observing $m < n$ linear measurements of its entries, possibly with added noise:
$$y = Ax^* + \eta,$$
where $A \in \mathbb{R}^{m \times n}$ is called the measurement matrix and $\eta \in \mathbb{R}^m$ is noise. Even without noise, this is an underdetermined system of linear equations, so recovery is impossible without a structural assumption on the unknown vector $x^*$. The vast literature [84, 37, 72, 9, 18, 27, 2, 86, 11] on this subject typically assumes that the unknown vector is natural, or simple, in some application-dependent way. Compressed sensing has been studied for a wide variety of structures such as sparse vectors [19], trees [20], graphs [90], manifolds [21, 89], or deep generative models [15].

In this paper, we concentrate on deep generative models, which were explored by [15] as priors for sample-efficient reconstruction. Theoretical results in [15] showed that if $x^*$ lies close to the range of a generative model $G : \mathbb{R}^k \to \mathbb{R}^n$ with $d$ layers, a variant of ERM can recover $x^*$ with $m = O(kd \log n)$ measurements. Empirically, [15] shows that generative models require 5-10x fewer measurements to obtain the same reconstruction accuracy as Lasso. This impressive empirical performance has motivated significant recent research to better understand the behaviour and theoretical limits of compressed sensing using generative priors [36, 50, 62].

A key technical condition for recovery is the Set-Restricted Eigenvalue Condition (S-REC) [15], which is a generalization of the Restricted Eigenvalue Condition [14, 17] in sparse recovery. This condition is satisfied if $A$ is a sub-Gaussian matrix and the measurements satisfy $y = Ax^* + \eta$. This leads to the question: can the conditions on $A$ be weakened, and can we allow for outliers in $y$ and $A$? This has significance in applications such as MRI and astronomical imaging, where data is often very noisy and requires significant pruning/cleansing.
As we show in this paper, the analysis and algorithm proposed by [15] are quite fragile in the presence of heavy-tailed noise or corruptions in the measurements. In the statistics literature, it is well known that algorithms such as empirical risk minimization (ERM) and its variants are not robust to even a single outlier. Since the algorithm in [15] is a variant of ERM, it is susceptible to the same failures in the presence of heavy tails and outliers. Indeed, as we show empirically in Section 6, precisely this occurs. Importantly, recovery failure in the setting of [15] (which is also the focus of this paper) can be pernicious, precisely because generative models (by design) output images in their range space, and for well-designed models, these have high perceptual quality. In contrast, when a classical algorithm like LASSO [84] fails, the typical failure mode is the output of a non-sparse vector. Thus, in the context of generative models, resilience to outliers and heavy tails is especially critical. This motivates the need for algorithms that do not require strong assumptions on the measurements.

In this paper, we propose an algorithm for compressed sensing using generative models which is robust to heavy-tailed distributions and arbitrary outliers. We study its theoretical recovery guarantees as well as its empirical performance, and show that it succeeds in scenarios where other existing recovery procedures fail, without additional cost in sample complexity or computation.

### 1.1 Contributions

We propose a new reconstruction algorithm in place of ERM. Our algorithm uses a Median-of-Means (MOM) loss to provide robustness to heavy tails and arbitrary corruptions. As the S-REC may no longer hold, we necessarily use a different analytical approach. We prove recovery results and sample complexity guarantees for this setting even though previous assumptions such as the S-REC condition [15] do not hold. Specifically, our main contributions are as follows.

(Algorithm) We consider robust compressed sensing for generative models where (i) a constant fraction of the measurements and measurement matrix are arbitrarily (perhaps maliciously) corrupted and (ii) the random ensemble only satisfies a weak moment assumption. We propose a novel algorithm to replace ERM. Our algorithm uses a median-of-means (MOM) tournament [65, 54], i.e., a min-max optimization framework for robust reconstruction. Each iteration of our MOM-based algorithm comes at essentially no additional computational cost compared to an iteration of standard ERM. Moreover, as our code shows, it is straightforward to implement.

(Analysis and Guarantees) We analyze the recovery guarantee and outlier-robustness of our algorithm when the generative model is a $d$-layer neural network using ReLU activations. Specifically, in the presence of a constant fraction of outliers in $y$ and $A$, we achieve $\|G(\hat z) - G(z^*)\|^2 \le O(\sigma^2 + \tau)$ with sample size $m = O(kd \log n)$, where $\sigma^2$ is the variance of the heavy-tailed noise and $\tau$ is the optimization accuracy. Using different analytical tools (necessarily, since we do not assume sub-Gaussianity), we show that our algorithm, even under heavy tails and corruptions, has the same sample complexity as the previous literature has achieved under much stronger sub-Gaussian assumptions.
En route to our result, we also prove an interesting result for ERM: by avoiding the S-REC-based analysis, we show that the standard ERM algorithm does in fact succeed in the presence of a heavy-tailed measurement matrix, thereby strengthening the best-known recovery guarantees from [15]. This does not extend (as our empirical results demonstrate) to the setting of outliers, or of heavy-tailed measurement noise. For these settings, our new algorithm is required.

(Empirical Support) We empirically validate the effectiveness of our robust recovery algorithm on MNIST and CelebA-HQ. Our results demonstrate that (as our theory predicts) our algorithm succeeds in the presence of heavy-tailed noise, heavy-tailed measurements, and also in the presence of arbitrary outliers. At the same time, our experiments confirm that ERM can fail, and in fact fails dramatically: through an experiment on the CelebA-HQ data set, we demonstrate that the ERM recovery approach [15], as well as other natural approaches including ℓ1 loss minimization and trimmed loss minimization [81], can recover images that have little resemblance to the original.

### 1.2 Related work

Compressed sensing with outliers or heavy tails has a long history. To deal with outliers only in $y$, classical techniques replace the ERM with a robust loss function such as the ℓ1 loss or Huber loss [58, 74, 64, 24], and obtain the optimal statistical rates. Much less is known for outliers in $y$ and $A$ for robust compressed sensing. Recent progress on robust sparse regression [22, 10, 23, 26, 78, 60, 59, 81] can handle outliers in $y$ and $A$, but their techniques cannot be directly extended to arbitrary generative models $G$. Another line of research [43, 70, 65, 54] considers compressed sensing where the measurement matrix $A$ and $y$ have heavy-tailed distributions. Their techniques leverage variants of Median-of-Means (MOM) estimators on the loss function under weak moment assumptions instead of sub-Gaussianity, which generalize the classical MOM mean estimator in one dimension [73, 48, 3, 70]. [88] deals with compressed sensing of generative models when the measurements and the responses are non-Gaussian. However, the distribution model in [88] requires more stringent conditions compared to the weak moment assumption that will be specified in Definition 1, and their algorithm cannot tolerate arbitrary corruptions.

Generative priors have shown great promise in compressed sensing and other inverse problems, starting with [15], who generalized the theoretical framework of compressive sensing and restricted eigenvalue conditions [84, 27, 14, 17, 41, 13, 12, 28] to signals lying on the range of a deep generative model [33, 53]. Results in [50, 62, 47] established that the sample complexities in [15] are order optimal. The approach in [15] has been generalized to tackle different inverse problems [35, 8, 6, 71, 7, 79, 8, 61, 5, 46, 34, 4]. Alternate algorithms for reconstruction include [16, 25, 49, 30, 29, 82, 66, 25, 77, 38, 39]. The complexity of optimization algorithms using generative models has been analyzed in [32, 40, 57, 36]. See [75] for a more detailed survey on deep learning techniques for compressed sensing. A related line of work has explored learning-based approaches to tackle classical problems in algorithms and signal processing [1, 45, 69, 42].

## 2 Notation

For functions $f(n)$ and $g(n)$, we write $f(n) \lesssim g(n)$ to denote that there exists a universal constant $c_1 > 0$ such that $f(n) \le c_1 g(n)$.
Similarly, we write $f(n) \gtrsim g(n)$ to denote that there exists a universal constant $c_2 > 0$ such that $f(n) \ge c_2 g(n)$. We write $f(n) = O(g(n))$ to imply that there exist a positive constant $c_3$ and a natural number $n_0$ such that for all $n \ge n_0$ we have $|f(n)| \le c_3 g(n)$. Similarly, we write $f(n) = \Omega(g(n))$ to imply that there exist a positive constant $c_4$ and a natural number $n_1$ such that for all $n \ge n_1$ we have $|f(n)| \ge c_4 g(n)$.

## 3 Problem formulation

Let $x^* = G(z^*) \in \mathbb{R}^n$ be the fixed vector of interest. The deep generative model $G : \mathbb{R}^k \to \mathbb{R}^n$ ($k \ll n$) maps from a low dimensional latent space to a higher dimensional space. In this paper, $G$ is a feedforward neural network with ReLU activations and $d$ layers.

Our definition of heavy-tailed samples assumes that the measurement matrix $A$ only has a bounded fourth moment. Our corruption model is Huber's $\epsilon$-contamination model [44]. This model allows corruption in the measurement matrix $A$ and the measurements $y$. Precisely, these are:

Definition 1 (Heavy-tailed samples). We say that a random vector $a$ is heavy-tailed if, for a universal constant $C > 0$, the 4th moment of $a$ satisfies
$$\mathbb{E}\,\langle a, u \rangle^4 \le C \left( \mathbb{E}\,\langle a, u \rangle^2 \right)^2 \quad \text{for all } u \text{ with } \|u\|_2 = 1.$$
For all $\delta > 0$, the $(4+\delta)$th moments of $a$ need not exist, and we make no assumptions on them.

Definition 2 ($\epsilon$-corrupted samples). We say that a collection of samples $\{y_i, a_i\}$ is $\epsilon$-corrupted if they are i.i.d. observations drawn from the mixture
$$\{y_i, a_i\} \sim (1-\epsilon) P + \epsilon Q,$$
where $P$ is the uncorrupted distribution and $Q$ is an arbitrary distribution.

Thus we assume that the samples $\{y_i, a_i\}_{i=1}^m$ are generated from $(1-\epsilon)P + \epsilon Q$, where $Q$ is an adversary, and $P$ satisfies the following:

Assumption 1. Samples $(y_i, a_i) \sim P$ satisfy $y_i = a_i^\top G(z^*) + \eta_i$, where the random vector $a_i$ is isotropic and heavy-tailed as in Definition 1, and the noise term $\eta_i$ is independent of $a_i$, i.i.d. with zero mean and bounded variance $\sigma^2$.

## 4 Our Algorithm

Throughout, $\|\cdot\|$ refers to the $\ell_2$ norm unless specified otherwise. The procedure proposed by [15] finds a reconstruction $\hat x = G(\hat z)$, where $\hat z$ solves
$$\hat z := \arg\min_{z \in \mathbb{R}^k} \|AG(z) - y\|^2.$$
This is essentially an ERM-based approach. As is well known from the classical statistics literature, ERM's success relies on strong concentration properties, guaranteed, e.g., if the data are all sub-Gaussian. ERM may fail, however, in the presence of corruption or heavy tails. Indeed, our experiments demonstrate that in the presence of outliers in $y$ or $A$, or heavy-tailed noise in $y$, [15] fails to recover $G(z^*)$.

Remark. Unlike typical problems in M-estimation and high dimensional statistics, the optimization problem that defines the recovery procedure here is non-convex, and thus in the worst case, computationally intractable. Interestingly, despite non-convexity, as demonstrated in [15], (some appropriate version of) gradient descent is empirically very successful. In this paper, we take this as a computational primitive, thus sidestepping the challenge of proving whether a gradient-descent based method can efficiently provide guaranteed inversion of a generative model. Our theoretical guarantees are therefore statistical, but our experiments show empirically excellent performance.

### 4.1 MOM objective

It is well known that the median-of-means estimator achieves nearly sub-Gaussian concentration for one dimensional mean estimation of variance-bounded random variables [73, 48, 3]; a minimal sketch of this one-dimensional estimator is given below.
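To make this concrete, here is a minimal NumPy sketch (not part of our released code) of the one-dimensional estimator: split the samples into equal batches, average within each batch, and report the median of the batch means. The number of batches and the Student's t parameters are illustrative choices.

```python
import numpy as np

def median_of_means(samples, num_batches):
    """Classical 1-D median-of-means: batch the samples, average within each
    batch, and return the median of the batch means."""
    batch_size = len(samples) // num_batches
    trimmed = samples[: batch_size * num_batches]   # drop the remainder so batches are equal-sized
    batch_means = trimmed.reshape(num_batches, batch_size).mean(axis=1)
    return np.median(batch_means)

rng = np.random.default_rng(0)
# Student's t with 2.5 degrees of freedom: zero mean, finite variance, heavy tails.
samples = rng.standard_t(df=2.5, size=10_000)

print("empirical mean :", samples.mean())                              # can be dragged off by a few extreme samples
print("median of means:", median_of_means(samples, num_batches=20))    # concentrates near the true mean 0
```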
Inspired by the median-of-means algorithm, we propose the following algorithm to handle heavy tails and outliers in $y$ and $A$. We partition the set $[m]$ into $M$ disjoint batches $\{B_1, B_2, \dots, B_M\}$ such that each batch has cardinality $b = m/M$. Without loss of generality, we assume that $M$ exactly divides $m$, so that $b$ is an integer. For the $j$th batch $B_j$, define the function
$$\ell_j(z) := \frac{1}{b}\,\big\|A_{B_j} G(z) - y_{B_j}\big\|^2, \qquad (1)$$
where $A_{B_j} \in \mathbb{R}^{b \times n}$ denotes the submatrix of $A$ corresponding to the rows in batch $B_j$. Similarly, $y_{B_j} \in \mathbb{R}^b$ denotes the entries of $y$ corresponding to the batch $B_j$. Our workhorse is a novel variant of the median-of-means (MOM) tournament procedure [65, 54] using the loss function in eq. (1):
$$\hat z = \arg\min_{z \in \mathbb{R}^k}\ \max_{z' \in \mathbb{R}^k}\ \operatorname{median}_{1 \le j \le M}\big(\ell_j(z) - \ell_j(z')\big). \qquad (2)$$
We do not assume that the minimizer is unique, since we only require a reconstruction $G(\hat z)$ which is close to $G(z^*)$. Any value of $z$ in the set of minimizers will suffice.

The intuition behind this aggregation of batches is that if the inner player $z'$ chooses a point close to $z^*$, then the outer player $z$ must also choose a point close to $z^*$ in order to minimize the objective. Once this happens, there is no better option for $z'$. Hence a neighborhood around $z^*$ is almost an equilibrium, and in fact there can be no neighborhood far from $z^*$ with such an equilibrium.

Computational considerations. The objective function in eq. (2) is not convex, and we use Algorithm 1 as a heuristic to solve eq. (2). In Section 6, we empirically observe that gradient-based methods are able to minimize this objective and have good convergence properties. Our main theorem guarantees that a small value of the objective implies a good reconstruction, and hence we can certify reconstruction quality using the obtained final value of the objective.

Algorithm 1: Robust compressed sensing of generative models
1: Input: data samples $\{y_j, a_j\}_{j=1}^m$.
2: Output: $G(\hat z)$.
3: Parameters: number of batches $M$.
4: Initialize $z$ and $z'$.
5: for $t = 0$ to $T-1$ do
6: For each batch $j \in [M]$, calculate $\ell_j(z) - \ell_j(z')$ by eq. (1).
7: Pick the batch with the median loss $\operatorname{median}_{1 \le j \le M}\big(\ell_j(z) - \ell_j(z')\big)$, and evaluate the gradients for $z$ and $z'$ using backpropagation on that batch: (i) perform gradient descent for $z$; (ii) perform gradient ascent for $z'$.
8: end for
9: Output $G(\hat z) = G(z)$.
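For illustration, the following is a minimal PyTorch-style sketch of this descent-ascent loop, not our released implementation. It assumes a frozen generator `G` whose output is a flattened $n$-dimensional vector; the use of Adam, the learning rate, the number of steps, and the random initialization are illustrative choices.

```python
import torch

def mom_tournament(G, A, y, k, num_batches, steps=2000, lr=0.05):
    """Gradient descent-ascent on the MOM tournament objective (eq. 2).

    G: frozen generator mapping a k-dim latent vector to a flattened n-dim signal.
    A: (m, n) measurement matrix, y: (m,) measurements.
    """
    m = y.shape[0]
    batches = torch.arange(m).chunk(num_batches)      # disjoint batches B_1, ..., B_M
    z = torch.randn(k, requires_grad=True)            # outer player (descent)
    z_adv = torch.randn(k, requires_grad=True)        # inner player (ascent)
    opt_min = torch.optim.Adam([z], lr=lr)
    opt_max = torch.optim.Adam([z_adv], lr=lr)

    for _ in range(steps):
        r = A @ G(z) - y                              # residuals of the outer player
        r_adv = A @ G(z_adv) - y                      # residuals of the inner player
        # per-batch difference of losses, ell_j(z) - ell_j(z'), with ell_j as in eq. (1)
        diffs = torch.stack([r[b].pow(2).mean() - r_adv[b].pow(2).mean() for b in batches])

        # keep only the batch attaining the (upper) median and backpropagate through it
        loss = diffs[diffs.argsort()[len(diffs) // 2]]

        opt_min.zero_grad()
        opt_max.zero_grad()
        loss.backward()
        opt_min.step()                                # gradient descent step on z
        z_adv.grad.neg_()                             # flip the sign of the gradient so that ...
        opt_max.step()                                # ... this step is gradient ascent on z'
    return G(z).detach()
```

Dropping the inner player $z'$ and the ascent step recovers direct minimization of the median batch loss, which is the MOM minimization variant given in eq. (6) of Section 6.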
## 5 Theoretical results

We begin with a brief review of the Restricted Eigenvalue Condition in standard compressed sensing and show that the S-REC is satisfied by heavy-tailed distributions.

### 5.1 Set-Restricted Eigenvalue Condition for heavy-tailed distributions

Most theoretical guarantees for compressed sensing rely on variants of the Restricted Eigenvalue Condition (REC) [14, 17], and the closest to our setting is the Set-Restricted Eigenvalue Condition (S-REC) [15]. Formally, $A \in \mathbb{R}^{m \times n}$ satisfies S-REC$(S, \gamma, \delta)$ on a set $S \subseteq \mathbb{R}^n$ if for all $x_1, x_2 \in S$,
$$\|Ax_1 - Ax_2\| \ge \gamma\,\|x_1 - x_2\| - \delta.$$
While we can prove many powerful results using the REC condition, proving that a matrix satisfies REC typically involves sub-Gaussian entries in $A$. If we do not have sub-Gaussianity, proving REC requires a finer analysis. A recent technique called the small-ball method [67] requires significantly weaker assumptions on $A$, and can be used to show REC [67, 85] for $A$ satisfying Assumption 1. While this technique can be used for sparse vectors, we do not have a general understanding of what structures it can handle, since existing proofs make heavy use of sparsity.

We now show that a random matrix whose rows satisfy Assumption 1 will satisfy the S-REC over the range of a generator $G : \mathbb{R}^k \to \mathbb{R}^n$ with high probability. This generalizes Lemma 4.2 in [15]: the original lemma required i.i.d. sub-Gaussian entries in the matrix $A$, whereas the following lemma only needs the rows to have bounded fourth moments.

Lemma 5.1. Let $G : \mathbb{R}^k \to \mathbb{R}^n$ be a $d$-layered neural network with ReLU activations. Let $A \in \mathbb{R}^{m \times n}$ be a matrix with i.i.d. rows satisfying Definition 1. For any $\gamma < 1$, if $m = \Omega\!\left(\frac{1}{1-\gamma^2}\, kd \log n\right)$, then with probability $1 - e^{-\Omega(m)}$, for all $z_1, z_2 \in \mathbb{R}^k$ we have
$$\frac{1}{m}\,\|AG(z_1) - AG(z_2)\|^2 \ge \gamma^2\, \|G(z_1) - G(z_2)\|^2.$$

This implies that the ERM approach of [15] still works when we only have a heavy-tailed measurement matrix $A$. However, as we show in our experiments, heavy-tailed noise in $y$ and outliers in $y, A$ will make ERM fail catastrophically. In order to solve this problem, we leverage the median-of-means tournament defined in eq. (2), and we will now show that it is robust to heavy tails and outliers in $y, A$.

### 5.2 Main results

We now present our main result. Theorem 5.5 provides recovery guarantees in terms of the error in reconstruction in the presence of heavy tails and outliers, where $\hat z$ is the (approximate) minimizer of eq. (2). First we show that the minimum value of the objective in eq. (2) is indeed small if there are no outliers.

Lemma 5.2. Let $M$ denote the number of batches. Assume that the measurements $y$ and measurement matrix $A$ are drawn from the uncorrupted distribution satisfying Assumption 1. Then with probability $1 - e^{-\Omega(M)}$, the objective in eq. (2) satisfies
$$\min_{z \in \mathbb{R}^k}\ \max_{z' \in \mathbb{R}^k}\ \operatorname{median}_{1 \le j \le M}\big(\ell_j(z) - \ell_j(z')\big) \le 4\sigma^2. \qquad (3)$$

We now introduce Lemma 5.3 and Lemma 5.4, which control two stochastic processes that appear in eq. (2). We show that minimizing the objective in eq. (2) implies that you are close to the unknown vector $G(z^*)$. Notice that since $z^*$ is one feasible solution of the inner maximization over $z'$, we can consider $z' = z^*$. Now consider the difference of squared losses in eq. (2), which is given by
$$\ell_j(\hat z) - \ell_j(z^*) = \frac{1}{b}\,\|A_{B_j} G(\hat z) - y_{B_j}\|^2 - \frac{1}{b}\,\|A_{B_j} G(z^*) - y_{B_j}\|^2 = \frac{1}{b}\,\|A_{B_j}(G(\hat z) - G(z^*))\|^2 - \frac{2}{b}\,\eta_{B_j}^\top \big(A_{B_j}(G(\hat z) - G(z^*))\big),$$
where the last equality uses $y_{B_j} = A_{B_j} G(z^*) + \eta_{B_j}$ and an elementary arithmetic manipulation. Assume we have the following bounds on a majority of batches:
$$\frac{1}{b}\,\|A_{B_j}(G(\hat z) - G(z^*))\|^2 \gtrsim \|G(\hat z) - G(z^*)\|^2, \qquad (4)$$
$$\frac{1}{b}\,\eta_{B_j}^\top \big(A_{B_j}(G(\hat z) - G(z^*))\big) \lesssim \|G(\hat z) - G(z^*)\|. \qquad (5)$$
Since the objective is the median over batches of the combination of the above terms, a small value of the objective implies that $\|G(\hat z) - G(z^*)\|$ is small. We formally show these bounds in Lemma 5.3 and Lemma 5.4.

Lemma 5.3. Let $G : \mathbb{R}^k \to \mathbb{R}^n$ be a generative model given by a $d$-layer neural network using ReLU activations. Let $A \in \mathbb{R}^{m \times n}$ be a matrix with i.i.d. uncorrupted rows satisfying Definition 1. Let the batch size be $b = \Theta(C^4)$, let the number of batches satisfy $M = \Omega(kd \log n)$, and let $\gamma$ be a constant which depends on the moment constant $C$. Then with probability at least $1 - e^{-\Omega(m)}$, for all $z_1, z_2 \in \mathbb{R}^k$ there exists a set $J \subseteq [M]$ of cardinality at least $0.9M$ such that
$$\frac{1}{b}\,\|A_{B_j}(G(z_1) - G(z_2))\|^2 \ge \gamma^2\, \|G(z_1) - G(z_2)\|^2, \qquad \forall\, j \in J.$$

Lemma 5.4. Consider the setting of Lemma 5.3 with measurements satisfying $y = AG(z^*) + \eta$, where $y, A, \eta$ satisfy Assumption 1 with noise variance $\sigma^2$. For a constant batch size $b$ and number of batches $M = \Omega(kd \log n)$, with probability at least $1 - e^{-\Omega(m)}$, for all $z \in \mathbb{R}^k$ there exists a set $J \subseteq [M]$ of cardinality at least $0.9M$ such that
$$\frac{1}{b}\,\big|\eta_{B_j}^\top A_{B_j}(G(z) - G(z^*))\big| \lesssim \sigma\, \|G(z) - G(z^*)\|, \qquad \forall\, j \in J.$$
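As an informal numerical illustration of the "majority of good batches" phenomenon behind Lemma 5.3 (for a single fixed direction only; the content of the lemma is uniformity over the range of $G$), the sketch below draws Student's t measurement rows, fixes a direction $v$ standing in for $G(z_1) - G(z_2)$, and reports the fraction of constant-size batches whose energy $\frac{1}{b}\|A_{B_j} v\|^2$ clears a constant threshold. The batch size, degrees of freedom, and the 0.5 threshold are illustrative, not the constants of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, M = 512, 20, 200                       # ambient dimension, batch size, number of batches
m = b * M                                    # total number of measurements

df = 3.0                                     # heavy-tailed: only low-order moments are controlled
A = rng.standard_t(df, size=(m, n)) / np.sqrt(df / (df - 2))   # rescale entries to unit variance (isotropic rows)

v = rng.normal(size=n)                       # stands in for G(z1) - G(z2)
v /= np.linalg.norm(v)

per_batch = ((A @ v).reshape(M, b) ** 2).mean(axis=1)           # (1/b) * ||A_{B_j} v||^2 for each batch
print("fraction of good batches:", np.mean(per_batch >= 0.5))   # typically well above 0.9
```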
The above lemmas do not account for the $\epsilon$-corrupted samples in Definition 2. However, since the batch size is constant in both lemmas, there exists a value of $\epsilon$ such that sufficiently many batches have no corruptions. Hence we can apply Lemma 5.3 and Lemma 5.4 to these uncorrupted batches. Using these lemmas with a constant batch size $b$, we obtain Theorem 5.5. We defer its proof to Appendix E.

Theorem 5.5. Let $G : \mathbb{R}^k \to \mathbb{R}^n$ be a generative model given by a $d$-layer neural network using ReLU activations. There exists a (sufficiently small) constant fraction $\epsilon$, which depends on the moment constant $C$ in Definition 1, such that the following is true. We observe $m = O(kd \log n)$ $\epsilon$-corrupted samples from Definition 2, under Assumption 1. For any $z^* \in \mathbb{R}^k$, let $\hat z$ minimize the objective function given by eq. (2) to within an additive $\tau$ of the optimum. Then there exists a (sufficiently large) constant $c$ such that, with probability at least $1 - e^{-\Omega(m)}$, the reconstruction $G(\hat z)$ satisfies
$$\|G(\hat z) - G(z^*)\|^2 \le c\,(\sigma^2 + \tau),$$
where $\sigma^2$ is the variance of the noise under Assumption 1.

We briefly discuss the implications of Theorem 5.5 with regard to sample complexity and error in reconstruction.

Sample Complexity. Our sample complexity matches that of [15] up to constant factors. This shows that the minimizer of eq. (2) in the presence of heavy tails and outliers provides the same guarantees as in the case of ERM with sub-Gaussian measurements.

Statistical accuracy and robustness. Let us analyze the error terms in our theorem. The term $\tau$ is a consequence of the minimization algorithm not being perfect, since it only reaches within $\tau$ of the true minimum; hence it cannot be avoided. The term $\sigma^2$ is due to the noise in the measurements. In the main result of [15], the reconstruction $G(\hat z)$ has error bounded by $\|G(\hat z) - G(z^*)\|^2 \lesssim \|\eta\|^2/m + \tau$. (In [15] the bound is stated as $\|\eta\|^2$, but our $A$ has a different scaling, and hence the correct bound in our setting is $\|\eta\|^2/m$.) This gives the following comparison. If $\eta$ is sub-Gaussian with variance $\sigma^2$, then $\|\eta\|^2/m \lesssim \sigma^2$ with high probability, and hence our bounds match up to constants. If higher order moments of $\eta$ do not exist, an application of Chebyshev's inequality says that with probability $1-\delta$, [15] has $\|G(z^*) - G(\hat z)\|^2 \lesssim \sigma^2/(m\delta)$, and this can be extremely large if we want $\delta = e^{-\Omega(m)}$. Hence our method is clearly superior if $\eta$ only has bounded variance, and if $\eta$ is sub-Gaussian, then our bounds match up to constants. In the presence of corruptions, [15] has no provable guarantee.

Figure 1: We compare Algorithm 1 with the baseline ERM [15] under the heavy-tailed setting without arbitrary outliers. We fix $k = 100$ for the MNIST dataset and $k = 512$ for the CelebA-HQ dataset. We vary the number of measurements and plot the reconstruction error per pixel averaged over multiple trials. With an increasing number of measurements, we observe that the reconstruction error decreases. For heavy-tailed $y$ and $A$ without arbitrary outliers, our method obtains significantly smaller reconstruction error in comparison to ERM. (a) Results on MNIST. (b) Results on CelebA-HQ.

## 6 Experiments

Figure 2: Plot of the reconstruction error versus the iteration number (left) and plot of the reconstruction error versus wall-clock time (right). ERM [15] and ℓ1 minimization fail to converge. Our two proposed methods, MOM tournament (blue) and MOM minimization (red), have the smallest reconstruction error. We provide a theoretical analysis for the MOM tournament algorithm, and observe that direct minimization of the MOM objective also works in practice. The computation time of our algorithms is nearly the same as the baselines.
In this section, we study the empirical performance of our algorithm on generative models trained on real image datasets. We show that we can reconstruct images under heavy-tailed samples and arbitrary outliers. For additional experiments and experimental setup details, see Appendix F.

Heavy-Tailed Samples. In this experiment, we deal with the uncorrupted compressed sensing model $P$, which has a heavy-tailed measurement matrix and stochastic noise: $y = AG(z^*) + \eta$. We use a Student's t distribution (a typical example of heavy tails) for $A$ and $\eta$. We compare Algorithm 1 with the baseline ERM [15] for heavy-tailed data without arbitrary corruptions on MNIST [55] and CelebA-HQ [51, 63]. We trained a DCGAN [80] with $k = 100$ and $d = 5$ layers to produce $64 \times 64$ MNIST images. For CelebA-HQ, we used a PG-GAN [51] with $k = 512$ to produce images of size $256 \times 256 \times 3 = 196{,}608$. We vary the number of measurements $m$ and obtain the reconstruction error $\|G(\hat z) - G(z^*)\|^2/n$ for Algorithm 1 and ERM, where $G(z^*)$ is the ground truth image. In Figure 1, Algorithm 1 and ERM both have decreasing reconstruction error per pixel with an increasing number of measurements. To conclude, even for heavy-tailed noise without arbitrary outliers, Algorithm 1 obtains significantly smaller reconstruction error when compared to ERM.

Arbitrary corruptions. In this experiment, we use the same heavy-tailed samples as above, and we add an $\epsilon = 0.02$ fraction of arbitrary corruptions. We set the outliers of the measurement matrix $A$ to be a random sign matrix, and the outliers of $y$ are fixed to be 1. We note that we do not use any targeted attack to simulate the outliers. We perform our experiments on the CelebA-HQ dataset using a PG-GAN of latent dimension $k = 512$, and fix the number of measurements to $m = 1000$.

We compare our algorithm to a number of natural baselines. Our first baseline is ERM [15], which is not designed to deal with outliers. While its fragility is interesting to note, in this sense it is not unexpected. For outliers in $y$, classical robust methods replace the loss function by an ℓ1 loss function or Huber loss function. This is done in order to avoid the squared loss, which makes recovery algorithms very sensitive to outliers. In this case, we have
$$\hat z := \arg\min_{z \in \mathbb{R}^k} \|y - AG(z)\|_1.$$
We also investigate the performance of trimmed loss minimization, a recent algorithm proposed by [81]. This algorithm picks the $t$ fraction of samples with smallest empirical loss for each update step, where $t$ is a hyper-parameter.

We run Algorithm 1 and its variant, MOM minimization. MOM minimization directly minimizes
$$\hat z = \arg\min_{z \in \mathbb{R}^k}\ \operatorname{median}_{1 \le j \le M}\, \ell_j(z), \qquad (6)$$
and we use gradient-based methods similar to Algorithm 1 to solve it. Since Algorithm 1 optimizes both $z$ and $z'$ in each iteration, the actual computation time of the MOM tournament is twice that of MOM minimization.

As shown in Figure 2 and Figure 3, ERM [15] and ℓ1 loss minimization fail to converge to the ground truth and, in particular, they may recover a completely different person. Trimmed loss minimization [81] only succeeds on occasion, and when it fails, it obtains a visibly different person. The convergence of MOM minimization per iteration is very similar to that of the MOM tournament, and they both achieve much smaller reconstruction error compared to trimmed loss minimization. The right panel of Figure 2 plots the reconstruction error versus the actual computation time, showing that our algorithms match the baselines. We plot the MSE versus the number of measurements in Figure 4b, where the fraction of corruptions is set to $\epsilon = 0.02$.

Figure 3: Reconstruction results on CelebA-HQ for $m = 1000$ measurements with 20 corrupted measurements. For each row, the first column is the ground truth from a generative model. Subsequent columns show reconstructions by ERM [15], ℓ1 minimization, and trimmed loss minimization [81]. In particular, vanilla ERM and ℓ1 minimization obtain completely different faces. Since we use the same outlier for different rows, vanilla ERM produces the same reconstruction irrespective of the ground truth. Trimmed loss minimization only succeeds on occasion (the last row), and when it fails, it obtains a similar but still different face. The last two columns show reconstructions by our proposed algorithms: the second-to-last column directly minimizes the MOM objective in eq. (6), and the last column minimizes the MOM tournament objective in eq. (2). We provide a theoretical analysis for the MOM tournament algorithm, and observe that direct minimization of the MOM objective also works in practice. We observe that the last two columns have much better reconstruction performance: we get a high quality reconstruction under heavy-tailed measurements and arbitrary outliers.
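For concreteness, the sketch below generates measurements in the style of this corruption experiment: Student's t entries for $A$ and $\eta$, an $\epsilon$-fraction of rows of $A$ replaced by random signs, and the corresponding entries of $y$ set to 1. The degrees of freedom, noise scale, and entry normalization are illustrative assumptions rather than the exact settings of our experiments.

```python
import numpy as np

def corrupted_heavy_tailed_measurements(x_true, m, eps=0.02, df=3.0, noise_scale=0.1, seed=0):
    """y = A @ x_true + eta with Student's t A and eta, then an eps-fraction of
    (rows of A, entries of y) replaced by outliers: random signs in A, y set to 1."""
    rng = np.random.default_rng(seed)
    n = x_true.shape[0]

    t_std = np.sqrt(df / (df - 2))                    # std of a Student's t variable (df > 2)
    A = rng.standard_t(df, size=(m, n)) / t_std       # unit-variance entries, i.e. isotropic rows
    eta = noise_scale * rng.standard_t(df, size=m) / t_std
    y = A @ x_true + eta

    num_bad = int(eps * m)                            # arbitrary, untargeted corruptions
    bad = rng.choice(m, size=num_bad, replace=False)
    A[bad] = rng.choice([-1.0, 1.0], size=(num_bad, n))
    y[bad] = 1.0
    return A, y
```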
Miscellaneous Experiments.

Is ERM ever better than MOM? So far we have analyzed cases where MOM performs better than ERM. Since ERM is known to be optimal in linear regression when dealing with uncorrupted sub-Gaussian data, we expect it to be superior to MOM when our measurements are all sub-Gaussian. We evaluate this in Figure 4a and observe that ERM obtains smaller MSE in this setting. Notice that as we reduce the number of batches in MOM, it approaches ERM.

How sensitive is MOM to the number of batches? In Figure 4c we study the MSE of the MOM tournament and MOM minimization as we vary the number of batches. In order to select the optimal number of batches $M$, we keep a set of validation measurements that we do not use in the optimization routines for estimating $x^*$. We can run MOM for different values of $M$ to get multiple reconstructions, and then evaluate each reconstruction using the validation measurements to pick the best reconstruction. Note that one should use the median-of-means loss while evaluating the validation error as well; a minimal sketch of this selection rule is given below, after Figure 4.

Figure 4: (a) We compare ERM and MOM by plotting MSE versus the number of measurements when the measurements are sub-Gaussian without corruptions. (b) Aggregate statistics for MOM in the presence of corruptions, plotting MSE versus the number of measurements. (c) MSE versus the number of batches for MOM on 1000 heavy-tailed measurements and 20 corruptions. All error bars indicate 95% confidence intervals. Plots use a PG-GAN on CelebA-HQ.
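The following sketch of the selection rule assumes a held-out set of measurements `(A_val, y_val)` and a hypothetical helper `reconstruct(M)` that runs one of the MOM procedures above with `M` batches; the candidate values of $M$ are those shown in Figure 4c, and the number of validation batches is an illustrative choice.

```python
import numpy as np

def mom_validation_loss(x_hat, A_val, y_val, num_batches=20):
    """Median over validation batches of the mean squared residual."""
    b = y_val.shape[0] // num_batches
    r = (A_val[: b * num_batches] @ x_hat - y_val[: b * num_batches]) ** 2
    return np.median(r.reshape(num_batches, b).mean(axis=1))

def select_num_batches(reconstruct, A_val, y_val, candidates=(40, 51, 68, 102, 340)):
    """Run the MOM recovery once per candidate M and keep the reconstruction
    with the smallest median-of-means validation loss."""
    scored = []
    for M in candidates:
        x_hat = reconstruct(M)        # hypothetical helper: wraps a MOM recovery with M batches
        scored.append((mom_validation_loss(x_hat, A_val, y_val), M, x_hat))
    best_loss, best_M, best_x_hat = min(scored, key=lambda t: t[0])
    return best_x_hat, best_M
```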
## 7 Conclusion

The phenomenon observed in Figure 3 highlights the importance of our method. Our work raises several questions about why the objective we consider can be minimized, and suggests that we need a new paradigm for analysis that accounts for similar instances that enjoy empirical success, even though they can be provably hard in the worst case.

## 8 Acknowledgments

Ajil Jalal and Alex Dimakis have been supported by NSF Grants CCF 1763702, 1934932, AF 1901292, 2008710, 2019844, research gifts by NVIDIA, Western Digital, WNCG IAP, computing resources from TACC, and the Archie Straiton Fellowship. Constantine Caramanis and Liu Liu have been supported by NSF Award 1704778 and a grant from the Army Futures Command.

## 9 Broader Impact

Sparsity has played an important role across many areas of statistics, engineering and computer science, as a regularizing prior that captures important structure in many applications. Recent work has illustrated that, given enough data, deep generative models are poised to play a revolutionary role as a modern, data-driven replacement for sparsity. Much work remains to bring this agenda to fruition, but we believe that, as a variety of recent works have suggested, this direction can revolutionize imaging in a number of different important domains, not least of all medical imaging.

This work addresses the robustness, and hence the trustworthiness and reliability, of GAN-inversion-based techniques. As mentioned, this is especially critical, since high quality GANs will always produce perceptually high quality images, hence recovery failures may not be readily detectable by inspection. Still, many significant issues remain that this work does not address. This includes understanding when and how sufficiently powerful and expressive GANs can be trained, since the scope of high quality GANs still appears to be limited. Another important consideration includes the core computational issue: the GAN inversion problem, which this work also faces, is intractable in the worst case, yet in practice appears to not pose a significant challenge. Understanding this dichotomy is very important.

## References

[1] Anders Aamand, Piotr Indyk, and Ali Vakilian. (Learned) frequency estimation algorithms under Zipfian distribution. arXiv preprint arXiv:1908.05198, 2019.
[2] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Statist., 40(5):2452-2482, 2012.
[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.
[4] Rushil Anirudh, Jayaraman J Thiagarajan, Bhavya Kailkhura, and Timo Bremer. MimicGAN: Robust projection onto image manifolds with corruption mimicking. arXiv preprint arXiv:1912.07748, 2019.
[5] Muhammad Asim, Ali Ahmed, and Paul Hand. Invertible generative models for inverse problems: mitigating representation error and dataset bias. arXiv preprint arXiv:1905.11672, 2019.
[6] Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Blind image deconvolution using deep generative priors. arXiv preprint arXiv:1802.04073, 2018.
[7] Muhammad Asim, Fahad Shamshad, and Ali Ahmed. Solving bilinear inverse problems using deep generative priors. CoRR, abs/1802.04073, 3(4):8, 2018.
[8] Benjamin Aubin, Bruno Loureiro, Antoine Baker, Florent Krzakala, and Lenka Zdeborová. Exact asymptotics for phase retrieval and compressed sensing with random generative priors. arXiv preprint arXiv:1912.02008, 2019.
[9] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.
[10] Sivaraman Balakrishnan, Simon S. Du, Jerry Li, and Aarti Singh. Computationally efficient robust sparse estimation in high dimensions. In Proceedings of the 2017 Conference on Learning Theory, 2017.
[11] Richard G Baraniuk. Compressive sensing [lecture notes]. IEEE Signal Processing Magazine, 24(4):118-121, 2007.
[12] Richard G Baraniuk, Volkan Cevher, Marco F Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982-2001, 2010.
[13] Richard G Baraniuk and Michael B Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51-77, 2009.
[14] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
[15] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 537-546. JMLR.org, 2017.
[16] Ashish Bora, Eric Price, and Alexandros G Dimakis. AmbientGAN: Generative models from lossy measurements. ICLR, 2:5, 2018.
[17] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589-592, 2008.
[18] Emmanuel J Candès, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
[19] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207-1223, 2006.
[20] Chen Chen and Junzhou Huang. Compressive sensing MRI with wavelet tree sparsity. In Advances in Neural Information Processing Systems, pages 1115-1123, 2012.
[21] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Transactions on Signal Processing, 58(12):6140-6155, 2010.
[22] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774-782, 2013.
[23] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1-25, 2017.
[24] Arnak Dalalyan and Philip Thompson. Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator. In Advances in Neural Information Processing Systems, pages 13188-13198, 2019.
[25] Manik Dhar, Aditya Grover, and Stefano Ermon. Modeling sparse deviations for compressed sensing using generative models. arXiv preprint arXiv:1807.01442, 2018.
[26] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
[27] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[28] Yonina C Eldar and Moshe Mishali. Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory, 55(11):5302-5316, 2009.
[29] Alyson K Fletcher, Parthe Pandit, Sundeep Rangan, Subrata Sarkar, and Philip Schniter. Plug-in estimation in high-dimensional linear inverse problems: A rigorous analysis. In Advances in Neural Information Processing Systems, pages 7440-7449, 2018.
[30] Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high dimensions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1884-1888. IEEE, 2018.
[31] Evarist Giné and Joel Zinn. Some limit theorems for empirical processes. The Annals of Probability, pages 929-989, 1984.
[32] Fabian Latorre Gómez, Armin Eftekhari, and Volkan Cevher. Fast and provable ADMM for learning with generative priors. arXiv preprint arXiv:1907.03343, 2019.
[33] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[34] Paul Hand and Babhru Joshi. Global guarantees for blind demodulation with generative priors. In Advances in Neural Information Processing Systems, pages 11531-11541, 2019.
[35] Paul Hand, Oscar Leong, and Vlad Voroninski. Phase retrieval under a generative prior. In Advances in Neural Information Processing Systems, pages 9136-9146, 2018.
[36] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv preprint arXiv:1705.07576, 2017.
[37] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
[38] Reinhard Heckel and Paul Hand. Deep decoder: Concise image representations from untrained non-convolutional networks. arXiv preprint arXiv:1810.03982, 2018.
[39] Reinhard Heckel and Mahdi Soltanolkotabi. Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation. arXiv preprint arXiv:2005.03991, 2020.
[40] Chinmay Hegde. Algorithmic aspects of inverse problems using generative models. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 166-172. IEEE, 2018.
[41] Chinmay Hegde, Michael Wakin, and Richard G Baraniuk. Random projections for manifold learning. In Advances in Neural Information Processing Systems, pages 641-648, 2008.
[42] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. 2018.
[43] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543-582, 2016.
[44] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73-101, 1964.
[45] Piotr Indyk, Ali Vakilian, and Yang Yuan. Learning-based low-rank approximations. In Advances in Neural Information Processing Systems, pages 7400-7410, 2019.
[46] Gauri Jagatap and Chinmay Hegde. Phase retrieval using untrained neural network priors. 2019.
[47] Shirin Jalali and Xin Yuan. Solving linear inverse problems using generative models. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 512-516. IEEE, 2019.
[48] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169-188, 1986.
[49] Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative adversarial networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[50] Akshay Kamath, Sushrut Karmalkar, and Eric Price. Lower bounds for compressed sensing with generative models. arXiv preprint arXiv:1912.02938, 2019.
[51] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[52] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[53] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[54] Guillaume Lecué and Matthieu Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv preprint arXiv:1711.10306, 2017.
[55] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[56] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
[57] Qi Lei, Ajil Jalal, Inderjit S Dhillon, and Alexandros G Dimakis. Inverting deep generative models, one layer at a time. In Advances in Neural Information Processing Systems, pages 13910-13919, 2019.
[58] Xiaodong Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73-99, 2013.
[59] Liu Liu, Tianyang Li, and Constantine Caramanis. High dimensional robust M-estimation: Arbitrary corruption and heavy tails. arXiv preprint arXiv:1901.08237, 2019.
[60] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.
[61] Zhaoqiang Liu, Selwyn Gomes, Avtansh Tiwari, and Jonathan Scarlett. Sample complexity bounds for 1-bit compressive sensing and binary stable embeddings with generative priors. arXiv preprint arXiv:2002.01697, 2020.
[62] Zhaoqiang Liu and Jonathan Scarlett. Information-theoretic lower bounds for compressive sensing with generative models. arXiv preprint arXiv:1908.10744, 2019.
[63] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August, 15:2018, 2018.
[64] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866-896, 2017.
[65] Gabor Lugosi and Shahar Mendelson. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.
[66] Morteza Mardani, Enhao Gong, Joseph Y Cheng, Shreyas S Vasanawala, Greg Zaharchuk, Lei Xing, and John M Pauly. Deep generative adversarial neural networks for compressive sensing MRI. IEEE Transactions on Medical Imaging, 38(1):167-179, 2018.
[67] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25-39, 2014.
[68] Shahar Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3-4):641-674, 2017.
[69] Chris Metzler, Ali Mousavi, and Richard Baraniuk. Learned D-AMP: Principled neural network based compressive image recovery. In Advances in Neural Information Processing Systems, pages 1772-1783, 2017.
[70] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
[71] Lukas Mosser, Olivier Dubrule, and Martin J Blunt. Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. Mathematical Geosciences, 52(1):53-79, 2020.
[72] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.
[73] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
[74] Nam H Nguyen and Trac D Tran. Exact recoverability from dense corrupted observations via ℓ1-minimization. IEEE Transactions on Information Theory, 59(4):2017-2035, 2013.
[75] Gregory Ongie, Ajil Jalal, Christopher A Metzler, Richard G Baraniuk, Alexandros G Dimakis, and Rebecca Willett. Deep learning techniques for inverse problems in imaging. arXiv preprint arXiv:2005.06001, 2020.
[76] REAC Paley and Antoni Zygmund. A note on analytic functions in the unit circle. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 28, pages 266-272. Cambridge University Press, 1932.
[77] Parthe Pandit, Mojtaba Sahraee-Ardakan, Sundeep Rangan, Philip Schniter, and Alyson K Fletcher. Inference with deep generative priors in high dimensions. arXiv preprint arXiv:1911.03409, 2019.
[78] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
[79] Shuang Qiu, Xiaohan Wei, and Zhuoran Yang. Robust one-bit recovery via ReLU generative networks: Improved statistical rates and global landscape analysis. arXiv preprint arXiv:1908.05368, 2019.
[80] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[81] Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. arXiv preprint arXiv:1810.11874, 2018.
[82] Ganlin Song, Zhou Fan, and John Lafferty. Surfing: Iterative optimization over incrementally trained deep networks. In Advances in Neural Information Processing Systems, pages 15008-15017, 2019.
[83] Michel Talagrand. A new isoperimetric inequality and the concentration of measure phenomenon. In Geometric Aspects of Functional Analysis, pages 94-124. Springer, 1991.
[84] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[85] Joel A Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance, pages 67-101. Springer, 2015.
[86] Joel A Tropp and Anna C Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655-4666, 2007.
[87] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
[88] Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang. On the statistical rate of nonlinear recovery in generative models with heavy-tailed data. In International Conference on Machine Learning, pages 6697-6706, 2019.
[89] Weiyu Xu and Babak Hassibi. Compressed sensing over the Grassmann manifold: A unified analytical framework. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 562-567. IEEE, 2008.
[90] Weiyu Xu, Enrique Mallada, and Ao Tang. Compressive sensing over graphs. In 2011 Proceedings IEEE INFOCOM, pages 2087-2095. IEEE, 2011.
[91] Jian Zhang and Ioannis Mitliagkas. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.