# Uncertainty Modeling in Generative Compressed Sensing

Yilang Zhang*1, Mengchu Xu*1, Xiaojun Mao2, Jian Wang1

**Abstract.** Compressed sensing (CS) aims to recover a high-dimensional signal with structural priors from its low-dimensional linear measurements. Inspired by the huge success of deep neural networks in modeling the priors of natural signals, generative neural networks have recently been used to replace the hand-crafted structural priors in CS. However, the reconstruction capability of the generative model is fundamentally limited by the range of its generator, which is typically a small subset of the signal space of interest. To break this bottleneck and thus reconstruct out-of-range signals, this paper presents a novel method called CS-BGM that can effectively expand the range of the generator. Specifically, CS-BGM introduces uncertainties to the latent variable and the parameters of the generator, and adopts variational inference (VI) and maximum a posteriori (MAP) estimation to infer them. Theoretical analysis demonstrates that expanding the range of the generator is necessary for reducing the reconstruction error in generative CS. Extensive experiments show a consistent improvement of CS-BGM over the baselines.

*Equal contribution. 1School of Data Science, Fudan University, Shanghai, China. 2School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Xiaojun Mao and Jian Wang. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Compressed sensing (CS) is a paradigm that recovers an unknown signal $x \in \mathbb{R}^n$ from a small number of linear measurements

$$y = Ax + n, \qquad (1)$$

where $A \in \mathbb{R}^{m \times n}$ ($m \ll n$) is the measurement or sensing matrix, and $n \in \mathbb{R}^m$ is the additive white Gaussian noise (AWGN). Since the data acquisition of CS allows for a sampling rate far below the Nyquist rate, it provides an efficient approach to data encoding and compression. In particular, a broad range of real-world applications fall into the field of CS. Examples include magnetic resonance imaging (MRI) (Lustig et al., 2007), computed tomography (CT) (Chen et al., 2008), wireless channel estimation (Haupt et al., 2010), super-resolution (Fang et al., 2016), and the single-pixel camera (Duarte et al., 2008).

Generally speaking, to reconstruct $x$ from the ill-posed linear system (1), additional information is needed to ensure a unique solution. A widely adopted approach is to exploit the sparsity prior of natural signals in some transform domains. For example, natural sounds and images are often observed to be sparse under the Fourier and wavelet transforms, respectively. Under the sparsity assumption, traditional CS seeks the sparsest signal $x$ that fits the measurements $y$:

$$\min_x \|x\|_0, \quad \text{s.t. } y = Ax + n, \qquad (2)$$

where $\|x\|_0 := |\{x_i \mid x_i \neq 0\}|$ is the $\ell_0$-norm (number of non-zero elements) of $x$. It has been shown that if the sensing matrix $A$ satisfies certain conditions such as the restricted isometry property (RIP) or the restricted eigenvalue condition (REC), CS guarantees robust recovery of $k$-sparse signals (Candès & Tao, 2005). However, searching for the sparsest solution to a linear system is known to be an NP-hard problem (Natarajan, 1995), for which no polynomial-time algorithm has been developed yet.
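To make the measurement model (1) and the role of sparsity concrete, here is a minimal, self-contained Python sketch (not from the paper) that simulates a $k$-sparse signal, takes noisy Gaussian measurements, and recovers it with the convex $\ell_1$ relaxation of (2) that is discussed next; the problem sizes and the regularization weight are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso  # l1-relaxed surrogate of the l0 problem (2)

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                       # signal dim, measurements, sparsity (toy values)

# k-sparse ground-truth signal x
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

# Gaussian sensing matrix with i.i.d. N(0, 1/m) entries, as used later in the paper
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x + 0.01 * rng.standard_normal(m)  # measurements with small AWGN

# l1 relaxation: min_w ||Aw - y||^2 / (2m) + alpha * ||w||_1
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("per-entry l2 error:", np.linalg.norm(x - x_hat) / n)
```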
In order to obtain a solution within feasible time, therefore, existing CS algorithms often employ greedy search principles or make relaxations to problem (2). Representative examples include matching pursuit (MP)-type algorithms (Tropp & Gilbert, 2007; Chen et al., 1989; Needell & Tropp, 2009) and optimization-based algorithms (Chen et al., 2001; Candès & Tao, 2005). Although traditional CS algorithms have achieved great success in many applications, most natural signals are not strictly sparse in the given transform domains, so relying on sparsity as the sole prior for reconstruction can lead to inaccurate results. Fortunately, natural signals often possess a variety of features besides sparsity (Kim et al., 2020). Hence, an alternative for CS reconstruction is to combine sparsity with additional refined priors on the signal, such as the low-rank assumption (Fazel et al., 2008), total variation (Candes et al., 2006), and dictionary models (Shen et al., 2015). Despite their progress in both practical applications and theoretical guarantees, these hand-crafted priors can only cover a limited range of natural signals and are hard to generalize to signals from various sources.

Figure 1. Comparative illustration between Lasso, CSGM, Sparse-Gen and the proposed CS-BGM method.

To address this generalization issue, Bora et al. (2017) proposed to perform CS using generative models (CSGM). In a nutshell, CSGM uses a generator, which is a pre-trained unsupervised generative model (e.g., a variational autoencoder (VAE) (Kingma & Welling, 2014) or a generative adversarial network (GAN) (Goodfellow et al., 2014)), as the distributional prior of the training data. Parameterized by $\theta$, the generator $g$ that maps a low-dimensional latent variable $z \in \mathbb{R}^k$ to a high-dimensional signal $g(z; \theta) \in \mathbb{R}^n$ ($k \ll n$) is trained to output signals resembling those in the training set. The pre-trained generator models the conditional distribution $q(x \mid z, \theta) = \delta_D[x - g(z; \theta)]$, where $\delta_D[\cdot]$ is the Dirac delta function, and $z$ is typically set as standard Gaussian $\mathcal{N}(z; 0, I)$. Hence, $q(x) := \iint q(x \mid z, \theta)\, p(\theta)\, p(z)\, d\theta\, dz$ is called a generative prior. At test time, CSGM fixes $\theta$ to the point estimate $\hat{\theta}$ obtained from training, and optimizes the latent variable $z$ to produce an estimate $\hat{x} = g(\hat{z}; \hat{\theta})$ that has the minimum fitting error to the test measurements, i.e.,

$$\hat{z} = \arg\min_z \|A g(z; \hat{\theta}) - y\|_2^2. \qquad (3)$$

This optimization can be performed via gradient descent (GD) or other search methods. Empirically, using generative models leads to a substantial gain in reconstruction accuracy over conventional CS algorithms like Lasso (Candès & Tao, 2005), especially when the number of measurements is small (Whang et al., 2021). Moreover, theoretical analyses have been established for CSGM; see (Hand & Voroninski, 2018; Kamath et al., 2020; Liu & Scarlett, 2020).

Following CSGM, plenty of studies have recently been developed to leverage the power of generative models in CS. On the one hand, much effort has been made to accelerate the optimization process in the test phase. Shah & Hegde (2018) proposed to use projected gradient descent (PGD) in the signal space to speed up convergence, for which provable guarantees were derived. Raj et al. (2019) showed that a network-based PGD (NPGD) is empirically faster. Wu et al. (2019) utilized meta-learning to learn the initialization of $\theta$ as well as the measurement matrix $A$, so that the test optimization can be completed within fewer GD iterations.
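As a concrete illustration of the CSGM reconstruction step (3), the sketch below optimizes the latent variable $z$ by gradient descent for a fixed, pre-trained generator. It is not from the paper: the tiny MLP generator is only a stand-in for a real VAE/GAN decoder, and all dimensions and hyperparameters are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
k, n, m = 20, 784, 100                      # latent dim, signal dim, measurements (toy values)

# Stand-in for a pre-trained generator g(.; theta_hat); in practice this is a VAE/GAN decoder.
g = torch.nn.Sequential(
    torch.nn.Linear(k, 500), torch.nn.ReLU(),
    torch.nn.Linear(500, n), torch.nn.Sigmoid(),
)
for p in g.parameters():
    p.requires_grad_(False)                 # CSGM keeps theta fixed at its point estimate

A = torch.randn(m, n) / m ** 0.5            # Gaussian sensing matrix, N(0, 1/m) entries
x_true = g(torch.randn(k)).detach()         # an in-range test signal for this toy example
y = A @ x_true + 0.01 * torch.randn(m)

# Solve z_hat = argmin_z ||A g(z) - y||_2^2 with Adam (plain GD also works).
z = torch.randn(k, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = ((A @ g(z) - y) ** 2).sum()
    loss.backward()
    opt.step()

x_hat = g(z).detach()
print("measurement error:", ((A @ x_hat - y) ** 2).sum().item())
print("reconstruction error:", ((x_hat - x_true) ** 2).sum().item())
```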
Asim et al. (2020) used a flow-based invertible model in which $z$ can be computed inversely from $x$. On the other hand, it is reported in (Bora et al., 2017) that CSGM often presents inferior performance when the test signals lie outside the range $\mathcal{R}(g(z; \theta)) := \{g(z; \theta) \mid z \in \mathbb{R}^k\}$ of the generator. Research has been done to ameliorate the reconstruction quality for these out-of-range signals. In (Dhar et al., 2018), an extra sparse deviation vector was introduced to expand the range of the generator (a method called Sparse-Gen). This approach has been extended by Yang et al. (2021) using the non-convex $\ell_q$-norm. Furthermore, it has been shown in (Kabkab et al., 2018; Kim et al., 2020) that better reconstruction performance can be achieved by training generative models in supervised fashions (i.e., with access to $y$). Due to the use of $y$, however, the supervised modifications only work with a fixed measurement matrix $A$; as a result, retraining is required when $A$ changes. Besides, it was shown in (Daras et al., 2021) that optimizing the high-dimensional intermediate layer (in addition to the low-dimensional latent variable) helps improve the expressiveness of the generator. This framework of CSGM has also been applied to MRI, where posterior sampling via Langevin dynamics was employed to generate higher-quality reconstructions (Jalal et al., 2021).

In this paper, in order to effectively reconstruct those out-of-range signals, we propose to expand the range of the generator in CSGM by modeling uncertainties in the network parameters $\theta$. Indeed, when the generative models are trained using empirical loss minimization, the uncertainties emerge naturally because the finite training data are imperfect estimations of the signal space of interest. To perform reconstruction (i.e., inference of $x$), we use variational inference and maximum a posteriori (MAP) estimation for the low-dimensional latent variable $z$ and the high-dimensional parameters $\theta$, respectively. By doing so, the range of the generator is adjusted adaptively to embrace those out-of-range test signals. We thus refer to our method as CS with Bayesian generative model (CS-BGM). A comparative illustration between Lasso, CSGM, Sparse-Gen and the proposed method is given in Fig. 1. Our work will justify the necessity of adjusting the range of the generative model for fundamentally improving the reconstruction quality. Also, numerical experiments will show the superiority of the proposed method over the baselines. Similar to (Bora et al., 2017), we only consider the case where the noise $n$ is assumed Gaussian. Nevertheless, we mention that the proposed method can be readily generalized to non-Gaussian cases by applying techniques in, e.g., (Jalal et al., 2020; Whang et al., 2020).

## 2. Preliminaries

Before proceeding to introduce the proposed CS-BGM, we review several important notions and definitions. The following two properties constrain the eigenvalues of the measurement matrix and are commonly adopted to establish recovery guarantees for sparsity-based CS algorithms.

**Definition 2.1 (REC).** A matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the restricted eigenvalue condition (REC) of order $K$, provided there exists some constant $\gamma > 0$ such that

$$\|Ax\|_2^2 \geq \gamma \|x\|_2^2 \qquad (4)$$

holds for all $K$-sparse vectors $x \in \mathbb{R}^n$.

**Definition 2.2 (RIP).** A matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the restricted isometry property (RIP) of order $K$, provided there exists some constant $\delta \in (0, 1)$ such that

$$(1 - \delta) \|x\|_2^2 \leq \|Ax\|_2^2 \leq (1 + \delta) \|x\|_2^2 \qquad (5)$$

holds for all $K$-sparse vectors $x \in \mathbb{R}^n$.
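To see Definition 2.2 in action, the short sketch below empirically checks how well a random Gaussian matrix with $\mathcal{N}(0, \frac{1}{m})$ entries preserves the squared norms of $K$-sparse vectors, i.e., how small the empirical $\delta$ in (5) is over random draws. This is an illustrative experiment over sampled sparse vectors, not a proof of the RIP, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, trials = 400, 120, 10, 2000

A = rng.standard_normal((m, n)) / np.sqrt(m)    # i.i.d. N(0, 1/m) entries

ratios = []
for _ in range(trials):
    x = np.zeros(n)
    idx = rng.choice(n, size=K, replace=False)  # random K-sparse support
    x[idx] = rng.standard_normal(K)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))

ratios = np.array(ratios)
# Empirical delta such that (1 - delta) <= ||Ax||^2 / ||x||^2 <= (1 + delta) over the trials
print("empirical delta:", max(1 - ratios.min(), ratios.max() - 1))
```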
The minimum $\delta$ satisfying (5) is called the restricted isometry constant (RIC) and is denoted as $\delta_K(A)$, or $\delta_K$ for notational simplicity. While the REC requires that all the (restricted) eigenvalues of $A^\top A$ be greater than $\gamma$, the RIP further restricts them to lie inside the interval $[1 - \delta, 1 + \delta]$. From another perspective, $\gamma_K$ measures the distance between the space of all $K$-sparse vectors and the null space of $A$, and $\delta_K$ is an indicator of how well the Euclidean norms of $K$-sparse vectors are preserved under the transform $A$ (Dhar et al., 2018). It is well known that many random matrices satisfy the RIP and REC with high probability when the number $m$ of measurements scales almost linearly with $K$. For example, it has been shown in (Candès & Tao, 2005; Baraniuk et al., 2008) that a random Gaussian matrix $A$ whose entries are drawn i.i.d. from $\mathcal{N}(A_{ij}; 0, \frac{1}{m})$ obeys the RIP with order-$K$ RIC $\delta_K$ with overwhelming probability if $m = O\!\big(\frac{K}{\delta_K^2}\log\frac{n}{K}\big)$.

To generalize the recovery guarantees from the sparsity-based prior to the generative prior, Bora et al. (2017) introduced the set-restricted eigenvalue condition (S-REC).

**Definition 2.3 (Bora et al. 2017, S-REC).** Let $S \subseteq \mathbb{R}^n$ be a set. For some parameters $\gamma > 0$ and $\epsilon > 0$, a matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the set-restricted eigenvalue condition S-REC$(S, \gamma, \epsilon)$, provided that

$$\|A(x_1 - x_2)\|_2 \geq \gamma \|x_1 - x_2\|_2 - \epsilon, \quad \forall x_1, x_2 \in S. \qquad (6)$$

Compared to the REC, which is defined for $K$-sparse vectors, the S-REC can be applied to non-sparse vectors by choosing the set $S$ properly. Thus, it can be used with generative models by setting $S$ as the range of the generator. Besides, there is an extra slack variable $\epsilon$ for achieving this generalization. Similar to the REC, the S-REC can also be satisfied by common random matrices with overwhelming probability when $m$ scales almost linearly with $k$, leading to reconstruction guarantees for the generative model; see (Bora et al., 2017).

## 3. The CS-BGM Method

In this section, we explain how CS-BGM expands the range of the generator of CSGM to reconstruct the out-of-range signals. The key idea is to consider the uncertainties of the model parameters. We will also theoretically justify the necessity of expanding the range of the generator.

### 3.1. A Bayesian View of Generative CS

We start with a Bayesian view of generative CS, which makes it easier to increase the flexibility of the generative model. The training stage of CS-BGM is similar to that of CSGM. Specifically, the generator is trained such that the generative prior $q(x)$ resembles the data distribution $p(x)$ of the training signals. For example, it has been shown for the GAN (Goodfellow et al., 2014) that

$$\hat{\theta} = \arg\min_\theta \mathrm{JSD}\big(q(x \mid \theta; X_{tr}) \,\|\, p(x; X_{tr})\big), \qquad (7)$$

where JSD is the Jensen-Shannon divergence, and $X_{tr}$ is the matrix collecting all the training signals. In fact, $\hat{\theta}$ is a point estimate $p(\theta; X_{tr}) \approx p(\theta; \hat{\theta}) = \delta_D[\theta - \hat{\theta}]$ from the Bayesian view. As a result, the generative prior $q(x; X_{tr}) = \iint q(x \mid z, \theta)\, p(\theta; X_{tr})\, p(z)\, d\theta\, dz \approx q(x; \hat{\theta})$ serves as a good approximation of the training signal distribution $p(x; X_{tr})$. In testing, after observing the test measurements $y$, we infer the conditional expectation of the signal through

$$\mathbb{E}[x \mid y; X_{tr}] = \int x\, q(x \mid y; X_{tr})\, dx = \iiint x\, q(x \mid z, \theta)\, p(z, \theta \mid y; X_{tr})\, dz\, d\theta\, dx = \mathbb{E}_{p(z, \theta \mid y; X_{tr})}[g(z; \theta)], \qquad (8)$$

where the last equality follows from $q(x \mid z, \theta) = \delta_D[x - g(z; \theta)]$. Thus, what remains is to infer the posterior

$$p(z, \theta \mid y; X_{tr}) \propto p(y \mid z, \theta)\, p(\theta; X_{tr})\, p(z), \qquad (9)$$

where we have used the Bayes rule.
For CSGM, the test prior (i.e., the training posterior) for $\theta$ is taken to be the point estimate $p(\theta; X_{tr}) \approx \delta_D[\theta - \hat{\theta}]$, and $z$ is taken to be its MAP estimate:

$$\hat{z} = \arg\max_z p(z, \theta \mid y; X_{tr}) \approx \arg\max_z p(z, \hat{\theta} \mid y; X_{tr}) = \arg\min_z \big[-\log p(y \mid z; \hat{\theta}) - \log p(z)\big] = \arg\min_z \|A g(z; \hat{\theta}) - y\|_2^2 + \lambda_z \|z\|_2^2, \qquad (10)$$

where $\lambda_z$ is the relative weight for the prior of $z$. As a result, the conditional expectation (8) in this case is given by $\mathbb{E}[x \mid y; X_{tr}] \approx \mathbb{E}_{p(z, \theta \mid y; X_{tr})}[g(z; \theta)] \approx g(\hat{z}; \hat{\theta})$. Notice that there is an extra term $\lambda_z \|z\|_2^2$ compared with (3). In fact, precisely due to this extra term, an improvement on the performance of CSGM has been reported in (Bora et al., 2017), which matches well with our analysis.

However, since the training set contains only finite samples from the signal space, directly using the point estimates $\hat{\theta}$ and $\hat{z}$ ignores the uncertainties in the test prior $p(\theta; X_{tr})$ and the posterior $p(z, \theta \mid y; X_{tr})$. In particular, since the parameter $\hat{\theta}$ is deterministic after training, it results in a fixed range $\mathcal{R}(g(z; \hat{\theta}))$ of the generator, leading to inferior reconstruction quality for test signals outside the range. Different from CSGM, therefore, we take these uncertainties into consideration in the following. However, due to the high dimensionality of $\theta$ and the non-linearity of the neural network, it is difficult to directly infer, or even approximately infer, $p(\theta; X_{tr})$ during training. Nevertheless, considering that the training and test signals are sampled from the same signal space, their distributions should be empirically close to each other. Hence, with a mild modification to $\hat{\theta}$, we can hopefully find a new generator whose range can well cover the test signals. Specifically, we equip $\theta$ with a multivariate Gaussian prior $p(\theta; X_{tr}) = \mathcal{N}(\theta; \hat{\theta}, \lambda I)$, where $\lambda$ is a hyperparameter determined by the discrepancy between training and testing.

To infer the joint posterior of $z$ and $\theta$, which are dependent on each other (cf. (9)), we propose to solve for them with alternating optimization. Moreover, we choose to infer $z$ and $\theta$ using variational inference (VI) and MAP estimation, respectively. To be more specific, we perform the following two processes iteratively:

i) fixing $p(\theta \mid y; X_{tr}) \approx \delta_D[\theta - \bar{\theta}]$ to be the MAP estimate, find a surrogate variational posterior $q(z \mid v)$ that minimizes the Kullback-Leibler (KL) divergence $\mathrm{KL}\big(q(z \mid v) \,\|\, p(z \mid y; \bar{\theta})\big)$, where $v$ is the variational parameter to be optimized. It is well known (see, e.g., (Blei et al., 2017)) that minimizing this KL divergence is equivalent to maximizing the evidence lower bound

$$\mathrm{ELBO} := \mathbb{E}_{q(z \mid v)}\big[\log p(y \mid z; \bar{\theta})\big] - \mathrm{KL}\big(q(z \mid v) \,\|\, p(z)\big); \qquad (11)$$

ii) fixing $q(z \mid v)$, find a MAP estimate $\bar{\theta}$ that maximizes the posterior $p(\theta \mid z, y; X_{tr}) \propto p(y \mid z, \theta)\, p(\theta; X_{tr})$, or equivalently, minimizes the loss

$$L(\theta, y) := \mathbb{E}_{q(z \mid v)}\big[\|A g(z; \theta) - y\|_2^2\big] + \lambda_\theta \|\theta - \hat{\theta}\|_2^2, \qquad (12)$$

where $\lambda_\theta$ is the relative weight trading off the data fidelity and the prior of $\theta$ during testing.

**Algorithm 1** Alternate optimization for CS-BGM
Input: Sensing matrix $A \in \mathbb{R}^{m \times n}$, alternation number $L$, iteration numbers $R_z$, $R_\theta$, learning rates $\eta_z$, $\eta_\theta$, pre-trained model parameters $\hat{\theta}$, measurements $y \in \mathbb{R}^m$.
Initialization: Initialize $v$ randomly, $\theta \leftarrow \hat{\theta}$.
repeat
  $v^{(0)} \leftarrow v$;
  for $r$ from 1 to $R_z$ do
    $v^{(r)} \leftarrow v^{(r-1)} + \eta_z \frac{\partial \mathrm{ELBO}}{\partial v^{(r-1)}}$ using MC sampling;
  end for
  $v \leftarrow v^{(R_z)}$;
  $\theta^{(0)} \leftarrow \theta$;
  for $r$ from 1 to $R_\theta$ do
    $\theta^{(r)} \leftarrow \theta^{(r-1)} - \eta_\theta \frac{\partial L(\theta^{(r-1)}, y)}{\partial \theta^{(r-1)}}$ using MC sampling;
  end for
  $\theta \leftarrow \theta^{(R_\theta)}$;
until done $L$ times
Output: Image estimate $\mathbb{E}_{q(z \mid v)}[g(z; \theta)]$.
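The following PyTorch sketch instantiates the two alternating steps above (and Alg. 1): a reparameterized Monte Carlo estimate of the ELBO (11) is maximized over the variational parameters $v = (\mu, \log\sigma)$, and the MAP loss (12) is minimized over $\theta$ with the Gaussian prior penalty. It is a simplified illustration, not the authors' released TensorFlow implementation; the function name, the noise variance, and the default hyperparameters are assumptions for the sketch.

```python
import copy
import torch

def cs_bgm(A, y, g_pretrained, k, n_mc=10, L_alt=1, Rz=2000, Rtheta=500,
           lr_v=1e-3, lr_theta=2e-3, lam_theta=0.1, noise_var=1e-2):
    """Alternating optimization of Alg. 1 (illustrative sketch)."""
    g = copy.deepcopy(g_pretrained)                        # theta, initialized at theta_hat
    theta_hat = [p.detach().clone() for p in g.parameters()]

    mu = torch.zeros(k, requires_grad=True)                # variational parameters v = (mu, log_sigma)
    log_sigma = torch.zeros(k, requires_grad=True)

    def sample_z():                                        # reparameterization trick
        return mu + torch.exp(log_sigma) * torch.randn(n_mc, k)

    for _ in range(L_alt):
        # Step i): VI for z -- maximize the ELBO (11) over v with theta fixed
        opt_v = torch.optim.Adam([mu, log_sigma], lr=lr_v)
        for _ in range(Rz):
            opt_v.zero_grad()
            z = sample_z()
            recon = ((A @ g(z).T - y[:, None]) ** 2).sum(0).mean() / (2 * noise_var)
            kl = 0.5 * (mu ** 2 + torch.exp(2 * log_sigma) - 2 * log_sigma - 1).sum()
            (recon + kl).backward()                        # negative ELBO
            opt_v.step()

        # Step ii): MAP for theta -- minimize the loss (12) with q(z|v) fixed
        opt_t = torch.optim.Adam(g.parameters(), lr=lr_theta)
        for _ in range(Rtheta):
            opt_t.zero_grad()
            z = sample_z().detach()
            fit = ((A @ g(z).T - y[:, None]) ** 2).sum(0).mean()
            prior = sum(((p - p0) ** 2).sum() for p, p0 in zip(g.parameters(), theta_hat))
            (fit + lam_theta * prior).backward()
            opt_t.step()

    with torch.no_grad():                                  # E_{q(z|v)}[g(z; theta)]
        return g(mu + torch.exp(log_sigma) * torch.randn(n_mc, k)).mean(0)
```

In practice one would pass the pre-trained VAE/GAN decoder as `g_pretrained`; the MC-free variant discussed below simply replaces Step i) with the MAP update (10) of $z$.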
It should be noted that the dimension of $\theta$ is much larger than that of $z$, so it would require a considerable number of Monte Carlo (MC) samples to perform VI on it. For practical consideration, we therefore choose MAP rather than VI. Indeed, since the reconstruction results are more sensitive to the generator parameters $\theta$ than to the latent variable $z$, the posterior of $\theta$ has a much smaller variance and thus can be well approximated by a MAP estimate. Finally, the signal estimated by CS-BGM is given by $\mathbb{E}[x \mid y; X_{tr}] \approx \mathbb{E}_{p(z, \theta \mid y; X_{tr})}[g(z; \theta)] \approx \mathbb{E}_{q(z \mid v)}[g(z; \theta)]$. The alternating optimization strategy is summarized in Alg. 1. Alternatively, one can use the MAP estimate (instead of VI) of $z$, i.e., (10), in the alternate optimization. Interestingly, this modification leads to a variant of the proposed CS-BGM without MC sampling, thus shortening the reconstruction time. Experiments in Sec. 4 suggest that this variant also achieves a satisfactory performance.

So far, we have explained how CS-BGM expands the range of the generator via uncertainty modeling to include the out-of-range signals. We mention that adjusting the parameters of generative models has also been suggested in CS with untrained neural networks (Ulyanov et al., 2018). However, the proposed method mainly differs in two aspects. Firstly, this work focuses on CS using random measurements, while untrained neural networks were used in inverse problems such as denoising, super-resolution, and inpainting, for which a rough estimation of the original signal can be obtained directly from the downsampled or perturbed measurements. Secondly, the training process of the generative model provides a strong prior for $\theta$ in the proposed method. With this prior term, we are only allowed to modify $\theta$ slightly; cf. (12). In CS with untrained neural networks, by contrast, there is no such prior, so $\theta$ is randomly initialized and can change significantly. In the next section, we will justify that expanding the range of the generative model is necessary for fundamentally reducing the reconstruction error.

### 3.2. Why Expand the Range of the Generator?

Although theoretical guarantees have been well developed for CSGM, empirical tests suggest that a large measurement error per pixel (i.e., $\frac{1}{n}\|A g(\hat{z}; \hat{\theta}) - y\|_2^2$) can still be observed on the test set even after optimizing over $z$ (Bora et al., 2017); see also Fig. 2 for such an example. In the context of CS, there are two major approaches to reducing the reconstruction error $\|g(\hat{z}; \hat{\theta}) - x\|_2^2$. i) One is to add additional priors on $x$ to refine the reconstruction results. As $\hat{z}$ is already the minimizer of the measurement error in CSGM, these priors help to find a $z'$ such that $\|A g(z'; \hat{\theta}) - y\|_2^2 \geq \|A g(\hat{z}; \hat{\theta}) - y\|_2^2$ but $\|g(z'; \hat{\theta}) - x\|_2^2 \leq \|g(\hat{z}; \hat{\theta}) - x\|_2^2$. ii) The other is to expand the range of the reconstruction model to cover more signals. Ablation tests by Bora et al. (2017) showed that the representative capability of the generative model is mainly limited by its range, although no theoretical evidence has been available. Our analysis is motivated by these two approaches. We will show theoretically that as long as the measurement error is large, the reconstruction error can never be significantly reduced, no matter what additional prior is imposed. This is because the reconstruction error is essentially lower bounded by its measurement error. Therefore, to fundamentally reduce the reconstruction error, breaking the range limit of the generative model and thus enhancing its representative capability is necessary.
Figure 2. An example of reconstructing an RGB face image of size 64 × 64 × 3, where CS-BGM can effectively reduce the measurement and reconstruction errors compared to CSGM.

To formalize our analysis, we first give a useful framework called the set-restricted isometry property (S-RIP).

**Definition 3.1 (S-RIP).** Let $S \subseteq \mathbb{R}^n$ be a set. For some parameters $\delta \in (0, 1)$ and $\epsilon > 0$, a matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the S-RIP$(S, \delta, \epsilon)$, provided that

$$\sqrt{1 - \delta}\, \|x_1 - x_2\|_2 - \epsilon \leq \|A(x_1 - x_2)\|_2 \leq \sqrt{1 + \delta}\, \|x_1 - x_2\|_2 + \epsilon, \quad \forall x_1, x_2 \in S.$$

While the definition of the S-RIP resembles that of the S-REC, it further requires the measurement distance $\|A(x_1 - x_2)\|_2$ to be upper bounded by the signal distance $\|x_1 - x_2\|_2$. Precisely this upper bound is of vital importance to our analysis. Next, we show that with high probability, the S-RIP can be satisfied by random Gaussian matrices.

**Lemma 3.2.** Let $g: \mathbb{R}^k \mapsto \mathbb{R}^n$ be $L$-Lipschitz. Let $B^k(r) = \{z \mid z \in \mathbb{R}^k, \|z\|_2 \leq r\}$ be an $\ell_2$-norm ball in $\mathbb{R}^k$, and denote by $g(B^k(r)) := \{g(z; \theta) \mid z \in B^k(r)\}$ the range of the generator restricted to inputs from the $\ell_2$-norm ball. For $\delta \in (0, 1)$, if

$$m = \Omega\!\left(\frac{k}{\delta^2} \log \frac{Lr}{\epsilon}\right), \qquad (13)$$

then a random matrix $A \in \mathbb{R}^{m \times n}$ with i.i.d. entries $\mathcal{N}(A_{ij}; 0, \frac{1}{m})$ satisfies the S-RIP$(g(B^k(r)), \delta, \epsilon)$ with probability exceeding $1 - e^{-\Omega(\delta^2 m)}$.

The proofs are left to the appendices. This lemma suggests that, with $m$ increasing almost linearly in $k$, the S-RIP can be satisfied with overwhelming probability on $g(B^k(r))$. It should be noted that although the Lipschitz constant $L$ of the generator varies for different network architectures, its effect is much weaker than that of $k$ due to the logarithm term. We also remark that in practice, $z$ could hardly take values corresponding to tiny prior probabilities, since otherwise the regularization terms in Eqs. (10) and (11) would become too large. Empirically, when those $z$ are given as inputs, the decoder outputs can be interpreted as "extrapolation", which are of inferior quality compared to "interpolation" ones. Therefore, $z$ is generally required to have a bounded norm $r$; see the rationale in (Bora et al., 2017). Then, with the aid of Lemma 3.2, we have the following theorem, which applies to the entire range of the generator.

**Theorem 3.3.** Let $g: \mathbb{R}^k \mapsto \mathbb{R}^n$ be a $d$-layer neural network, where each layer is composed of a linear transform and a point-wise non-linearity. Also, denote by $\mathcal{R}(g(z; \theta)) := \{g(z; \theta) \mid z \in \mathbb{R}^k\}$ the range of the generator. Suppose that there are at most $c$ nodes per layer, and all the non-linearities are piecewise linear with at most two pieces. Then, for $\delta \in (0, 1)$, if

$$m = \Omega\!\left(\frac{kd}{\delta^2} \log c\right), \qquad (14)$$

a random matrix $A \in \mathbb{R}^{m \times n}$ with i.i.d. entries $\mathcal{N}(A_{ij}; 0, \frac{1}{m})$ satisfies the S-RIP$(\mathcal{R}(g(z; \theta)), \delta, \epsilon)$ with probability exceeding $1 - e^{-\Omega(\delta^2 m)}$.

Finally, we show that a necessary condition for a generator to cover the test signal $x$ is that a small measurement error can be achieved by optimizing over $z$. Consider an ideal generator $g(z; \theta^*)$ such that $\min_z \|g(z; \theta^*) - x\|_2 = 0$, with minimizer denoted by $z^*$. Then, we know from Theorem 3.3 that if $m = \Omega\big(\frac{kd}{\delta^2} \log c\big)$, the Gaussian measurement matrix $A$ with i.i.d. entries $\mathcal{N}(A_{ij}; 0, \frac{1}{m})$ satisfies the S-RIP$(\mathcal{R}(g(z; \theta^*)), \delta, \epsilon)$ with probability $1 - e^{-\Omega(\delta^2 m)}$. Also, let $\hat{z} = \arg\min_z \|A g(z; \theta^*) - y\|_2^2$; then we have

$$\|A g(\hat{z}; \theta^*) - y\|_2 \leq \|A g(z^*; \theta^*) - y\|_2 = \|A g(z^*; \theta^*) - (Ax + n)\|_2 \leq \|A(g(z^*; \theta^*) - x)\|_2 + \|n\|_2 \leq \sqrt{1 + \delta}\, \|g(z^*; \theta^*) - x\|_2 + \epsilon + \|n\|_2 = \epsilon + \|n\|_2 \qquad (15)$$

with overwhelming probability.¹ In other words, the reconstruction error is essentially lower bounded by its measurement error, which completes the justification. We mention that our analysis also explains Fig. 2.
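To make this claim concrete, it may help to spell out (15) and its contrapositive; this is only a restatement of the argument above under the same high-probability S-RIP event, not an additional result. If the range of $g(\,\cdot\,; \theta)$ contains the test signal, i.e., $\min_z \|g(z; \theta) - x\|_2 = 0$, then (15) gives

$$\min_z \|A g(z; \theta) - y\|_2 \leq \epsilon + \|n\|_2,$$

and hence, in the contrapositive,

$$\min_z \|A g(z; \theta) - y\|_2 > \epsilon + \|n\|_2 \;\Longrightarrow\; x \notin \mathcal{R}(g(z; \theta)).$$

A persistently large measurement error after optimizing over $z$ therefore certifies that the test signal is out of range, and no additional prior on $z$ alone can remedy this without enlarging the range of the generator.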
In CSGM, the minimum measurement error is still large, suggesting that the generator cannot cover the test signal; hence the reconstruction error cannot be small. For CS-BGM, in contrast, the measurement error is adequately small, so that a small reconstruction error can possibly be achieved.

## 4. Experiments

In this section, we state the setup of the numerical experiments and evaluate the performance of our method. Comparisons among CS-BGM, Lasso, CSGM, PGD-GAN, and Sparse-Gen are presented to empirically appraise their recovery capabilities. All experiments are run using TensorFlow (Abadi et al., 2015) on one Intel(R) Xeon(R) Silver 4116 CPU and four GeForce RTX 2080 Ti GPUs. For convenient reproducibility, our codes are available at https://github.com/347325285/CS_BGM.

### 4.1. Datasets

We consider two datasets: i) the MNIST handwritten digit dataset (LeCun & Cortes, 2010) and ii) the CelebFaces Attributes (CelebA) dataset (Liu et al., 2015). For the MNIST dataset, each digit is a grayscale image of size 28 × 28, where the value of each pixel is 0 or 1. For the CelebA dataset, each RGB image is cropped to 64 × 64 × 3 for consistent use. In this case, the input of MNIST is 784-dimensional and sparse, whereas for CelebA, the input is as high as 12288-dimensional and dense.

### 4.2. Models and Hyperparameters

To ensure that the measurement matrix satisfies the S-RIP with high probability, we consider the random Gaussian measurement matrix with each entry drawn i.i.d. from $\mathcal{N}(A_{ij}; 0, \frac{1}{m})$. We use pre-trained VAE models as in (Bora et al., 2017) on the MNIST dataset. The VAE model is a fully connected neural network with the architecture 20-500-500-784, i.e., we generate a digit of size 28 × 28 from a standard normal vector $z \in \mathbb{R}^{20}$. For the experiments on the CelebA dataset, we use the pre-trained generative model as in (Bora et al., 2017). The latent representation space for $z$ is of dimension 100. For both models, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001 for $v$ and 0.002 for $\theta$, respectively. In the fast implementation named CS-BGM (w/o MC), which infers $z$ with MAP estimation like $\theta$, the learning rate for $z$ is set to 0.001. The regularization coefficient $\lambda_\theta$ is 0.1. To compare the performance of different methods, we choose 1024 digits from the MNIST test dataset and 64 images from CelebA, with a batch size of 64. For the implementations of the other comparative algorithms, we use the hyperparameters suggested by their authors.

¹ This does not hold with the S-REC, because the latter fails to lower bound the recovery error by the measurement error.

Figure 3. Comparative results among Lasso, CSGM, Sparse-Gen and CS-BGM on (a) the MNIST dataset and (b) the CelebA dataset. The vertical bars are 95% confidence intervals.

### 4.3. Optimization Strategy

Here, we detail the optimization strategy of the proposed CS-BGM algorithm. Following the common choice, we use a multivariate Gaussian distribution with a diagonal covariance matrix as the variational posterior $q(z \mid v)$, and adopt the reparameterization trick (Kingma & Welling, 2014) when performing VI. The number of MC samples for inference is set to 20 on the MNIST dataset and 10 on the CelebA dataset. In our experiments, we perform the optimization of $z$ for 2000 iterations and then $\theta$ for 500 iterations, with only one alternation. In the notation of Alg. 1, this corresponds to $L = 1$, $R_z = 2000$, and $R_\theta = 500$. In the fast implementation, we use $L = 1$, $R_z = 500$, and $R_\theta = 200$.
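For completeness, the MC-free variant CS-BGM (w/o MC) described above reduces to two plain MAP stages solved by gradient descent. The PyTorch sketch below is illustrative rather than the released TensorFlow code; the learning rates, $\lambda_\theta = 0.1$, and the iteration numbers follow this section, while the $\lambda_z$ value and the random initialization of $z$ are assumptions.

```python
import copy
import torch

def cs_bgm_no_mc(A, y, g_pretrained, k, Rz=500, Rtheta=200,
                 lr_z=1e-3, lr_theta=2e-3, lam_z=0.1, lam_theta=0.1):
    """CS-BGM (w/o MC): MAP estimation for both z (cf. (10)) and theta (cf. (12))."""
    g = copy.deepcopy(g_pretrained)
    theta_hat = [p.detach().clone() for p in g.parameters()]

    # Stage 1: MAP for z with theta frozen at theta_hat, cf. (10)
    for p in g.parameters():
        p.requires_grad_(False)
    z = torch.randn(k, requires_grad=True)
    opt_z = torch.optim.Adam([z], lr=lr_z)
    for _ in range(Rz):
        opt_z.zero_grad()
        loss = ((A @ g(z) - y) ** 2).sum() + lam_z * (z ** 2).sum()
        loss.backward()
        opt_z.step()

    # Stage 2: MAP for theta with z fixed, penalizing deviation from theta_hat, cf. (12)
    for p in g.parameters():
        p.requires_grad_(True)
    z = z.detach()
    opt_t = torch.optim.Adam(g.parameters(), lr=lr_theta)
    for _ in range(Rtheta):
        opt_t.zero_grad()
        fit = ((A @ g(z) - y) ** 2).sum()
        prior = sum(((p - p0) ** 2).sum() for p, p0 in zip(g.parameters(), theta_hat))
        (fit + lam_theta * prior).backward()
        opt_t.step()

    return g(z).detach()
```

As in the experiments reported below, a single alternation between the two stages already suffices for convergence in this variant.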
We do not perform more alternations or iterations since the loss has already converged under this setting. We stress that optimization with more alternations for different $m$ could produce better performance under an elaborate parameter selection. For the sake of brevity, however, we do not include such fine-tuning in our experiments.

### 4.4. Results

We present our experimental results on the MNIST and CelebA datasets, respectively. For comparison, we choose the per-pixel $\ell_2$ reconstruction loss $\frac{1}{n}\|x - \hat{x}\|_2$ as our performance metric.

#### 4.4.1. MNIST

According to the previous hyperparameter settings, we construct the corresponding measurement matrix $A$ of size $m \times 784$ (with $m$ varying from 50 to 750) and compare the per-pixel reconstruction error of the different methods in Fig. 3(a) and Fig. 4. Overall, it can be observed that our proposed method performs the best (i.e., with the lowest $\ell_2$ loss) when the number $m$ of measurements is between 100 and 400. When $m$ is 500 or larger, the performance of Sparse-Gen improves and becomes comparable to that of CS-BGM. This is perhaps because Sparse-Gen optimizes the following problem:

$$\min_{z, \nu} \|D\nu\|_1 + \|A(g(z; \theta) + \nu) - y\|_2^2, \qquad (16)$$

where $D$ is some transform basis, which is set as an identity matrix on MNIST. Thus, when $m$ becomes larger, the assumption made by Sparse-Gen (i.e., $\nu$ being sparse) is more likely to hold.

Figure 4. Reconstruction results on the MNIST dataset when the measurement number $m = 100$.

#### 4.4.2. CelebA

We test all methods for $A \in \mathbb{R}^{m \times 12288}$ with $m$ varying from 20 to $10^4$. As shown in Fig. 3(b), the CS-BGM method has the minimum reconstruction error over the entire tested range of $m$. The gap between CS-BGM and CSGM even increases with $m$, which clearly demonstrates the superiority of expanding the range of the generator. Fig. 5 shows the comparative reconstruction results of face images among different methods on the CelebA dataset. Since it has been reported by Dhar et al. (2018) that Sparse-Gen performs better with the discrete wavelet transform (DWT) than with the discrete cosine transform (DCT), we run Sparse-Gen with the DWT in our experiments. The number $m$ of measurements is set to 1000. One can observe that Lasso works poorly in most cases, since the face images (after DCT or DWT) are not sparse enough for accurate reconstruction. One can also observe that the faces recovered by the other methods (CSGM, PGD-GAN, and Sparse-Gen) are more or less deformed or distorted. The main reason is that the distribution of the training images is inconsistent with that of the test ones, so the latter cannot be reconstructed with the learned prior. Indeed, most of these distorted faces are combinations of elements from the training images, while they are not similar in detail to the test ones. This observation confirms the inconsistency of the distributions. In addition, the residual (i.e., $x - \hat{x}$) is not strictly sparse under the hand-crafted transform domain, and hence cannot be well modeled using a sparse $\nu$. In comparison, CS-BGM can accurately restore the details of the test images, as it can adaptively adjust the range of the generator to promote the reconstruction quality.

## 5. Conclusion and Future Work

In this paper, we have proposed a method called CS-BGM for fundamentally reducing the reconstruction error of signals in CS with generative models. Our analysis has shown that the quality of reconstructed images is limited by the measurement error, regardless of what additional prior is imposed.
Nevertheless, by introducing uncertainties, CS-BGM can effectively break through the range limit on the representative capability of the generator, thus offering great potential for quality improvement in generative image reconstruction. Experiments on common datasets have demonstrated that the performance gains brought by expanding the range of the generator are indeed nontrivial.

Our future work will focus on several feasible directions. First, we will try different priors (e.g., sparsity) on $\theta$ in CS-BGM, or learn a prior for $\theta$ from the training data. Second, as mentioned, the upper bound of the reconstruction error is dominated by the slack constant $\epsilon$ when the measurement error is adequately small. To further refine the reconstruction results, therefore, attention will be cast to the reduction of $\epsilon$, with possible use of additional priors on the signal $x$. Third, we will try to generalize the proposed method to larger and more sophisticated models such as StyleGAN (Karras et al., 2019). Since the complexity of CS-BGM is dominated by the alternating optimization between $z$ and $\theta$, a potential solution is to train a neural network (e.g., an LSTM) to infer their joint posterior given $y$. Thus, the burdensome optimization process can be substituted by a simple forward mapping.

Figure 5. Reconstruction results on the CelebA dataset when the measurement number $m = 1000$. Lasso (C) and Lasso (W) mean that we perform Lasso on the DCT and DWT, respectively. Sparse-Gen is performed on the wavelet basis as well.

Acknowledgements

Shanghai Municipal Science and Technology Major Project (2018SHZDZX01); National Natural Science Foundation of China (12001109, 92046021, 61971146); Science and Technology Commission of Shanghai Municipality grant 20dz1200600; ZJ Lab and Shanghai Center for Brain Science and Brain-Inspired Technology; Innovation Cross and Cooperation Team Project of Chinese Academy of Sciences (JCTD-2020-15).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

Asim, M., Daniels, M., Leong, O., Ahmed, A., and Hand, P. Invertible generative models for inverse problems: mitigating representation error and dataset bias. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 399-409. PMLR, 13-18 Jul 2020.

Baraniuk, R., Davenport, M., DeVore, R., and Wakin, M. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253-263, 2008.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. In Precup, D. and Teh, Y. W.
(eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 537 546. PMLR, 06 11 Aug 2017. Cand es, E. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203 4215, 2005. Candes, E., Romberg, J., and Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489 509, 2006. Chen, G.-H., Tang, J., and Leng, S. Prior image constrained compressed sensing (piccs): A method to accurately reconstruct dynamic ct images from highly undersampled projection data sets. Medical Physics, 35(2):660 663, 2008. Chen, S., Billings, S. A., and Luo, W. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50 (5):1873 1896, 1989. Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Review, 43(1): 129 159, 2001. Daras, G., Dean, J., Jalal, A., and Dimakis, A. Intermediate layer optimization for inverse problems using deep generative models. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2421 2432. PMLR, 18 24 Jul 2021. Dhar, M., Grover, A., and Ermon, S. Modeling sparse deviations for compressed sensing using generative models. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1214 1223. PMLR, 10 15 Jul 2018. Duarte, M. F., Davenport, M. A., Takhar, D., Laska, J. N., Sun, T., Kelly, K. F., and Baraniuk, R. G. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83 91, 2008. Fang, J., Wang, F., Shen, Y., Li, H., and Blum, R. S. Superresolution compressed sensing for line spectral estimation: An iterative reweighted approach. IEEE Transactions on Signal Processing, 64(18):4649 4662, 2016. Fazel, M., Candes, E., Recht, B., and Parrilo, P. Compressed sensing and robust recovery of low rank matrices. In 2008 42nd Asilomar Conference on Signals, Systems and Computers, pp. 1043 1047, 2008. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. Hand, P. and Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 970 978. PMLR, 06 09 Jul 2018. Haupt, J., Bajwa, W. U., Raz, G., and Nowak, R. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Transactions on Information Theory, 56(11):5862 5875, 2010. Jalal, A., Liu, L., Dimakis, A. G., and Caramanis, C. Robust compressed sensing using generative models. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 713 727. Curran Associates, Inc., 2020. Jalal, A., Arvinte, M., Daras, G., Price, E., Dimakis, A. G., and Tamir, J. Robust compressed sensing mri with deep generative priors. 
In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 14938 14954. Curran Associates, Inc., 2021. URL https://proceedings. neurips.cc/paper/2021/file/ 7d6044e95a16761171b130dcb476a43e-Paper. pdf. Kabkab, M., Samangouei, P., and Chellappa, R. Task-aware compressed sensing with generative adversarial networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. Kamath, A., Price, E., and Karmalkar, S. On the power of compressed sensing with generative models. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5101 5109. PMLR, 13 18 Jul 2020. Uncertainty Modeling in Generative Compressed Sensing Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. Kim, K.-S., Lee, J. H., and Yang, E. Compressed sensing via measurement-conditional generative models. ar Xiv preprint ar Xiv:2007.00873, 2020. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014. Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000. Le Cun, Y. and Cortes, C. MNIST handwritten digit database. 2010. Liu, Z. and Scarlett, J. Information-theoretic lower bounds for compressive sensing with generative models. IEEE Journal on Selected Areas in Information Theory, 1(1): 292 303, 2020. Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015. Lustig, M., Donoho, D., and Pauly, J. M. Sparse mri: The application of compressed sensing for rapid mr imaging. Magnetic Resonance in Medicine, 58(6):1182 1195, 2007. Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227 234, 1995. Needell, D. and Tropp, J. Cosamp: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301 321, 2009. Raj, A., Li, Y., and Bresler, Y. Gan-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of International Conference on Computer Vision (ICCV), October 2019. Shah, V. and Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4609 4613, 2018. Shen, Y., Li, J., Zhu, Z., Cao, W., and Song, Y. Image reconstruction algorithm from compressed sensing measurements by dictionary learning. Neurocomputing, 151: 1153 1162, 2015. Tropp, J. A. and Gilbert, A. C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655 4666, 2007. Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. Whang, J., Lei, Q., and Dimakis, A. Compressed sensing with invertible generative models and dependent noise. 
In Neur IPS 2020 Workshop on Deep Learning and Inverse Problems, 2020. Whang, J., Lei, Q., and Dimakis, A. Solving inverse problems with a flow-based noise model. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 11146 11157. PMLR, 18 24 Jul 2021. Wu, Y., Rosca, M., and Lillicrap, T. Deep compressed sensing. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 6850 6860. PMLR, 09 15 Jun 2019. Yang, Y., Wang, H., Qiu, H., Wang, J., and Wang, Y. Nonconvex sparse deviation modeling via generative models. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2345 2349, 2021. Uncertainty Modeling in Generative Compressed Sensing A. Proof of Lemma 3.2 Lemma A.1 (Restated). Let g : Rk 7 Rn be L-Lipschitz. Let Bk(r) = {z|z Rk, z 2 r} be an ℓ2-norm ball in Rk. And denote by g(Bk(r)) := {g(z; θ)|z Bk(r)} the range of the generator restricted to inputs from the ℓ2-norm ball. For δ (0, 1), if then a random matrix A Rm n with i.i.d. entries N(Aij; 0, 1 m) satisfies the S-RIP(g(Bk(r)), δ, ϵ) with probability exceeding 1 e Ω(δ2m). Proof. The proof extends that of (Bora et al., 2017, Lemma 4.1) using Laurent-Massart bounds and Johnson Lindenstrauss lemma. First, Let b be a vector such that bi = m x 2 (Ax)i, i = 1, . . . , m. Since the entries of A are i.i.d. Gaussian N(Aij; 0, 1 m), it is straightforward to verify that bi N(0, 1). Thus we have b 2 2 χ2(m). Using the concentration of measure for chi-squared distribution (Laurent & Massart, 2000), we obtain P[ b 2 2 m 2 mt + 2t] e t (18) and P[ b 2 2 m 2 mt] e t. (19) Substituting 2 mt + 2t with δm in (18), and 2 mt with δm in (19), respectively, we get P[ b 2 2 (1 + δ)m] e m 2 (δ+1+ 2δ+1) and P[ b 2 2 (1 δ)m] e δ2m Notice that Ax 2 2 = x 2 2 m b 2 2, by the union bound we have for a fixed x Rn that P[(1 δ) x 2 2 Ax 2 2 (1 + δ) x 2 2] 1 e m 2 (δ+1+ 2δ+1) e δ2m =1 e Ω(δm) e Ω(δ2m) =1 e Ω(δ2m), (20) where the last equation is because δ (0, 1). Then, we construct an ϵ L-net N on Bk(r) such that log |N| k log 4Lr Since g is L-Lipschitz, we know that g(N) is an ϵ-net of g(Bk(r)). Uncertainty Modeling in Generative Compressed Sensing Let T := {g(z1) g(z2)|z1, z2 N} be the set of pairwise differences for vectors in g(N). Then we have log |T| log |N|2 2k log 4Lr For any z, z Bk(r), there exists z1, z2 N such that g(z1), g(z2) are ϵ-close to g(z), g(z ), respectively. Therefore we obtain g(z) g(z ) 2 g(z1) g(z2) 2 + g(z) g(z1) 2 + g(z2) g(z ) 2 g(z1) g(z2) 2 + 2ϵ (22) g(z) g(z ) 2 g(z1) g(z2) 2 g(z) g(z1) 2 g(z2) g(z ) 2 g(z1) g(z2) 2 2ϵ. (23) Employing (Bora et al., 2017, Lemma 8.2), with probability of 1 e Ω(m), we get Ag(z1; θ) Ag(z; θ) 2 = O(ϵ) and Ag(z2; θ) Ag(z ; θ) 2 = O(ϵ). Therefore, with probability of 1 e Ω(m), we have Ag(z) Ag(z ) 2 Ag(z1) Ag(z2) 2 + Ag(z) Ag(z1) 2 + Ag(z2) Ag(z ) 2 Ag(z1) Ag(z2) 2 + O(ϵ) (24) and Ag(z) Ag(z ) 2 Ag(z1) Ag(z2) 2 O(ϵ). (25) Substituting x in (20) with g(z1) g(z2), it then holds with probability of 1 e Ω(δ2m) that (1 δ) g(z1) g(z2) 2 2 Ag(z1) Ag(z2) 2 2 (1 + δ) g(z1) g(z2) 2 2 (26) for a fixed g(z1) g(z2). By the union bound over all the vectors in T, we have (26) holds for z1, z2 N with probability of 1 2|T|e Ω(δ2m) = 1 2e Ω(δ2m) where the equality follows from (13) and (21). 
Since (24), (25) and (26) are independent with each other, then with probability of (1 e Ω(m))(1 e Ω(δ2m)) = 1 e Ω(δ2m), it holds that Ag(z) Ag(z ) 2 Ag(z1) Ag(z2) 2 + O(ϵ) 1 + δ g(z1) g(z2) 2 + O(ϵ) Ag(z) Ag(z ) 2 Ag(z1) Ag(z2) 2 O(ϵ) 1 + δ g(z1) g(z2) 2 O(ϵ), which is the desired S-RIP(g(Bk(r)), δ, ϵ). Uncertainty Modeling in Generative Compressed Sensing B. Proof of Theorem 3.3 Theorem B.1 (Restated). Let g : Rk 7 Rn be a d-layer neural network, where each layer is composed by a linear transform and a point-wise non-linearity. And denote by R(g(z; θ)) := {g(z; θ)|z Rk} the range of the generator. Suppose there are at most c nodes per layer, and all the non-linearities are piecewise linear with at most two pieces. For δ (0, 1), if δ2 log c , (27) then a random matrix A Rm n with i.i.d. entries N(Aij; 0, 1 m) satisfies the S-RIP(R(g(z; θ)), δ, ϵ) with probability of 1 e Ω(δ2m). Proof. From (Bora et al., 2017, Section 8.3) we know that R(g(z; θ)) is a union of O(ckd) different k-dimensional faces (k-faces) in Rn. For each k-face F R(g(z; θ)), applying Lemma 3.2 (which still holds for any compact set in addition to ℓ2-norm balls), we get that A satisfies S-RIP(F, δ, ϵ) with probability of 1 e Ω(δ2m) if Then, for the range R(g(z; θ)) which is a union of O(ckd) different k-faces, the union bound gives that A satisfies S-RIP(F, δ, ϵ) with probability exceeding 1 O(ckd)e Ω(δ2m) if Therefore, let then A can satisfy S-RIP(F, δ, ϵ) with probability of 1 e Ω(δ2m). The proof is thus completed.