# GRADIENT STEP DENOISER FOR CONVERGENT PLUG-AND-PLAY

Published as a conference paper at ICLR 2022

Samuel Hurault, Arthur Leclaire & Nicolas Papadakis
Univ. Bordeaux, Bordeaux INP, CNRS, IMB, UMR 5251, F-33400 Talence, France

Plug-and-Play (PnP) methods constitute a class of iterative algorithms for imaging problems where regularization is performed by an off-the-shelf denoiser. Although PnP methods can lead to tremendous visual performance for various image problems, the few existing convergence guarantees are based on unrealistic (or suboptimal) hypotheses on the denoiser, or limited to strongly convex data-fidelity terms. We propose a new type of PnP method, based on half-quadratic splitting, for which the denoiser is realized as a gradient descent step on a functional parameterized by a deep neural network. Exploiting convergence results for proximal gradient descent algorithms in the nonconvex setting, we show that the proposed PnP algorithm is a convergent iterative scheme that targets stationary points of an explicit global functional. Besides, experiments show that it is possible to learn such a deep denoiser without compromising the performance in comparison to other state-of-the-art deep denoisers used in PnP schemes. We apply our proximal gradient algorithm to various ill-posed inverse problems, e.g. deblurring, super-resolution and inpainting. For all these applications, numerical results empirically confirm the convergence results. Experiments also show that this new algorithm reaches state-of-the-art performance, both quantitatively and qualitatively.

1 INTRODUCTION

Image restoration (IR) problems can be formulated as inverse problems of the form

$x^\star \in \operatorname*{arg\,min}_x \; f(x) + \lambda g(x), \qquad (1)$

where f is a term measuring the fidelity to a degraded observation y, and g is a regularization term weighted by a parameter λ > 0. Generally, the degradation of a clean image $\hat x$ can be modeled by a linear operation $y = A\hat x + \xi$, where A is a degradation matrix and ξ a white Gaussian noise. In this context, the maximum a posteriori (MAP) derivation relates the data-fidelity term to the likelihood, $f(x) = -\log p(y|x) = \frac{1}{2\sigma^2}\|Ax - y\|^2$, while the regularization term is related to the chosen prior. Regularization is crucial since it tackles the ill-posedness of the IR task by bringing a priori knowledge on the solution. A lot of research has been dedicated to designing accurate priors g. Among the most classical priors, one can single out total variation (Rudin et al., 1992), wavelet sparsity (Mallat, 2009) or patch-based Gaussian mixtures (Zoran & Weiss, 2011). Designing a relevant prior g is a difficult task and recent approaches rather apply deep learning techniques to directly learn a prior from a database of clean images (Lunz et al., 2018; Prost et al., 2021; González et al., 2021). Generally, the problem (1) does not have a closed-form solution, and an optimization algorithm is required. First-order proximal splitting algorithms (Combettes & Pesquet, 2011) operate individually on f and g via the proximity operator

$\mathrm{Prox}_f(x) = \operatorname*{arg\,min}_z \; \tfrac{1}{2}\|x - z\|^2 + f(z). \qquad (2)$

Among them, half-quadratic splitting (HQS) (Geman & Yang, 1995) alternately applies the proximal operators of f and g. Proximal methods are particularly useful when either f or g is nonsmooth. Plug-and-Play (PnP) methods (Venkatakrishnan et al., 2013) build on proximal splitting algorithms by replacing the proximity operator of g with a generic denoiser, e.g.
a pretrained deep network. Corresponding author: samuel.hurault@math.u-bordeaux.fr Published as a conference paper at ICLR 2022 These methods achieve state-of-the-art results (Buzzard et al., 2018; Ahmad et al., 2020; Yuan et al., 2020; Zhang et al., 2021) in various IR problems. However, since a generic denoiser cannot generally be expressed as a proximal mapping (Moreau, 1965), convergence results, which stem from the properties of the proximal operator, are difficult to obtain. Moreover, the regularizer g is only made implicit via the denoising operation. Therefore, Pn P algorithms do not seek the minimization of an explicit objective functional which strongly limits their interpretation and numerical control. In order to keep tractability of a minimization problem, Romano et al. (2017) proposed, with regularization by denoising (RED), an explicit prior g that exploits a given generic denoiser D in the form g(x) = 1 2 x, x D(x) . With strong assumptions on the denoiser (in particular a symmetric Jacobian assumption), they show that it verifies xg(x) = x D(x). (3) Such a denoiser is then plugged in gradient-based minimization schemes. Despite having shown very good results on various image restoration tasks, as later pointed out by Reehorst & Schniter (2018) or Saremi (2019), existing deep denoisers lack Jacobian symmetry. Hence, RED does not minimize an explicit functional and is not guaranteed to converge. Contributions. In this work, we develop a Pn P scheme with novel theoretical convergence guarantees and state-of-the-art IR performance. Departing from the Pn P-HQS framework, we plug a denoiser that inherently satisfies equation (3) without sacrificing the denoising performance. The resulting fixed-point algorithm is guaranteed to converge to a stationary point of an explicit functional. This convergence guarantee does not require strong convexity of the data-fidelity term, thus encompassing ill-posed IR tasks like deblurring, super-resolution or inpainting. 2 RELATED WORKS Pn P methods have been successfully applied in the literature with various splitting schemes: HQS (Zhang et al., 2017b; 2021), ADMM (Romano et al., 2017; Ryu et al., 2019), Proximal Gradient Descent (PGD) (Terris et al., 2020). First used with classical non deep denoisers such as BM3D (Chan et al., 2016) and pseudo-linear denoisers (Nair et al., 2021; Gavaskar et al., 2021), more recent Pn P approaches (Meinhardt et al., 2017; Ryu et al., 2019) rely on efficient off-theshelf deep denoisers such as Dn CNN (Zhang et al., 2017a). State-of-the-art IR results are currently obtained with denoisers that are specifically designed to be integrated in Pn P schemes, like IRCNN (Zhang et al., 2017b) or DRUNET (Zhang et al., 2021). Though providing excellent restorations, such schemes are not guaranteed to converge for all kinds of denoisers or IR tasks. Designing convergence proofs for Pn P algorithms is an active research topic. Sreehari et al. (2016) used the proximal theorem of Moreau (Moreau, 1965) to give sufficient conditions for the denoiser to be an explicit proximal map, which are applied to a pseudo-linear denoiser. The convergence with pseudo-linear denoisers have been extensively studied (Gavaskar & Chaudhury, 2020; Nair et al., 2021; Chan, 2019). However, state-of-the-art Pn P results are obtained with deep denoisers. Various assumptions have been made to ensure the convergence of the related Pn P schemes. With a bounded denoiser assumption, Chan et al. 
(2016); Gavaskar & Chaudhury (2019) showed convergence of Pn P-ADMM with stepsizes decreasing to 0. RED (Romano et al., 2017) and RED-PRO (Cohen et al., 2021) respectively consider the classes of denoisers with symmetric Jacobian or demicontractive mappings, but these conditions are either too restrictive or hard to verify in practice. In Appendix A.3, more details are given on RED-based methods. Many works focus on Lipschitz properties of Pn P operators. Depending on the splitting algorithm in use, convergence can be obtained by assuming the denoiser averaged (Sun et al., 2019b), firmly nonexpansive (Sun et al., 2021; Terris et al., 2020) or simply nonexpansive (Reehorst & Schniter, 2018; Liu et al., 2021). These settings are unrealistic as deep denoisers do not generally satisfy such properties. Ryu et al. (2019); Terris et al. (2020) propose different ways to train deep denoisers with constrained Lipschitz constants, in order to fit the technical properties required for convergence. But imposing hard Lipschitz constraints on the network alters its denoising performance (Bohra et al., 2021; Hertrich et al., 2021). Yet, Ryu et al. (2019) manages to get a convergent Pn P scheme without assuming the nonexpansiveness of D. This comes at the cost of imposing strong convexity on the data-fidelity term f, which excludes many IR tasks like deblurring, super-resolution or inpainting. Hence, given the ill-posedness of IR problems, looking for a unique solution via contractive operators is a restrictive assumption. In this work, we do not impose contractiveness, but still obtain convergence results with realistic hypotheses. Published as a conference paper at ICLR 2022 One can relate the ideal deep denoiser to the true natural image prior p via Tweedie s Identity. In (Efron, 2011), it is indeed shown that the Minimum Mean Square Error (MMSE) denoiser D σ (at noise level σ) verifies Dσ(x) = x + σ2 x log pσ(x) where pσ is the convolution of p with the density of N(0, σ2 Id). In a recent line of research (Bigdeli et al., 2017; Xu et al., 2020; Laumont et al., 2021; Kadkhodaie & Simoncelli, 2020), this relation is used to plug a denoiser in gradient-based dynamics. In practice, the MMSE denoiser cannot be computed explicitly and Tweedie s Identity does not hold for deep approximations of the MMSE. In order to be as exhaustive as possible, we detailed the addressed limitations of existing Pn P methods in Appendix A.1. 3 THE GRADIENT STEP PLUG-AND-PLAY The proposed method is based on the Pn P version of half-quadratic-splitting (Pn P-HQS) that amounts to replacing the proximity operator of the prior g with an off-the-shelf denoiser Dσ. In order to define a convergent Pn P scheme, we first set up in Section 3.1 a Gradient Step (GS) denoiser. We then introduce the Gradient Step Pn P (GS-Pn P) algorithm in Section 3.2. 3.1 GRADIENT STEP DENOISER We propose to plug a denoising operator Dσ that takes the form of a gradient descent step Dσ = Id gσ, (4) with gσ : Rn R. Contrary to Romano et al. (2017), our denoiser exactly represents a conservative vector field. The choice of the parameterization of gσ is fundamental for the denoising performance. As already noticed in Salimans & Ho (2021), we experimentally found that directly modeling gσ as a neural network (e.g. a standard network used for classification) leads to poor denoising performance. 
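Whatever parameterization is finally retained for gσ, the denoiser (4) is obtained by differentiating gσ through the network. As a minimal PyTorch-style illustration (a sketch with hypothetical helper names, not the code released with the paper), this gradient step can be computed with automatic differentiation:

```python
import torch

def gradient_step_denoiser(x, g_sigma, create_graph=False):
    """Apply D_sigma = Id - grad g_sigma (equation (4)) to a batch of images x.

    g_sigma: any differentiable potential mapping an image batch to a scalar
             (summed over the batch). Set create_graph=True at training time so
             that a loss on the denoised output can be backpropagated through
             the gradient to the parameters of g_sigma.
    """
    x = x.detach().requires_grad_(True)   # gradients w.r.t. x are needed even at test time
    g = g_sigma(x)
    (grad_g,) = torch.autograd.grad(g, x, create_graph=create_graph)
    return x - grad_g
```

In such a sketch the call must be made with autograd enabled (not under `torch.no_grad()`), since the denoiser itself is defined through a gradient of the potential.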
In order to keep the strength of state-of-the-art unconstrained denoisers, we rather use

$g_\sigma(x) = \tfrac{1}{2}\|x - N_\sigma(x)\|^2, \qquad (5)$

which leads to

$D_\sigma(x) = x - \nabla g_\sigma(x) = N_\sigma(x) + J_{N_\sigma}(x)^T \big(x - N_\sigma(x)\big), \qquad (6)$

where $N_\sigma : \mathbb{R}^n \to \mathbb{R}^n$ is parameterized by a neural network and $J_{N_\sigma}(x)$ is the Jacobian of $N_\sigma$ at point x. As discussed in Appendix A.2, the formulation (5) for gσ has been proposed in (Romano et al., 2017, Section 5.2) and (Bigdeli & Zwicker, 2017) for a distinct but related purpose, and not exploited for convergence analysis. Thanks to our definition (6) of Dσ, we can parameterize Nσ with any differentiable neural network architecture $\mathbb{R}^n \to \mathbb{R}^n$ that has proven efficient for image denoising. Although the representation power of the denoiser is limited by the particular form (6), we show (see Section 5.1) that such a parameterization still yields state-of-the-art denoising results. We train the denoiser Dσ for Gaussian noise by minimizing the MSE loss function

$\mathcal{L}(D_\sigma) = \mathbb{E}_{x \sim p,\ \xi_\sigma \sim \mathcal{N}(0, \sigma^2 I)}\big[\|D_\sigma(x + \xi_\sigma) - x\|^2\big], \qquad (7)$

or, when written in terms of gσ using equation (4),

$\mathcal{L}(g_\sigma) = \mathbb{E}_{x \sim p,\ \xi_\sigma \sim \mathcal{N}(0, \sigma^2 I)}\big[\|\nabla g_\sigma(x + \xi_\sigma) - \xi_\sigma\|^2\big]. \qquad (8)$

Remark 1. By definition, the optimal solution $g_\sigma^\star \in \arg\min \mathcal{L}$ is related to the MMSE denoiser $D_\sigma^\star$, that is, the best non-linear predictor of x given x + ξσ. Therefore, it satisfies Tweedie's formula and $\nabla g_\sigma^\star = -\sigma^2 \nabla \log p_\sigma$ (Efron, 2011), i.e. $g_\sigma^\star = -\sigma^2 \log p_\sigma + C$ for some $C \in \mathbb{R}$. Hence approximating the MMSE denoiser with a denoiser parameterized as (4) is related to approximating the logarithm of the smoothed image prior $\log p_\sigma$ with $-\frac{1}{\sigma^2} g_\sigma$. This relation was used for image generation with Denoising Score Matching by Saremi & Hyvarinen (2019); Bigdeli et al. (2020).

3.2 A PLUG-AND-PLAY METHOD FOR EXPLICIT MINIMIZATION

The standard PnP-HQS operator is $T_{\text{PnP-HQS}} = D_\sigma \circ \mathrm{Prox}_{\tau f}$, i.e. $(\mathrm{Id} - \nabla g_\sigma) \circ \mathrm{Prox}_{\tau f}$ when using the GS denoiser as Dσ. For convergence analysis, we wish to fit the proximal gradient descent (PGD) algorithm. We thus propose to switch the proximal and gradient steps and to relax the denoising step with a parameter λ > 0. Our PnP algorithm with GS denoiser (GS-PnP) then writes $x_{k+1} = T^{\tau,\lambda}_{\text{GS-PnP}}(x_k)$ with

$T^{\tau,\lambda}_{\text{GS-PnP}} = \mathrm{Prox}_{\tau f} \circ \big(\tau\lambda D_\sigma + (1 - \tau\lambda)\,\mathrm{Id}\big) = \mathrm{Prox}_{\tau f} \circ \big(\mathrm{Id} - \tau\lambda \nabla g_\sigma\big). \qquad (9)$

Under suitable conditions on f and gσ (see Lemma 1 in Appendix C), fixed points of the PGD operator $T^{\tau,\lambda}_{\text{GS-PnP}}$ correspond to critical points of a classical objective function in IR problems,

$F(x) = f(x) + \lambda g_\sigma(x). \qquad (10)$

Therefore, using the GS denoiser from equation (4) amounts to including an explicit regularization and thus leads to a tractable global optimization problem solved by the PnP algorithm. Our complete PnP scheme is presented in Algorithm 1. It includes a backtracking procedure on the stepsize τ that will be detailed in Section 4.2. Also, after convergence, we found it useful to apply an extra gradient step Id − λτ∇gσ in order to discard the residual noise brought by the last proximal step Prox_{τf}.

Algorithm 1: Plug-and-Play image restoration
  Parameters: initialization z0, λ > 0, σ ≥ 0, ε > 0, τ0 > 0, K ∈ ℕ, η ∈ (0, 1), γ ∈ (0, 1/2).
  Input: degraded image y.  Output: restored image x̂.
  k = 0;  x0 = Prox_{τf}(z0);  τ = τ0/η;  Δ = +∞;
  while k < K and Δ > ε do
      z_k = λτ D_σ(x_k) + (1 − λτ) x_k;
      x_{k+1} = Prox_{τf}(z_k);
      if F(x_k) − F(x_{k+1}) < (γ/τ) ‖x_k − x_{k+1}‖² then
          τ = ητ;
      else
          Δ = ( F(x_k) − F(x_{k+1}) ) / F(x_0);
          k = k + 1;
      end
  end
  x̂ = λτ D_σ(x_K) + (1 − λτ) x_K;

4 CONVERGENCE ANALYSIS

In this section, we introduce conditions on f and Dσ that will ensure the convergence of the PnP iterations (9) towards a solution of (10).
For that purpose, we make use of the literature of convergence analysis (Attouch et al., 2013; Beck, 2017; Beck & Teboulle, 2009) of the PGD algorithm in the nonconvex setting. 4.1 CONVERGENCE RESULTS A common setting for image restoration is the convex smooth L2 data-fidelity f(x) = ||Ax y||2. In order to cover the noiseless case or to deal with a broader range of common degradation models, like Laplace or Poisson noise, we only assume f to be proper, lower semicontinous and convex. Next, the regularizer gσ is assumed to be differentiable with Lipschitz gradient, but not necessarily convex. This assumption on gσ is reasonable from a practical perspective. Indeed, using a network Nσ with differentiable activation functions, the function gσ introduced in Section 3 is differentiable with Lipschitz gradient (details and proof are given in Appendix B). Without further assumptions on f and gσ, the following theorem establishes the convergence of both the objective function values and the residual for a large variety of IR tasks. Theorem 1 (Proof in Appendix C). Let f : Rn R {+ } and gσ : Rn R be proper lower semicontinous functions with f convex and gσ differentiable with L-Lipschitz gradient. Let λ > 0, F = f + λgσ and assume that F is bounded from below. Then, for τ < 1 λL, the iterates xk given by the iterative scheme (9) verify (i) (F(xk)) is non-increasing and converges. (ii) The residual ||xk+1 xk|| converges to 0. (iii) All cluster points of the sequence (xk) are stationary points of (10). Remark 2. In the nonconvex setting, the quantity γk = min0 i k ||xi+1 xi||2 is commonly used to analyze the convergence rate of the algorithm (Beck & Teboulle, 2009; Ochs et al., 2014). Following the proof of Theorem 1 in Appendix C, we can obtain γk 1 k F (x0) lim F (xk) 2 that is to say a k) convergence rate for the squared L2 residual. Remark 3. Even if most data-fidelity terms f encountered in image restoration are convex, Theorem 1 can be extended to nonconvex f. The proof of (iii) requires technical adaptations that can be found in Li & Lin (2015, Theorem 1) (as the 1-Lipschitz property of Proxτf does not hold anymore). Such nonconvex f appear for example in the context of phase retrieval (Metzler et al., 2018). We can further obtain convergence of the iterates by assuming that the generated sequence (xk) is bounded and that f and gσ verify the Kurdyka-Lojasiewicz (KL) property. The boundedness of (xk) is discussed in Appendix D. The KL property (defined in Appendix E) has been widely used to study the convergence of optimization algorithms in the nonconvex setting (Attouch et al., 2010; 2013; Ochs et al., 2014). Very large classes of functions, in particular all the semi-algebraic functions, satisfy this property. In practice, in the extent of our analysis, the KL property is always satisfied. Published as a conference paper at ICLR 2022 Theorem 2 (Proof in Attouch et al. (2013), Theorem 5.1). Let f : Rn R {+ } and gσ : Rn R be proper lower semicontinous functions with f convex and gσ differentiable with L-Lipschitz gradient. Let λ > 0, F = f + λgσ and assume that F is bounded from below. Assume that F verify the KL property. Suppose that τ < 1 λL. If the sequence (xk) given by the iterative scheme (9) is bounded, then it converges, with finite length, to a critical point x of F. Remark 4. As explained in Attouch et al. (2013, Remark 5.2), the continuity of f is not required since we use the exact forward-backward splitting algorithm . 
We can thus deal with non-continuous data-fidelity terms, as it is the case with the inpainting application in Appendix J.3. A more detailed description of all the assumptions of Theorems 1 and 2 is given in Appendix F. 4.2 BACKTRACKING TO HANDLE THE LIPSCHITZ CONSTANT OF gσ The convergence of Algorithm 1 actually requires to control the Lipschitz constant of gσ only on a small subset of images related to {xk}. Therefore, estimating L for all images and setting the maximum stepsize τ accordingly will lead to sub-optimal convergence speed. In order to avoid small stepsizes, we use the backtracking strategy of Beck (2017, Chapter 10) and Ochs et al. (2014). The convergence study in the proof of Theorem 1 is based on the sufficient decrease property of F established in equation (33). Without knowing the exact Lipschitz constant L, backtracking aims at finding the maximal stepsize τ yielding the sufficient decrease property. Given γ (0, 1/2), η [0, 1) and an initial stepsize τ0 > 0, the following update rule on τ is applied at each iteration k: while F(xk) F(T τ,λ GS-Pn P(xk)) < γ τ ||T τ,λ GS-Pn P(xk) xk||2, τ ητ. (11) Proposition 1 (Proof in Appendix G). Under the assumptions of Theorem 1, at each iteration of the algorithm, the backtracking procedure (11) is finite (i.e. a stepsize satifying F(xk) F(T τ,λ GS-Pn P(xk)) γ τ ||T τ,λ GS-Pn P(xk) xk||2 is found in a finite number if iterations), and with backtracking, the convergence results of Theorem 1 and Theorem 2 still hold. Remark 5. In practice, as explained in Section 5, we choose to initialize the stepsize as λτ0 = 1 and backtracking hardly ever activates. In this particular case, our results prove convergence of the standard Pn P-HQS scheme xk+1 = Proxτf Dσ(xk) applied with our specific denoiser. 5 EXPERIMENTS In this section, we first show the performance of the GS denoiser. Next we empirically confirm that our GS-Pn P method is convergent while providing state-of-the art results for different IR tasks. 5.1 GRADIENT-DESCENT-BASED DENOISER Denoising Network Architecture We choose to parameterize Nσ with the architecture DRUNet (Zhang et al., 2021)) (represented in Appendix H), a U-Net in which residual blocks are integrated. One first benefit of DRUNet is that it is built to take the noise level σ as input, which is consistent with our formulation. Also, the U-Net models have previously offered good results in the context of prior approximation via Denoising Score Matching (Ho et al., 2020). Furthermore, Zhang et al. (2021) showed that DRUNet yields state-of-the-art results for denoising but also for Pn P image restoration. In order to ensure differentiability w.r.t the input, we change RELU activations to ELU. We also limit the number of residual blocks to 2 at each scale to lower the computational burden. Training details We use the color image training dataset proposed in Zhang et al. (2021) i.e. a combination of the Berkeley segmentation dataset (CBSD) (Martin et al., 2001), Waterloo Exploration Database (Ma et al., 2017), DIV2K dataset (Agustsson & Timofte, 2017) and Flick2K dataset (Lim et al., 2017). During training, the input images are corrupted with a white Gaussian noise ξσ with standard deviation σ randomly chosen in [0, 50/255]. With our parameterization (6) of Dσ, the network Nσ is trained to minimize the L2 loss (7). While the original DRUNet is trained with L1 loss, we stick to the L2 loss to keep the interpretability of gσ as an approximation of the log prior (see Remark 1). 
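To make this training procedure concrete, here is a minimal sketch of one optimization step for the loss (7) (PyTorch-style, with hypothetical names; `N_sigma` stands for the DRUNet-like network that takes the noise level as an extra input, and the data pipeline of the paper is not reproduced):

```python
import torch

def training_step(N_sigma, optimizer, x_clean, sigma_max=50.0 / 255.0):
    """One MSE training step (7) of the gradient step denoiser D_sigma = Id - grad g_sigma."""
    # sample a noise level per image in [0, 50/255] and add white Gaussian noise
    sigma = torch.empty(x_clean.shape[0], 1, 1, 1, device=x_clean.device).uniform_(0.0, sigma_max)
    x_noisy = (x_clean + sigma * torch.randn_like(x_clean)).requires_grad_(True)

    # g_sigma(x) = 0.5 * ||x - N_sigma(x)||^2, summed over the batch
    g = 0.5 * ((x_noisy - N_sigma(x_noisy, sigma)) ** 2).sum()
    # D_sigma(x) = x - grad g_sigma(x); create_graph=True so that the loss below
    # can be backpropagated through this gradient to the parameters of N_sigma
    (grad_g,) = torch.autograd.grad(g, x_noisy, create_graph=True)
    denoised = x_noisy - grad_g

    loss = torch.mean((denoised - x_clean) ** 2)   # MSE loss (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```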
For each batch, the gradient ∇gσ is computed with PyTorch automatic differentiation. We train the model on 128×128 patches randomly sampled from the training images, with batch size 16, for 1500 epochs. We use the ADAM optimizer with learning rate 10⁻⁴, divided by 2 every 300 epochs. It takes around one week to train the model on a single Tesla P100 GPU.

Denoising results. We evaluate the PSNR performance of the proposed GS denoiser (GS-DRUNet) on 256×256 color images center-cropped from the original CBSD68 dataset images (Martin et al., 2001). In Table 1, we compare, for various noise levels σ, our model with the simplified DRUNet (called "DRUNet light") that has the same architecture as our GS-DRUNet (2 residual blocks) and that is trained (with L2 loss) without the conservative field constraint. We also provide comparisons with the original DRUNet (Zhang et al., 2021) (with 4 residual blocks at each scale and trained with L1 loss) and two state-of-the-art denoisers encountered in the PnP literature: FFDNet (Zhang et al., 2018) and DnCNN (Zhang et al., 2017a). For each network, we indicate in Table 1 the average runtime for processing a 256×256 color image on one Tesla P100 GPU.

| σ (/255)     | 5     | 15    | 25    | 50    | Time (ms) |
|--------------|-------|-------|-------|-------|-----------|
| FFDNet       | 39.95 | 33.53 | 30.84 | 27.54 | 1.9       |
| DnCNN        | 39.80 | 33.55 | 30.87 | 27.52 | 2.3       |
| DRUNet       | 40.31 | 33.97 | 31.32 | 28.08 | 69.8      |
| DRUNet light | 40.19 | 33.89 | 31.25 | 28.00 | 6.3       |
| GS-DRUNet    | 40.26 | 33.90 | 31.26 | 28.01 | 10.4      |

Table 1: Average PSNR denoising performance and runtime of our GS denoiser on 256×256 center-cropped images from the CBSD68 dataset, for various noise levels σ. While keeping a small runtime, GS-DRUNet slightly outperforms its unconstrained counterpart DRUNet light and outdistances the deep denoisers FFDNet and DnCNN.

Our GS-DRUNet denoiser, despite being constrained to be an exact conservative field, reaches the performance of (and even slightly outperforms) its unconstrained counterpart DRUNet light. Second, compared to the original DRUNet, we are able to reduce the processing time by a large margin (≈ ×7) while keeping a close PSNR (around −0.05 dB) and maintaining a significant PSNR gap (around +0.5 dB) with other deep denoisers like DnCNN and FFDNet. Note that the time difference between GS-DRUNet and DRUNet light is due to the computation of ∇gσ via backpropagation. These results indicate that GS-DRUNet is likely to yield a competitive and fast PnP algorithm.

5.2 PLUG-AND-PLAY IMAGE RESTORATION

We show in this section that, in addition to being convergent, our PnP Algorithm 1 reaches state-of-the-art performance in deblurring and super-resolution (Sections 5.2.1 and 5.2.2) and realizes relevant inpainting (Appendix J.3). In all cases, we seek an estimate x of a clean image $\hat x \in \mathbb{R}^n$ from an observation obtained as $y = A\hat x + \xi_\nu \in \mathbb{R}^m$, with A an m×n degradation matrix and ξν a white Gaussian noise with zero mean and standard deviation ν. The objective function minimized by Algorithm 1 is

$F(x) = \frac{1}{2\nu^2}\|Ax - y\|^2 + \frac{\lambda}{2}\|N_\sigma(x) - x\|^2. \qquad (12)$

In practice, for ν > 0, we multiply F in equation (12) by ν² and consider $F = f + \lambda_\nu g_\sigma$ with $f(x) = \frac{1}{2}\|Ax - y\|^2$, $g_\sigma(x) = \frac{1}{2}\|N_\sigma(x) - x\|^2$ and $\lambda_\nu = \lambda\nu^2$. With this formulation, the convergence of iterates and objective values is guaranteed by Theorems 1 and 2. We also demonstrate in Appendix J.3 that our framework can be extended to other kinds of objective functions. For example, inpainting noise-free input images leads to a non-differentiable data-fidelity term f.
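For illustration, the main loop of Algorithm 1 applied to this rescaled objective can be sketched as follows (PyTorch-style pseudocode with hypothetical names: `prox_f` evaluates the proximal map of the data-fidelity term, e.g. equation (13) below, `denoiser` is the gradient step denoiser, and `F_obj` evaluates F; the code released with the paper may differ in details):

```python
def gs_pnp(z0, prox_f, denoiser, F_obj, lam, tau0, K=400, eps=1e-5, eta=0.9, gamma=0.1):
    """Schematic GS-PnP restoration loop (Algorithm 1), with backtracking on tau."""
    tau = tau0
    x = prox_f(z0, tau)                                        # x_0: initial proximal step
    F0 = Fx = F_obj(x)
    for _ in range(K):
        z = lam * tau * denoiser(x) + (1.0 - lam * tau) * x    # relaxed gradient step (9)
        x_new = prox_f(z, tau)
        F_new = F_obj(x_new)
        if Fx - F_new < (gamma / tau) * ((x_new - x) ** 2).sum():
            tau = eta * tau                                    # sufficient decrease failed: shrink tau
            continue
        delta = (Fx - F_new) / F0                              # relative decrease of the objective
        x, Fx = x_new, F_new
        if delta < eps:
            break
    return lam * tau * denoiser(x) + (1.0 - lam * tau) * x     # final extra gradient step
```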
Due to the large computational time of some compared methods, we use for evaluation and comparison a subset of 10 color images taken from the CBSD68 dataset (CBSD10) together with the 3 famous set3C images (butterfly, leaves and starfish). Quantitative results run on the full CBSD68 dataset are given in Appendix J. All images are center-cropped to the size 256 256. For each IR problem, we provide default values for the parameters σ and λ that can be used to treat sucessfully a large class of images and degradations. The influence of both parameters is analyzed in Appendix J.5. Performance can be marginally improved by tuning λ for each image, for example with the method of Wei et al. (2020) based on reinforcement learning. In our experiments, backtracking is performed with η = 0.9 and γ = 0.1. We observe (see Appendix B) that on a majority of images, the Lipschitz constant L of gσ is slightly larger than 1. As convergence is ensured for λντ = ν2λτ < 1 L, we set the initial stepsize to τ0 = (ν2λ) 1. At the first iteration, the gradient step in equation (9) is thus exactly Dσ. In the majority of our experiments, backtracking is never activated. The algorithm is initialized with a proximal step and terminates when the relative difference between consecutive values of the objective function is less than ϵ or the number of iterations exceeds K. Published as a conference paper at ICLR 2022 5.2.1 DEBLURRING For image deblurring, the degradation operator A = H is a convolution performed with circular boundary conditions. Therefore, H = F ΛF, where F is the orthogonal matrix of the discrete Fourier transform (and F its inverse), and Λ is a diagonal matrix. The proximal operator of the data-fidelity term f(x) = 1 2||Hx y||2 involves only element-wise inversion and writes Proxτf(z) = F (In + τΛ Λ) 1F(τHT y + z). (13) We demonstrate the effectiveness of our method on a large variety of blur kernels (represented in Table 2) and noise levels. As in (Zhang et al., 2017b; Pesquet et al., 2021; Zhang et al., 2021), we use the 8 real-world camera shake kernels proposed in Levin et al. (2009) as well as the 9 9 uniform kernel and the 25 25 Gaussian kernel with standard deviation 1.6 (as in (Romano et al., 2017)). We consider Gaussian noise with 3 noise levels ν {2.55, 7.65, 12.75}/255 i.e. ν {0.01, 0.03, 0.05}. For all noise levels, we set σ = 1.8ν, λν = ν2λ = 0.1 for motion blur (kernels (a) to (h)) and λν = 0.075 for static blur (kernels (i) and (j)). Initialization is done with z0 = y but we show in Appendix J.6 the robustness to the initialization. The stopping criteria are ϵ = 10 5 and K = 400. We compare in Table 2 our method (GS-Pn P) against the patch-based method EPLL (Zoran & Weiss, 2011; Hurault et al., 2018), the deep Pn P methods IRCNN (Zhang et al., 2017b), DPIR (Zhang et al., 2021), MMO (Pesquet et al., 2021), and the RED-FP algorithm (Romano et al., 2017) (with TNRD denoiser (Chen & Pock, 2016)) referred to as RED. Both IRCNN and DPIR use Pn P-HQS with a fast decrease of τ and σ in a few iterations (8 iterations for DPIR) without guarantee of convergence. DPIR uses the DRUNet denoiser from Table 1. MMO is the only compared method that guarantees convergence by plugging a Dn CNN denoiser trained with Lipschitz constraints (but the only given network was trained for very low noise level). Finally, as RED only treats the Y channel in the YCb Cr color space, we also indicate in Appendix J.1, for RED and the proposed method, the PSNR evaluated on the Y channel only. 
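For reference, the proximal step (13) only involves element-wise operations in the Fourier domain. A minimal sketch (assuming the blur kernel has been zero-padded to the image size and Fourier-transformed beforehand; function and variable names are illustrative):

```python
import torch
import torch.fft as fft

def prox_deblur(z, y, k_fft, tau):
    """Proximal map (13) of f(x) = 0.5 * ||Hx - y||^2 for a circular blur H = F* Lambda F.

    k_fft: 2D FFT of the blur kernel zero-padded to the image size (diagonal of Lambda).
    """
    num = tau * torch.conj(k_fft) * fft.fft2(y) + fft.fft2(z)  # F(tau * H^T y + z)
    den = 1.0 + tau * k_fft.abs() ** 2                         # I_n + tau * Lambda^* Lambda
    return fft.ifft2(num / den).real
```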
Among all methods, GS-PnP closely follows DPIR in terms of PSNR for low noise levels but performs equally well or better for higher noise levels. Other comparisons are conducted in Appendix J.1 on the Set3C and the full CBSD68 datasets (Tables 4 and 5). These results show that GS-PnP reaches the state of the art in PnP deblurring for a variety of kernels and noise levels. We underline that the convergence of GS-PnP is guaranteed, whereas DPIR can asymptotically diverge (see Appendix J.7).

| ν    | Method | (a)   | (b)   | (c)   | (d)   | (e)   | (f)   | (g)   | (h)   | (i)   | (j)   | Avg   |
|------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0.01 | EPLL   | 28.32 | 28.24 | 28.36 | 25.80 | 29.61 | 27.15 | 26.90 | 26.69 | 25.84 | 26.49 | 27.34 |
| 0.01 | RED    | 30.47 | 30.01 | 30.29 | 28.09 | 31.22 | 28.92 | 28.90 | 28.67 | 26.66 | 28.45 | 29.17 |
| 0.01 | IRCNN  | 32.96 | 32.62 | 32.53 | 32.44 | 33.51 | 33.62 | 32.54 | 32.20 | 28.11 | 29.19 | 31.97 |
| 0.01 | MMO    | 32.35 | 32.06 | 32.24 | 31.67 | 31.77 | 33.17 | 32.30 | 31.80 | 27.81 | 29.26 | 31.44 |
| 0.01 | DPIR   | 33.76 | 33.30 | 33.04 | 33.09 | 34.10 | 34.34 | 33.06 | 32.77 | 28.34 | 29.16 | 32.50 |
| 0.01 | GS-PnP | 33.52 | 33.07 | 32.91 | 32.83 | 34.07 | 34.25 | 32.96 | 32.54 | 28.11 | 29.03 | 32.33 |
| 0.03 | EPLL   | 25.31 | 25.12 | 25.82 | 23.75 | 26.99 | 25.23 | 25.00 | 24.59 | 24.34 | 25.43 | 25.16 |
| 0.03 | RED    | 25.71 | 25.32 | 25.71 | 24.38 | 26.65 | 25.50 | 25.27 | 24.99 | 23.51 | 25.54 | 25.26 |
| 0.03 | IRCNN  | 28.96 | 28.65 | 28.90 | 28.38 | 30.03 | 29.87 | 28.92 | 28.52 | 25.92 | 27.64 | 28.58 |
| 0.03 | DPIR   | 29.38 | 29.06 | 29.21 | 28.77 | 30.22 | 30.23 | 29.34 | 28.90 | 26.19 | 27.81 | 28.91 |
| 0.03 | GS-PnP | 29.22 | 28.89 | 29.20 | 28.60 | 30.32 | 30.21 | 29.32 | 28.92 | 26.38 | 27.89 | 28.90 |
| 0.05 | EPLL   | 24.08 | 23.91 | 24.78 | 22.57 | 25.68 | 23.98 | 23.70 | 23.19 | 23.75 | 24.78 | 24.04 |
| 0.05 | RED    | 22.78 | 22.54 | 23.13 | 21.92 | 23.78 | 22.97 | 22.89 | 22.67 | 22.01 | 23.78 | 22.84 |
| 0.05 | IRCNN  | 27.00 | 26.74 | 27.25 | 26.37 | 28.29 | 28.06 | 27.22 | 26.81 | 24.85 | 26.83 | 26.94 |
| 0.05 | DPIR   | 27.52 | 27.35 | 27.73 | 27.02 | 28.63 | 28.46 | 27.79 | 27.30 | 25.25 | 27.11 | 27.42 |
| 0.05 | GS-PnP | 27.45 | 27.28 | 27.70 | 26.98 | 28.68 | 28.44 | 27.81 | 27.38 | 25.49 | 27.15 | 27.44 |

Table 2: PSNR (dB) comparison of image deblurring methods on CBSD10 with the blur kernels (a)-(j) and noise levels ν (the kernel images shown in the original table are not reproduced here). The last column is the average over kernels.

For qualitative comparison, we show in Figure 1(c-f) the deblurring obtained with various methods on the image starfish (from Set3C). Note that our algorithm, compared to competing methods, recovers the sharpest edges. We also give convergence curves that empirically confirm the convergence of the values F(x_k) (g) and of the residual $\min_{0 \le i \le k} \|x_{i+1} - x_i\|^2 / \|x_0\|^2$ (h). These observations are supported by the additional experiment shown in Appendix J.1, Figure 6.

[Figure 1 panels, images not reproduced: (b) Observed (20.97 dB), (c) RED (25.92 dB), (d) IRCNN (28.66 dB), (e) DPIR (29.76 dB), (f) GS-PnP (29.90 dB), (g) F(x_k), (h) γ_k (log scale).]

Figure 1: Deblurring with various methods of starfish degraded with the indicated blur kernel and input noise level ν = 0.03. Note that our algorithm better recovers the structures. In (g) and (h), we show the evolution of $F(x_k)$ and $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2 / \|x_0\|^2$, and in Appendix J.4 the evolution of the PSNR. We empirically verify the convergence of the functional values and of the residual. Note that the empirical convergence rate in (h) is faster than the O(1/k) theoretical worst-case rate established in Remark 2.

5.2.2 SUPER-RESOLUTION

For single image super-resolution, the low-resolution image $y \in \mathbb{R}^m$ is obtained from the high-resolution one $x \in \mathbb{R}^n$ via $y = SHx + \xi_\nu$, where $H \in \mathbb{R}^{n \times n}$ is the convolution with an anti-aliasing kernel. The matrix S is the standard s-fold downsampling matrix of size m×n, with n = s²m. In this context, we make use of the closed-form calculation of the proximal map for the data-fidelity term $f(x) = \frac{1}{2}\|SHx - y\|^2$, given by Zhao et al. (2016):

$\mathrm{Prox}_{\tau f}(z) = \hat z_\tau - \frac{\tau}{s^2}\, F^* \bar\Lambda^* \Big(I_m + \frac{\tau}{s^2}\,\bar\Lambda\bar\Lambda^*\Big)^{-1} \bar\Lambda F \hat z_\tau, \qquad (14)$

where $\hat z_\tau = \tau H^T S^T y + z$ and $\bar\Lambda = [\Lambda_1, \dots, \Lambda_{s^2}] \in \mathbb{R}^{m \times n}$, with $\Lambda = \mathrm{diag}(\Lambda_1, \dots, \Lambda_{s^2})$ a block-diagonal decomposition of Λ according to an s×s paving of the Fourier domain. Note that $I_m + \frac{\tau}{s^2}\bar\Lambda\bar\Lambda^*$ is an m×m diagonal matrix whose inverse is computed element-wise. As expected, with s = 1, equation (14) comes down to equation (13).
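Rather than reimplementing the closed form (14), the same proximal point can also be computed numerically with a few conjugate-gradient iterations on the normal equations $(I + \tau H^T S^T S H)x = z + \tau H^T S^T y$. The following sketch (hypothetical names; standard s-fold downsampling by pixel selection is assumed) is a simple drop-in alternative, not the computation used in the paper:

```python
import torch
import torch.fft as fft

def sr_prox_cg(z, y, k_fft, s, tau, n_iter=100, tol=1e-8):
    """Evaluate Prox_{tau f}(z) for f(x) = 0.5 * ||S H x - y||^2 by conjugate gradient.

    k_fft: FFT of the blur kernel zero-padded to the high-resolution size; s: SR factor.
    """
    def H(x):                      # circular blur in the Fourier domain
        return fft.ifft2(k_fft * fft.fft2(x)).real

    def Ht(x):                     # adjoint blur
        return fft.ifft2(torch.conj(k_fft) * fft.fft2(x)).real

    def StS(x):                    # S^T S: keep one pixel out of s in each direction
        m = torch.zeros_like(x)
        m[..., ::s, ::s] = 1.0
        return m * x

    def St(v):                     # S^T: zero-fill the low-resolution image on the HR grid
        out = torch.zeros_like(z)
        out[..., ::s, ::s] = v
        return out

    def A(x):                      # normal operator I + tau * H^T S^T S H
        return x + tau * Ht(StS(H(x)))

    b = z + tau * Ht(St(y))
    x = z.clone()
    r = b - A(x)
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):        # standard conjugate gradient for an SPD operator
        Ap = A(p)
        alpha = rs / (p * Ap).sum()
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = (r * r).sum()
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```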
As in Zhang et al. (2021), we evaluate super-resolution performance on the 8 Gaussian blur kernels represented in Table 3: 4 isotropic kernels with different standard deviations (0.7, 1.2, 1.6 and 2.0) and 4 anisotropic kernels. Results are averaged between isotropic and anisotropic kernels. We consider downsampled images at scales s = 2 and s = 3 and Gaussian noise with 3 different noise levels ν ∈ {0.01, 0.03, 0.05}. Our method (GS-PnP) is compared against bicubic upsampling, RED, IRCNN ("IRCNN+" from (Zhang et al., 2021)) and DPIR. We give again in Appendix J.2 the results obtained on the Set3C dataset (Table 7). All our results are obtained with λν = ν²λ = 0.065 and σ = 2ν. Initialization z0 is done with a bicubic interpolation of y (with a shift correction (Zhang et al., 2021)) and the stopping criteria are ε = 10⁻⁶ and K = 400. Besides being the only compared PnP method with a convergence guarantee, GS-PnP outperforms in PSNR all other PnP algorithms over the considered range of blur kernels, noise levels and scale factors. We show in Figure 2 the super-resolution of the image leaves downsampled by 2, with an isotropic kernel and noise level ν = 0.03. GS-PnP (f) recovers structures and color details more accurately than competing approaches (c-e), while converging in terms of function values (g) and residual (h). Additional visual comparisons are presented in Appendix J.2.

| Kernels     | Method  | s=2, ν=0.01 | s=2, ν=0.03 | s=2, ν=0.05 | s=3, ν=0.01 | s=3, ν=0.03 | s=3, ν=0.05 | Avg   |
|-------------|---------|-------------|-------------|-------------|-------------|-------------|-------------|-------|
| Kernels 1-4 | Bicubic | 24.85 | 23.96 | 22.79 | 23.14 | 22.52 | 21.62 | 23.15 |
| Kernels 1-4 | RED     | 28.29 | 24.65 | 22.98 | 26.13 | 24.02 | 22.37 | 24.74 |
| Kernels 1-4 | IRCNN   | 27.43 | 26.22 | 25.86 | 26.12 | 25.11 | 24.79 | 25.92 |
| Kernels 1-4 | DPIR    | 28.62 | 27.30 | 26.47 | 26.88 | 25.96 | 25.22 | 26.74 |
| Kernels 1-4 | GS-PnP  | 28.77 | 27.54 | 26.63 | 26.85 | 26.05 | 25.29 | 26.86 |
| Kernels 5-8 | Bicubic | 23.38 | 22.71 | 21.78 | 22.65 | 22.08 | 21.25 | 22.31 |
| Kernels 5-8 | RED     | 26.33 | 23.91 | 22.45 | 25.38 | 23.40 | 21.91 | 23.90 |
| Kernels 5-8 | IRCNN   | 25.83 | 24.89 | 24.59 | 25.36 | 24.36 | 23.95 | 24.83 |
| Kernels 5-8 | DPIR    | 26.84 | 25.59 | 24.89 | 26.24 | 24.98 | 24.32 | 25.48 |
| Kernels 5-8 | GS-PnP  | 26.80 | 25.73 | 25.03 | 26.18 | 25.08 | 24.31 | 25.52 |

Table 3: PSNR (dB) comparison of image super-resolution methods on CBSD10 with various scales s, blur kernels and noise levels ν. PSNR results are averaged over the kernels of each row (the kernel images shown in the original table are not reproduced here).

[Figure 2 panels, images not reproduced: (b) Observed, (c) RED (21.75 dB), (d) IRCNN (22.82 dB), (e) DPIR (23.97 dB), (f) GS-PnP (24.81 dB), (g) F(x_k), (h) γ_k (log scale).]

Figure 2: Super-resolution with various methods on leaves (Set3C) downsampled by 2, with the indicated blur kernel and input noise level ν = 0.03. Note that our algorithm recovers the sharpest leaves. In (g) and (h), we show the evolution of $F(x_k)$ and $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2 / \|x_0\|^2$, and in Appendix J.4 the evolution of the PSNR. The empirical convergence rate is faster than the O(1/k) theoretical worst-case rate established in Remark 2.

6 CONCLUSION

In this work, we introduce a new PnP algorithm with convergence guarantees. A denoiser is trained to realize an exact gradient step on a regularization function that is parameterized by a neural network. This denoiser is plugged into an iterative scheme closely related to PnP-HQS, which is proved to converge towards a stationary point of an explicit functional.
One strength of this approach is to simultaneously allow for a non strongly convex (and non smooth) data-fidelity term with a denoiser that may not be nonexpansive. Experiments conducted on ill-posed imaging problems (deblurring, super-resolution, inpainting) confirm the convergence results and show that the proposed Pn P algorithm reaches state-of-the-art image restoration performance. This work also opens several research perspectives. One could first examine which information is encoded in the proposed prior. For example, based on the sharp visual results, one can question if a relation can be drawn between this prior and the gradient energy or sparsity. Also, it would be interesting to see if the recovery analysis developed in (Liu et al., 2021) can adapt to the proposed framework. Published as a conference paper at ICLR 2022 7 REPRODUCIBILITY STATEMENT Anonymous source code is given in supplementary material. It contains a README.md file that explains step by step how to run the algorithm and replicate the results of the paper. Moreover, the pseudocode of our algorithm is given Algorithm 1. In Section 5 it is precisely detailed how all the hyper-parameters are chosen and, for each experiment, which dataset is used. As for the theoretical results presented in Section 4, complete proofs are given in the appendixes. 8 ETHICS STATEMENT We believe that this work does not raise potential ethical concerns. ACKNOWLEDGEMENTS This work was funded by the French ministry of research through a CDSN grant of ENS Paris Saclay. This study has also been carried out with financial support from the French Research Agency through the Post Prod LEAP and Mistic projects (ANR-19-CE23-0027-01 and ANR-19-CE40-005). Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 126 135, 2017. Rizwan Ahmad, Charles A Bouman, Gregery T Buzzard, Stanley Chan, Sizhuo Liu, Edward T Reehorst, and Philip Schniter. Plug-and-play methods for magnetic resonance imaging: Using denoisers for image recovery. IEEE signal processing magazine, 37(1):105 116, 2020. H edy Attouch, J erˆome Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdykałojasiewicz inequality. Mathematics of operations research, 35(2):438 457, 2010. Hedy Attouch, J erˆome Bolte, and Benar Fux Svaiter. Convergence of descent methods for semialgebraic and tame problems: proximal algorithms, forward backward splitting, and regularized gauss seidel methods. Mathematical Programming, 137(1-2):91 129, 2013. H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011. Amir Beck. First-order methods in optimization. SIAM, 2017. Amir Beck and Marc Teboulle. Gradient-based algorithms with applications to signal recovery. Convex optimization in signal processing and communications, pp. 42 88, 2009. Siavash A Bigdeli, Geng Lin, Tiziano Portenier, L Andrea Dunbar, and Matthias Zwicker. Learning generative models using denoising density estimators. ar Xiv preprint ar Xiv:2001.02728, 2020. Siavash Arjomand Bigdeli and Matthias Zwicker. Image restoration using autoencoding priors. ar Xiv preprint ar Xiv:1703.09964, 2017. Siavash Arjomand Bigdeli, Meiguang Jin, Paolo Favaro, and Matthias Zwicker. Deep mean-shift priors for image restoration. 
ar Xiv preprint ar Xiv:1709.03749, 2017. Pakshal Bohra, Alexis Goujon, Dimitris Perdios, S ebastien Emery, and Michael Unser. Learning lipschitz-controlled activation functions in neural networks for plug-and-play image reconstruction methods. In Neur IPS 2021 Workshop on Deep Learning and Inverse Problems, 2021. J erˆome Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet. Characterizations of łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319 3363, 2010. Published as a conference paper at ICLR 2022 Gregery T Buzzard, Stanley H Chan, Suhas Sreehari, and Charles A Bouman. Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium. SIAM Journal on Imaging Sciences, 11(3):2001 2020, 2018. Luca Calatroni and Antonin Chambolle. Backtracking strategies for accelerated descent methods with smooth composite objectives. SIAM Journal on Optimization, 29(3):1772 1798, 2019. Stanley H Chan. Performance analysis of plug-and-play admm: A graph signal processing perspective. IEEE Transactions on Computational Imaging, 5(2):274 286, 2019. Stanley H Chan, Xiran Wang, and Omar A Elgendy. Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1): 84 98, 2016. Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE transactions on pattern analysis and machine intelligence, 39(6):1256 1272, 2016. Regev Cohen, Michael Elad, and Peyman Milanfar. Regularization by denoising via fixed-point projection (RED-PRO). SIAM Journal on Imaging Sciences, 14(3):1374 1406, 2021. Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185 212. Springer, 2011. Bradley Efron. Tweedie s formula and selection bias. Journal of the American Statistical Association, 106(496):1602 1614, 2011. Ruturaj G Gavaskar and Kunal N Chaudhury. Plug-and-play ista converges with kernel denoisers. IEEE Signal Processing Letters, 27:610 614, 2020. Ruturaj G Gavaskar, Chirayu D Athalye, and Kunal N Chaudhury. On plug-and-play regularization using linear denoisers. IEEE Transactions on Image Processing, 2021. Ruturaj Girish Gavaskar and Kunal Narayan Chaudhury. On the proof of fixed-point convergence for plug-and-play ADMM. IEEE Signal Processing Letters, 26(12):1817 1821, 2019. Donald Geman and Chengda Yang. Nonlinear image recovery with half-quadratic regularization. IEEE transactions on Image Processing, 4(7):932 946, 1995. Mario Gonz alez, Andr es Almansa, and Pauline Tan. Solving inverse problems by joint posterior maximization with autoencoding prior. ar Xiv preprint ar Xiv:2103.01648, 2021. Johannes Hertrich, Sebastian Neumayer, and Gabriele Steidl. Convolutional proximal neural networks and plug-and-play algorithms. Linear Algebra and its Applications, 631:203 234, 2021. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. ar Xiv preprint ar Xiv:2006.11239, 2020. Samuel Hurault, Thibaud Ehret, and Pablo Arias. Epll: an image denoising method using a gaussian mixture model learned on a large set of patches. Image Processing On Line, 8:465 489, 2018. Zahra Kadkhodaie and Eero Peter Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. In Neur IPS 2020 Workshop on Deep Learning and Inverse Problems, 2020. 
R emi Laumont, Valentin De Bortoli, Andr es Almansa, Julie Delon, Alain Durmus, and Marcelo Pereyra. Bayesian imaging using plug & play priors: when langevin meets tweedie. ar Xiv preprint ar Xiv:2103.04715, 2021. Anat Levin, Yair Weiss, Fredo Durand, and William T Freeman. Understanding and evaluating blind deconvolution algorithms. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1964 1971. IEEE, 2009. Huan Li and Zhouchen Lin. Accelerated proximal gradient methods for nonconvex programming. Advances in neural information processing systems, 28:379 387, 2015. Published as a conference paper at ICLR 2022 Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136 144, 2017. Jiaming Liu, Yu Sun, Cihat Eldeniz, Weijie Gan, Hongyu An, and Ulugbek S Kamilov. Rare: Image reconstruction using deep priors learned without groundtruth. IEEE Journal of Selected Topics in Signal Processing, 14(6):1088 1099, 2020. Jiaming Liu, M Salman Asif, Brendt Wohlberg, and Ulugbek S Kamilov. Recovery analysis for plug-and-play priors using the restricted eigenvalue condition. ar Xiv preprint ar Xiv:2106.03668, 2021. Sebastian Lunz, Ozan Oktem, and Carola-Bibiane Sch onlieb. Adversarial regularizers in inverse problems. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8516 8525, 2018. Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo Exploration Database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2):1004 1016, Feb. 2017. St ephane Mallat. A Wavelet Tour of Signal Processing, The Sparse Way. Academic Press, Elsevier, 3rd edition edition, 2009. ISBN 978-0-12-374370-1. D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int l Conf. Computer Vision, volume 2, pp. 416 423, July 2001. Tim Meinhardt, Michael Moller, Caner Hazirbas, and Daniel Cremers. Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1781 1790, 2017. Christopher Metzler, Phillip Schniter, Ashok Veeraraghavan, et al. prdeep: robust phase retrieval with a flexible deep network. In International Conference on Machine Learning, pp. 3501 3510. PMLR, 2018. Jean-Jacques Moreau. Proximit e et dualit e dans un espace hilbertien. Bulletin de la Soci et e Math ematique de France, 93:273 299, 1965. URL http://eudml.org/doc/87067. Pravin Nair, Ruturaj Girish Gavaskar, and Kunal Narayan Chaudhury. Fixed-point and objective convergence of plug-and-play algorithms. IEEE Transactions on Computational Imaging, 2021. Peter Ochs, Yunjin Chen, Thomas Brox, and Thomas Pock. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388 1419, 2014. Jean-Christophe Pesquet, Audrey Repetti, Matthieu Terris, and Yves Wiaux. Learning maximally monotone operators for image recovery. SIAM Journal on Imaging Sciences, 14(3):1206 1237, 2021. Jean Prost, Antoine Houdard, Andr es Almansa, and Nicolas Papadakis. Learning local regularization for variational image restoration. ar Xiv preprint ar Xiv:2102.06155, 2021. 
Edward T Reehorst and Philip Schniter. Regularization by denoising: Clarifications and new interpretations. IEEE transactions on computational imaging, 5(1):52 67, 2018. Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804 1844, 2017. Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D, 60:259 268, 1992. Ernest Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Plug-andplay methods provably converge with properly trained denoisers. In International Conference on Machine Learning, pp. 5546 5557. PMLR, 2019. Published as a conference paper at ICLR 2022 Tim Salimans and Jonathan Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop-ICLR 2021, 2021. Saeed Saremi. On approximating f with neural networks. ar Xiv preprint ar Xiv:1910.12744, 2019. Saeed Saremi and Aapo Hyvarinen. Neural empirical bayes. ar Xiv preprint ar Xiv:1903.02334, 2019. Katya Scheinberg, Donald Goldfarb, and Xi Bai. Fast first-order methods for composite convex optimization with backtracking. Foundations of Computational Mathematics, 14(3):389 417, 2014. Suhas Sreehari, S Venkat Venkatakrishnan, Brendt Wohlberg, Gregery T Buzzard, Lawrence F Drummy, Jeffrey P Simmons, and Charles A Bouman. Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Transactions on Computational Imaging, 2(4): 408 423, 2016. Yu Sun, Jiaming Liu, and Ulugbek S Kamilov. Block coordinate regularization by denoising. ar Xiv preprint ar Xiv:1905.05113, 2019a. Yu Sun, Brendt Wohlberg, and Ulugbek S Kamilov. An online plug-and-play algorithm for regularized image reconstruction. IEEE Transactions on Computational Imaging, 5(3):395 408, 2019b. Yu Sun, Jiaming Liu, Yiran Sun, Brendt Wohlberg, and Ulugbek S Kamilov. Async-red: A provably convergent asynchronous block parallel stochastic method using deep denoising priors. ar Xiv preprint ar Xiv:2010.01446, 2020. Yu Sun, Zihui Wu, Xiaojian Xu, Brendt Wohlberg, and Ulugbek S Kamilov. Scalable plug-and-play admm with convergence guarantees. IEEE Transactions on Computational Imaging, 7:849 863, 2021. Matthieu Terris, Audrey Repetti, Jean-Christophe Pesquet, and Yves Wiaux. Building firmly nonexpansive convolutional neural networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8658 8662. IEEE, 2020. Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pp. 945 948. IEEE, 2013. Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Sch onlieb, and Hua Huang. Tuning-free plug-and-play proximal algorithm for inverse imaging problems. In International Conference on Machine Learning, pp. 10158 10169. PMLR, 2020. Xiaojian Xu, Yu Sun, Jiaming Liu, Brendt Wohlberg, and Ulugbek S Kamilov. Provable convergence of plug-and-play priors with mmse denoisers. ar Xiv preprint ar Xiv:2005.07685, 2020. Xin Yuan, Yang Liu, Jinli Suo, and Qionghai Dai. Plug-and-play algorithms for large-scale snapshot compressive imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. 
IEEE Transactions on Image Processing, 26 (7):3142 3155, 2017a. Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3929 3938, 2017b. Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnnbased image denoising. IEEE Transactions on Image Processing, 27(9):4608 4622, 2018. Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. Published as a conference paper at ICLR 2022 Ningning Zhao, Qi Wei, Adrian Basarab, Nicolas Dobigeon, Denis Kouam e, and Jean-Yves Tourneret. Fast single image super-resolution using a new analytical solution for ℓ2 ℓ2 problems. IEEE Transactions on Image Processing, 25(8):3683 3697, 2016. Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pp. 479 486. IEEE, 2011. A COMPARISON WITH THE PNP LITERATURE A.1 LIMITATIONS OF PREVIOUS PNP METHODS In the existing literature, Pn P approaches have one of the following limitations: They are not able to provide proof of convergence when non strongly convex data-fidelity terms are involved (Ryu et al., 2019), which is the case of some classical IR problems such as deblurring, super-resolution or inpainting. They are restricted to (nearly) nonexpansive denoisers (Reehorst & Schniter, 2018; Ryu et al., 2019; Sun et al., 2019b; Xu et al., 2020) or denoisers with a symmetric Jacobian (Romano et al., 2017). But it has already been shown that imposing symmetric Jacobian or Lipschitz property on a deep denoiser network alters its denoising performance (Bohra et al., 2021; Hertrich et al., 2021). We highlight that it was already empirically observed (Romano et al., 2017; Zhang et al., 2021) that the performance of the denoiser directly impacts the performance of the corresponding Pn P scheme for IR. They show convergence of iterates thanks to decreasing time steps (Chan et al., 2016), but there is no characterization of the obtained solution (it is not a minima or a critical point of any functional). On the other hand, our method is proved to converge to a stationary point of an explicit functional including a non strongly convex data-fidelity term. It also relies on a (possibly expansive) denoiser that, although being constrained to be a conservative vector field, allows to produce state-of-the-art results for various ill-posed IR problems. A.2 ON THE REGULARIZATION gσ. We first underline that the main point of our method is to define the denoiser as Dσ = Id gσ. The choice for gσ is important for the denoising performance. With respect to the convergence properties of GS-Pn P, this is nevertheless a secondary issue, as our method would converge for other differentiable regularizers gσ. The proposed regularization gσ(x) = 1 2||x Nσ(x)||2 was previously mentioned in the RED original paper (Romano et al., 2017) (but explicitly left aside) and used in the DAEP paper (Bigdeli & Zwicker, 2017). The main difference between our regularizer and the one alternately proposed in RED and DAEP is the following: RED and DAEP both consider a generic given pretrained denoiser Dσ : Rn Rn, which is then associated with the regularizer gσ(x) = 1 2||x Dσ(x)||2 and used as such in IR problems. 
In our method, we set gσ(x) = 1 2||x Nσ(x)||2 (with Nσ : Rn Rn differentiable) and then we train the denoiser as Dσ = Id gσ with the loss function ||Dσ(x + ξ) x||2 for clean images x and additive white Gaussian noise (AWGN) ξ. With this new formulation, we are ensured that Dσ = Id gσ is inherently a conservative vector field, without further assumptions on Nσ. Thanks to this relation, the (slightly modified) Pn P-HQS given in relation (9) becomes a proximal gradient descent (PGD). We can then make use of convergence results of the PGD algorithm in the nonconvex setting to show the convergence of Pn P-HQS. In contrast to the original RED paper, we aimed at finding one setting of Plug-and-Play image restoration that allows for a convergence proof with sufficiently general hypotheses. For this purpose, we had to consider this very particular form of regularization. Published as a conference paper at ICLR 2022 A.3 RECENT LITERATURE ON REGULARIZATION BY DENOISING We here provide a more detailed discussion on the follow-up literature on RED. In parallel to the RED method (Romano et al., 2017), the authors of Bigdeli & Zwicker (2017) propose to use the regularization, mentioned but not exploited in RED, g(x) = ||Dσ(x) x||2, where Dσ is a pretrained denoising autoencoder. Next, Bigdeli et al. (2017) extended this work with a new prior, which is the Gaussian-smoothed version of the natural image prior. Inspired by Tweedie s formula, they approximate the gradient of this log smoothed prior with the residual of a pretrained denoising autoencoder. With this new formulation, it is possible to optimize on the restored image but also on other parameters (e.g. the noise level and the used blur kernel). Initially designed in the context of convex data-fidelity term, RED Romano et al. (2017) has been applied in the nonconvex setting for phase retrieval problems in pr Deep Metzler et al. (2018). Regularization by Artifact-Removal (RARE) (Liu et al., 2020) extends the RED framework by replacing the denoiser by a more general artifact-removal operator. The main advantage of this operator is that it can be trained without groundtruth data, but only by mapping pairs of artifact and noise contaminated images obtained directly from undersampled measurements. This is particularly useful for medical imaging applications where it is difficult to acquire fully-sampled training data. The convergence of the original RED algorithm is discussed in Reehorst & Schniter (2018). The authors provide a convergence proof for RED-PGD which requires the denoiser to be nonexpansive, which, as detailed in the previous sections, is a restrictive hypothesis. The authors of Liu et al. (2021) provide a recovery guarantee for the Pn P framework, meaning convergence to a x that satisfies y = Ax while being in the set Fix(D) of the fixed points of D. More precisely, they show the convergence of the Pn P-PGD method towards such a true solution x under the assumptions that the denoiser residual R = Id D is bounded and Lipschitz, and that the measurement operator satisfies a set-restricted eigenvalue condition (S-REC, which can be understood as strong convexity on the image of the denoiser Im(D)). Under the additional assumptions that the denoiser is nonexpansive and that there exists x Fix(D) that is also critical for the regularizer g, they show that Pn P and RED have the same solutions. 
As mentioned by the authors, it is nevertheless difficult to verify the S-REC condition for a given measurement operator: since Im(D) is not explicit, it is not clear how much S-REC relaxes the strong convexity. As explained in Sections 2 and 3, our results do not require strong convexity of the data-fidelity term. Instead of including an explicit regularization in the functional, RED-PRO (Cohen et al., 2021) aims at minimizing the data-fidelity term on the set Fix(D) of fixed points of a generic denoiser D. The study is conducted under the hypothesis that the denoiser is demicontractive, which implies that Fix(D) is convex, thus leading to a convex optimization problem. However, this assumption seems difficult to verify in practice and the existence of fixed points for the RED-PRO operator does not appear straightforward. In contrast, the fixed points of the GS-Pn P operator are directly related to the stationary points of the global functional F = f + λgσ (Lemma 1 in Appendix C), whose existence is guaranteed as soon as F is coercive (see the discussion in Appendix D). ASYNC-RED (Sun et al., 2020) enables faster computation of RED by decomposing the inference into a sequence of partial (block-coordinate) updates on x which can be executed asynchronously in parallel over a multicore system. As in their previous work BC-RED (Sun et al., 2019a), the authors propose to further reduce the computational time by using only a random subset of measurements at every iteration. Convergence of ASYNC-RED is shown, provided the denoiser is nonexpansive. A possible future extension of our work is the integration of the ASYNC framework to accelerate GS-Pn P for large scale imaging inverse problems. As our GS-Pn P converges without assuming nonexpansiveness of the denoising operation, it would be interesting to see if one can adapt the GS-Pn P convergence properties to an ASYNC-GSPn P algorithm. B LIPSCHITZ CONSTANT OF gσ First, let us give a result which ensures that a large class of neural networks trained with differentiable activation functions have Lipschitz gradients with respect to the input image. Published as a conference paper at ICLR 2022 Proposition 2. Let H = hp . . . h1 be a composition of differentiable functions hi : Rdi 1 Rdi. Let us assume that for any i the differential map h i is bounded and Lipschitz. Then H is Lipschitz. Proof. Let us denote Hi = hi . . . h1 (and by convention, H0 = Id). Let h i be the best uniform bound on the operator norms h i(x) , x Rdi 1 (which is also the best Lipschitz constant of hi). Let us also denote h i Lip the Lipschitz constant of h i. The chain rule gives that for any x, H (x) can be expressed as a composition of linear maps H (x) = h p(Hp 1(x))h p 1(Hp 2(x)) . . . h 1(x) (15) Therefore, for any x, y, H (x) H (y) = i=0 h p(Hp 1(x)) . . . h i+2(Hi+1(x))h i+1(Hi(x))h i(Hi 1(y)) . . . h 1(y) (16) h p(Hp 1(x)) . . . h i+2(Hi+1(x))h i+1(Hi(y))h i(Hi 1(y)) . . . h 1(y). (17) We can thus bound the operator norms H (x) H (y) h p(Hp 1(x)) . . . h i+2(Hi+1(x)) (18) h i+1(Hi(x)) h i+1(Hi(y)) h i(Hi 1(y)) . . . h 1(y) . (19) H (x) H (y) j =i+1 h j h i+1 Lip H i x y (20) which concludes because the chain-rule ensures that H i h i . . . h 1 . Proposition 2 applies for a neural network obtained as a composition of fully-connected layers with ELU activation functions, that is, by composing functions of the form h(x) = E(Ax + b) (21) where A is a matrix, b a vector and E is the element-wise ELU defined by E(x)i = xi if xi 0 exi 1 if xi < 0 . 
It is easy to see that $E$ is differentiable and that $E'$ is 1-Lipschitz with $\|E'\| \le 1$. Therefore
$$h'(x) = E'(Ax + b)\,A \qquad (23)$$
is also bounded and Lipschitz. Let us also mention that this proposition encompasses the case of U-Nets which, in addition to composing fully-connected layers, also integrate skip-connections. For example, taking a skip-connection over a composition $h_3 \circ h_2 \circ h_1$ amounts to defining
$$H(x) = h_3\big(h_2(h_1(x)),\, h_1(x)\big). \qquad (24)$$
This can simply be rewritten as $H = h_3 \circ \tilde h_2$ where
$$\tilde h_2(x) = \big(h_2(h_1(x)),\, h_1(x)\big). \qquad (25)$$
It is then clear that $\tilde h_2$ has a bounded and Lipschitz differential as soon as $h_1$ and $h_2$ do.

The bound obtained in the proof of Proposition 2 is exponential in the depth of the neural network. We now provide experiments showing that, in practice, the Lipschitz constant of $\nabla g_\sigma$ does not explode. We show in Figure 3, for various noise levels σ, the distribution of the spectral norms $\|\nabla^2 g_\sigma(x)\|_S$ over the training image set $X$, estimated with power iterations. The computed value varies a lot across images. Hence, approximating the Lipschitz constant of $\nabla g_\sigma$ with $L = \max_{x \in X} \|\nabla^2 g_\sigma(x)\|_S$ would lead to under-estimated stepsizes and slow convergence on most images. Backtracking solves this issue by finding, at each iteration, a stepsize allowing sufficient decrease of the objective function.

Figure 3: Histogram of the values of the spectral norm $\|\nabla^2 g_\sigma(x)\|_S$ evaluated on 128×128 images from the training dataset, degraded with white Gaussian noise with various standard deviations σ (./255). Figure best seen in color.
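The spectral norms reported in Figure 3 can be estimated with a standard power iteration on Hessian-vector products, which only requires automatic differentiation through $g_\sigma$. The sketch below illustrates the procedure; it assumes a callable `potential` implementing $g_\sigma$ (as in the snippet of Appendix A) and is not the authors' evaluation code.

```python
import torch

def hessian_spectral_norm(potential, x: torch.Tensor, n_iter: int = 50) -> float:
    """Estimate ||Hess g_sigma(x)||_S by power iteration on Hessian-vector products."""
    x = x.detach().requires_grad_(True)
    grad_g, = torch.autograd.grad(potential(x), x, create_graph=True)
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(n_iter):
        # Hv = d/dx <grad g_sigma(x), v>, with v treated as a constant
        hv, = torch.autograd.grad((grad_g * v).sum(), x, retain_graph=True)
        v = hv / (hv.norm() + 1e-12)
    hv, = torch.autograd.grad((grad_g * v).sum(), x, retain_graph=True)
    # For the (symmetric) Hessian, the largest absolute eigenvalue equals the spectral norm.
    return hv.norm().item()
```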
C PROOF OF THEOREM 1

We first recall that a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is proper if its domain
$$\mathrm{dom}(f) = \{x \in \mathbb{R}^n,\; f(x) < +\infty\} \qquad (26)$$
is non-empty. Also recall that $f$ is lower semicontinuous at $x'$ if $\liminf_{x \to x'} f(x) \ge f(x')$.

(i) For ease of notation, we consider $\lambda = 1$. The generalization to any $\lambda > 0$ is straightforward by rescaling $g_\sigma$ (and $L$) accordingly. We denote the proximal gradient fixed-point operator $T_\tau = \mathrm{Prox}_{\tau f} \circ (\mathrm{Id} - \tau \nabla g_\sigma)$, the objective function $F = f + g_\sigma$, and we introduce
$$Q_\tau(x, y) = g_\sigma(y) + \langle x - y, \nabla g_\sigma(y)\rangle + \frac{1}{2\tau}\|x - y\|^2 + f(x). \qquad (27)$$
Then
$$\arg\min_x Q_\tau(x, y) = \arg\min_x\; f(x) + \frac{1}{2\tau}\big\|x - \big(y - \tau \nabla g_\sigma(y)\big)\big\|^2 = \mathrm{Prox}_{\tau f}\circ(\mathrm{Id} - \tau \nabla g_\sigma)(y) = T_\tau(y).$$
By definition of the $\arg\min$, $x_{k+1} = T_\tau(x_k)$ satisfies $Q_\tau(x_{k+1}, x_k) \le Q_\tau(x_k, x_k)$. Moreover, as $g_\sigma$ is $L$-smooth, the descent lemma gives, for any $\tau \le \frac{1}{L}$ and any $x, y \in \mathbb{R}^n$,
$$g_\sigma(x) \le g_\sigma(y) + \langle x - y, \nabla g_\sigma(y)\rangle + \frac{1}{2\tau}\|x - y\|^2, \qquad (29)$$
so that, for every $x, y \in \mathbb{R}^n$,
$$Q_\tau(x, x) = F(x) \quad \text{and} \quad Q_\tau(x, y) \ge F(x). \qquad (30)$$
Therefore, at iteration $k$,
$$F(x_{k+1}) \le Q_\tau(x_{k+1}, x_k) \le Q_\tau(x_k, x_k) = F(x_k). \qquad (31)$$
The sequence $(F(x_k))$ is thus non-increasing. Since $F$ is lower-bounded, $(F(x_k))$ converges to a limit $F^\star$.

(ii) Note that $Q_\tau(x_{k+1}, x_k) \le Q_\tau(x_k, x_k)$ implies
$$f(x_{k+1}) \le f(x_k) - \langle x_{k+1} - x_k, \nabla g_\sigma(x_k)\rangle - \frac{1}{2\tau}\|x_{k+1} - x_k\|^2. \qquad (32)$$
Using also (29) with stepsize $\frac{1}{L}$,
$$F(x_{k+1}) = f(x_{k+1}) + g_\sigma(x_{k+1}) \le f(x_k) - \langle x_{k+1} - x_k, \nabla g_\sigma(x_k)\rangle - \frac{1}{2\tau}\|x_{k+1} - x_k\|^2 + g_\sigma(x_k) + \langle x_{k+1} - x_k, \nabla g_\sigma(x_k)\rangle + \frac{L}{2}\|x_{k+1} - x_k\|^2 = F(x_k) - \Big(\frac{1}{2\tau} - \frac{L}{2}\Big)\|x_{k+1} - x_k\|^2. \qquad (33)$$
Summing over $k = 0, 1, \dots, m$ gives
$$\sum_{k=0}^{m} \|x_{k+1} - x_k\|^2 \le \frac{1}{\frac{1}{2\tau} - \frac{L}{2}}\big(F(x_0) - F(x_{m+1})\big) \le \frac{2}{\frac{1}{\tau} - L}\big(F(x_0) - F^\star\big). \qquad (34)$$
Therefore, $\lim_{k\to\infty} \|x_{k+1} - x_k\| = 0$.

(iii) We begin with the two following lemmas characterizing the proximal gradient operator $T_\tau = \mathrm{Prox}_{\tau f} \circ (\mathrm{Id} - \tau \nabla g_\sigma)$.

Lemma 1. Under the assumptions of Theorem 1, a point $x^\star \in \mathbb{R}^n$ is a fixed point of the proximal gradient operator $T_\tau = \mathrm{Prox}_{\tau f} \circ (\mathrm{Id} - \tau \nabla g_\sigma)$, i.e. $T_\tau(x^\star) = x^\star$, if and only if $x^\star$ is a stationary point of problem (10), i.e. $-\nabla g_\sigma(x^\star) \in \partial f(x^\star)$.

Proof. By definition of the proximal operator, we have
$$T_\tau(x^\star) = x^\star \iff x^\star = \mathrm{Prox}_{\tau f}\circ(\mathrm{Id} - \tau \nabla g_\sigma)(x^\star) \iff x^\star - \tau \nabla g_\sigma(x^\star) - x^\star \in \tau\,\partial f(x^\star) \iff -\nabla g_\sigma(x^\star) \in \partial f(x^\star). \qquad (35)$$

Lemma 2. Under the assumptions of Theorem 1, $T_\tau$ is $(1 + \tau L)$-Lipschitz.

Proof. Using the fact that, for $f$ convex, $\mathrm{Prox}_{\tau f}$ is 1-Lipschitz (Bauschke & Combettes, 2011, Proposition 12.28), and the Lipschitz property of $\nabla g_\sigma$,
$$\|T_\tau(x) - T_\tau(y)\| = \|\mathrm{Prox}_{\tau f}\circ(\mathrm{Id} - \tau \nabla g_\sigma)(x) - \mathrm{Prox}_{\tau f}\circ(\mathrm{Id} - \tau \nabla g_\sigma)(y)\| \le \|(\mathrm{Id} - \tau \nabla g_\sigma)(x) - (\mathrm{Id} - \tau \nabla g_\sigma)(y)\| \le (1 + \tau L)\|x - y\|. \qquad (36)$$

Note that, by nonconvexity of $g_\sigma$, the fixed-point operator $T_\tau$ is not necessarily nonexpansive, but we can still show the convergence of the fixed-point algorithm towards a critical point of the objective function.

We can now turn to the proof of (iii). Let $x^\star$ be a cluster point of $(x_k)_{k \ge 0}$. Then there exists a subsequence $(x_{k_j})_{j \ge 0}$ converging to $x^\star$. We have, for all $j \ge 0$,
$$\|x^\star - T_\tau(x^\star)\| \le \|x^\star - x_{k_j}\| + \|x_{k_j} - T_\tau(x_{k_j})\| + \|T_\tau(x_{k_j}) - T_\tau(x^\star)\| \le (2 + \tau L)\|x^\star - x_{k_j}\| + \|x_{k_j} - T_\tau(x_{k_j})\| \quad \text{(by Lemma 2)}. \qquad (37)$$
Using (ii), the right-hand side of the inequality tends to 0 as $j \to \infty$. Thus $\|x^\star - T_\tau(x^\star)\| = 0$ and $x^\star = T_\tau(x^\star)$, which by Lemma 1 means that $x^\star$ is a stationary point of problem (10).

D ON THE BOUNDEDNESS OF $(x_k)$

In order to obtain convergence of the iterates, the sequence $(x_k)$ generated in Theorem 2 is assumed to be bounded. In the experiments (Section 5), we observed that, under the other assumptions of Theorem 2, boundedness was always verified. A sufficient condition for the boundedness of the iterates is the coercivity of the objective function, that is, $\lim_{\|x\|\to\infty} F(x) = +\infty$ (because the non-increasing property gives $F(x_k) \le F(x_0)$). Similarly to Laumont et al. (2021), we can force $F$ to be coercive by choosing a convex compact set $C \subset \mathbb{R}^n$ in which the iterates should stay, and by adding an extra term to the regularization $g_\sigma$:
$$\hat g_\sigma(x) = g_\sigma(x) + \frac{1}{2}\|x - \Pi_C(x)\|^2 = \frac{1}{2}\|x - N_\sigma(x)\|^2 + \frac{1}{2}\|x - \Pi_C(x)\|^2, \qquad (38)$$
with $\Pi_C$ the Euclidean projection onto $C$. As $\hat g_\sigma$ is still differentiable, the gradient step becomes
$$(\mathrm{Id} - \tau\lambda \nabla \hat g_\sigma)(x) = (\mathrm{Id} - \tau\lambda \nabla g_\sigma)(x) - \tau\lambda\,(x - \Pi_C(x)). \qquad (39)$$
In our experiments, we choose the compact set $C = [-1, 2]^n$. In practice, we observe that all the iterates remain in $C$ and that the extra regularization term $\|x - \Pi_C(x)\|^2$ is never activated. Therefore, we do not present this technical adaptation in Algorithm 1, but we let the reader be aware that the boundedness of $(x_k)$ is not a limiting assumption.

E KL PROPERTY

Definition 1 (Kurdyka-Łojasiewicz (KL) property, taken from Attouch et al. (2010)).
(a) A function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is said to have the Kurdyka-Łojasiewicz property at $x^\star \in \mathrm{dom}(f)$ if there exist $\eta \in (0, +\infty)$, a neighborhood $U$ of $x^\star$ and a continuous concave function $\psi : [0, \eta) \to \mathbb{R}_+$ such that $\psi(0) = 0$, $\psi$ is $C^1$ on $(0, \eta)$, $\psi' > 0$ on $(0, \eta)$, and for all $x \in U \cap [f(x^\star) < f < f(x^\star) + \eta]$ the Kurdyka-Łojasiewicz inequality holds:
$$\psi'\big(f(x) - f(x^\star)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \ge 1. \qquad (40)$$
(b) Proper lower semicontinuous functions which satisfy the Kurdyka-Łojasiewicz inequality at each point of $\mathrm{dom}(\partial f)$ are called KL functions.

This condition can be interpreted as the fact that, up to a reparameterization, the function is sharp, i.e. its subgradients can be bounded away from 0. A large class of functions satisfying the KL property is given by real semi-algebraic functions. For more details and interpretations, we refer to Attouch et al. (2010) and Bolte et al. (2010).
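As a simple illustration of Definition 1 (not taken from the paper), consider the power function $f(x) = |x|^q$ on $\mathbb{R}$ with $q \ge 1$ and $x^\star = 0$. Taking the concave desingularizing function $\psi(s) = s^{1/q}$, for every $x \neq 0$ one gets
$$\psi'\big(f(x) - f(x^\star)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \;=\; \frac{1}{q}\,|x|^{\,q\left(\frac{1}{q}-1\right)} \cdot q\,|x|^{\,q-1} \;=\; |x|^{\,1-q}\,|x|^{\,q-1} \;=\; 1 \;\ge\; 1,$$
so the KL inequality (40) holds at $x^\star = 0$ with $U = \mathbb{R}$. Semi-algebraic functions admit desingularizing functions of this power-law form, which is the typical way the KL property is checked in practice.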
F ON THE ASSUMPTIONS OF THEOREMS 1 AND 2

In this section, we explicitly list and comment on all the assumptions required by Theorems 1 and 2. These assumptions are standard in nonconvex optimization. We now detail why each assumption is verified for our plug-and-play image restoration algorithm.

Assumptions of Theorem 1:
- Data-fidelity term $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ proper, lower semicontinuous and convex. This is a general assumption that includes most of the data-fidelity terms classically used in IR problems. Note that we do not require differentiability of $f$. Degradations with Gaussian, Poisson or Laplacian noise models fall into this hypothesis. As noticed in Remark 3, our results can even be easily extended to a nonconvex data-fidelity term $f$, which encompasses applications like phase retrieval (Metzler et al., 2018). In practice, it is helpful to have $f$ proximable, i.e. $\mathrm{Prox}_f$ available in closed form; otherwise, $\mathrm{Prox}_f$ needs to be computed at each iteration with an inner optimization algorithm.
- Regularization function $g_\sigma : \mathbb{R}^n \to \mathbb{R}$ proper, lower semicontinuous and differentiable with $L$-Lipschitz gradient. We parameterize $g_\sigma(x) = \frac{1}{2}\|x - N_\sigma(x)\|^2$ with $N_\sigma$ a differentiable neural network. This assumption on $g_\sigma$ is thus reasonable from a practical perspective: using a network $N_\sigma$ with differentiable activation functions, $g_\sigma$ is differentiable with Lipschitz gradient (details and proof are given in Appendix B).
- Functional $F = f + \lambda g_\sigma$ bounded from below. This is straightforward, as all the terms are nonnegative.
- Stepsize $\tau < \frac{1}{\lambda L}$. This is handled by backtracking (see Section 4.2).

Assumptions of Theorem 2:
- Assumptions of Theorem 1.
- $F$ verifies the KL property. The KL property (defined in Appendix E) has been widely used to study the convergence of optimization algorithms in the nonconvex setting (Attouch et al., 2010; 2013; Ochs et al., 2014). Very large classes of functions, in particular all semialgebraic functions, satisfy this technical property. It encompasses all the data-fidelity and regularization terms encountered in inverse problems.
- The sequence $(x_k)$ given by the iterative scheme (9) is bounded. As discussed in Appendix D, boundedness can be ensured with a potential additional projection at each iteration. This is only a theoretical safeguard, as we observed that such a projection is never activated in practice.

G BACKTRACKING AND PROOF OF PROPOSITION 1

Before giving the proof of Proposition 1, we first point out that our backtracking line search is a classical Armijo-type backtracking strategy, already used for nonconvex optimization in (Beck, 2017, Chapter 10) or Ochs et al. (2014). Other procedures could be investigated in future work. For instance, Li & Lin (2015) use a Barzilai-Borwein rule to initialize the backtracking line search. Scheinberg et al. (2014) and Calatroni & Chambolle (2019) have also proposed backtracking strategies that allow for both decreasing and increasing the stepsize. We now give the proof of Proposition 1.

Proof. For a given stepsize $\tau$, we showed in Appendix C, equation (33), that
$$F(x_k) - F(T_\tau(x_k)) \ge \Big(\frac{1}{2\tau} - \frac{L}{2}\Big)\|T_\tau(x_k) - x_k\|^2. \qquad (41)$$
Taking $\tau < \frac{1 - 2\gamma}{L}$, we get $\frac{1}{2\tau} - \frac{L}{2} > \frac{\gamma}{\tau}$, and thus
$$F(x_k) - F(T_\tau(x_k)) > \frac{\gamma}{\tau}\,\|T_\tau(x_k) - x_k\|^2. \qquad (42)$$
Hence, when $\tau < \frac{1 - 2\gamma}{L}$, the sufficient decrease condition (42) is satisfied and the backtracking procedure ($\tau \leftarrow \eta\tau$) must terminate. In the proof of Theorem 1, we can replace the sufficient decrease (33) by (42) and finish the proof with the same arguments. In the same way, in the proof of Theorem 2 given in (Attouch et al., 2013, Theorem 5.1), our sufficient decrease (42) replaces (Attouch et al., 2013, Equation (52)).
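For concreteness, the following Python sketch shows one GS-PnP update with this Armijo-type backtracking. The helpers `prox_f(x, tau)`, `grad_reg(x)` (returning $\lambda\nabla g_\sigma(x)$) and `F(x)` (returning $f(x) + \lambda g_\sigma(x)$) are assumed to be provided by the user, and the default values of $\gamma$ and $\eta$ are illustrative, not the paper's settings.

```python
def gs_pnp_step(x, tau, prox_f, grad_reg, F, gamma=0.1, eta=0.9):
    """One proximal gradient step T_tau(x) = Prox_{tau f}(x - tau * lambda * grad g_sigma(x)),
    with the stepsize shrunk (tau <- eta * tau) until the sufficient decrease (42) holds."""
    Fx = F(x)
    while True:
        z = prox_f(x - tau * grad_reg(x), tau)
        if Fx - F(z) > (gamma / tau) * ((z - x) ** 2).sum():
            return z, tau  # sufficient decrease reached; reuse this stepsize at the next iteration
        tau = eta * tau
```

The snippet works with NumPy arrays or PyTorch tensors alike; by Proposition 1, the inner loop terminates as soon as $\tau < (1 - 2\gamma)/L$, so no knowledge of $L$ is required.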
H DRUNET LIGHT ARCHITECTURE

The architecture of the DRUNet light denoiser (Zhang et al., 2021) used to parameterize $N_\sigma$ is given in Figure 4.

Figure 4: Architecture of the DRUNet light denoiser (Zhang et al., 2021) used to parameterize $N_\sigma$.

I EXPANSIVENESS OF THE DENOISER

As $g_\sigma$ is not necessarily convex, our GS-DRUNet denoiser $D_\sigma = \mathrm{Id} - \nabla g_\sigma$ is not necessarily nonexpansive, and neither is the gradient step $\mathrm{Id} - \lambda\tau \nabla g_\sigma$. This is not an issue since, unlike previous theoretical PnP studies (Terris et al., 2020; Reehorst & Schniter, 2018), our convergence results do not require a nonexpansive denoising step. To support the claim that our method converges without this assumption, we show in Figure 5 the evolution of $\frac{\|D_\sigma(x_{k+1}) - D_\sigma(x_k)\|}{\|x_{k+1} - x_k\|}$ along the algorithm that was run to obtain the super-resolution results of Figure 7. In this experiment, backtracking was not activated and the stepsize stayed fixed at $\lambda\tau = 1$, so the gradient step of the PGD algorithm was exactly the denoising step $D_\sigma = \mathrm{Id} - \lambda\tau \nabla g_\sigma$. Note that the local Lipschitz constant of $D_\sigma$ goes above 1, but convergence is still observed, as shown by the two convergence curves in Figure 7.

Figure 5: Lipschitz constant of $D_\sigma$ along the iterates of the algorithm when performing the super-resolution experiment presented in Figure 7. The Lipschitz constant goes above 1, i.e. $D_\sigma$ is not nonexpansive, but convergence is still empirically verified (see the convergence curves in Figure 7).
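The quantity plotted in Figure 5 is straightforward to monitor along any PnP run. A minimal sketch is given below, assuming PyTorch tensors, where `denoiser` stands for $D_\sigma$ (e.g. the gradient-step denoiser sketched in Appendix A) and `iterates` is the list of PnP iterates $x_k$.

```python
def local_lipschitz_ratios(denoiser, iterates):
    """||D(x_{k+1}) - D(x_k)|| / ||x_{k+1} - x_k|| for consecutive PnP iterates."""
    ratios = []
    for x_prev, x_next in zip(iterates[:-1], iterates[1:]):
        num = (denoiser(x_next) - denoiser(x_prev)).norm()
        den = (x_next - x_prev).norm()
        ratios.append(float(num / (den + 1e-12)))
    return ratios
```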
J ADDITIONAL EXPERIMENTS

J.1 DEBLURRING

We give here additional image deblurring experiments. We first present the PSNR comparison on the Set3C dataset in Table 4. We also evaluate the three best methods (GS-PnP, DPIR and IRCNN) on the full CBSD68 dataset in Table 5. For a fair comparison with RED, we also report in Table 6 the PSNR computed on the Y channel only. An additional visual comparison is shown in Figure 6. Details and comments are given in the corresponding captions.

ν = 0.01
  Method   (a)    (b)    (c)    (d)    (e)    (f)    (g)    (h)    (i)    (j)    Avg
  EPLL     23.83  24.14  24.83  19.85  26.08  21.77  21.53  21.57  22.43  21.36  22.74
  RED      29.21  28.58  29.52  24.54  30.45  25.34  26.06  26.07  25.11  28.50  27.34
  IRCNN    33.36  33.06  33.11  32.87  34.24  34.08  33.25  32.87  27.78  29.67  32.45
  MMO      32.84  32.29  32.76  31.85  34.08  33.76  33.11  32.38  26.31  29.91  31.93
  DPIR     34.94  34.46  34.25  34.34  35.57  35.53  34.49  34.21  28.14  29.63  33.56
  GS-PnP   34.58  34.13  34.04  33.93  35.45  35.25  34.30  33.97  28.16  29.78  33.34

ν = 0.03
  EPLL     21.21  21.10  22.65  18.78  24.12  20.77  20.42  19.89  20.61  20.60  21.02
  RED      25.42  24.89  25.69  22.67  26.86  23.84  24.06  23.87  21.49  25.45  24.43
  IRCNN    29.08  28.62  29.03  28.46  30.51  30.06  29.23  28.74  24.39  27.39  28.55
  DPIR     30.33  29.74  29.87  29.67  31.27  31.08  30.21  29.72  25.02  27.84  29.48
  GS-PnP   30.29  29.84  30.14  29.58  31.53  31.24  30.41  29.96  26.13  28.56  29.77

ν = 0.05
  EPLL     19.84  19.60  21.40  17.71  22.77  19.68  19.02  18.24  19.81  20.12  19.82
  RED      21.93  21.27  22.79  20.32  24.01  22.05  22.06  21.41  19.79  23.21  21.88
  IRCNN    26.85  26.33  27.04  26.10  28.46  27.90  27.05  26.56  22.90  26.16  26.54
  DPIR     27.96  27.37  28.07  27.44  29.42  29.04  28.32  27.56  23.57  26.93  27.57
  GS-PnP   28.08  27.75  28.35  27.56  29.60  29.17  28.49  28.01  24.67  27.47  27.91

Table 4: PSNR (dB) comparison of image deblurring methods on Set3C with various blur kernels k and noise levels ν (in the original paper, best and second-best results are shown in bold and underlined). Similarly to Table 2, for all kinds of kernels, the proposed method outperforms all competing methods at noise levels 0.03 and 0.05 and follows DPIR at the lower noise level 0.01.

ν = 0.01
  Method   (a)    (b)    (c)    (d)    (e)    (f)    (g)    (h)    (i)    (j)    Avg
  IRCNN    32.47  32.14  31.94  31.97  32.94  33.13  31.92  31.62  27.57  28.45  31.42
  DPIR     33.26  32.82  32.48  32.65  33.57  33.85  32.49  32.22  27.65  28.26  31.93
  GS-PnP   32.95  32.54  32.26  32.31  33.41  33.71  32.29  31.92  27.43  28.17  31.70

ν = 0.03
  IRCNN    28.43  28.11  28.28  27.87  29.42  29.21  28.37  27.97  25.52  26.96  28.01
  DPIR     28.88  28.53  28.55  28.30  29.58  29.62  28.69  28.28  25.60  26.96  28.30
  GS-PnP   28.64  28.32  28.55  28.06  29.71  29.60  28.69  28.31  25.79  27.10  28.28

ν = 0.05
  IRCNN    26.73  26.42  26.73  26.13  27.69  27.39  26.69  26.33  24.68  26.18  26.40
  DPIR     27.04  26.80  27.07  26.53  28.00  27.85  27.17  26.72  24.75  26.32  26.82
  GS-PnP   26.93  26.72  27.07  26.45  28.09  27.87  27.21  26.82  25.02  26.45  26.86

Table 5: PSNR (dB) performance of the fastest methods (IRCNN, DPIR and GS-PnP) for image deblurring on the full CBSD68 dataset with various blur kernels k and noise levels ν, in the same conditions as Table 2. On CBSD10 (Table 2) and on CBSD68 (Table 5), we observed very similar performance gaps between the compared methods, which confirms that CBSD10 is large enough to accurately compare the PnP methods.

ν = 2/255
  Method     (a)    (b)    (c)    (d)    (e)    (f)    (g)    (h)    (i)    (j)    Avg
  RED_Y      35.19  34.78  34.58  34.64  35.49  35.65  34.70  34.37  30.15  31.15  34.07
  GS-PnP_Y   36.20  35.76  35.21  35.55  36.33  36.87  35.16  34.79  29.21  29.48  34.45

ν = 0.01
  RED_Y      33.52  33.20  33.20  33.05  34.20  34.21  33.22  32.90  28.77  30.44  32.67
  GS-PnP_Y   33.88  33.39  33.15  33.15  34.35  34.58  33.22  32.81  28.23  29.12  32.59

ν = 0.03
  RED_Y      29.26  28.83  29.28  28.53  30.64  30.48  29.50  29.06  26.18  28.78  29.05
  GS-PnP_Y   29.45  29.11  29.39  28.82  30.55  30.46  29.55  29.14  26.50  28.02  29.01

ν = 0.05
  RED_Y      26.91  26.54  27.52  26.23  28.68  28.25  27.67  27.07  25.36  27.98  27.22
  GS-PnP_Y   27.65  27.48  27.88  27.19  28.90  28.67  28.03  27.59  25.62  27.29  27.63

Table 6: PSNR (dB) performance, evaluated on the luminance channel in YCbCr color space, of RED and GS-PnP for image deblurring on CBSD10. Recall that our method processes the RGB image as a whole before being evaluated on the Y channel, while RED processes the Y channel independently. Compared to Table 2, we add the case ν = 2/255, as in the original RED paper (Romano et al., 2017). Note that RED was optimized for kernels (i) and (j) and ν = 2/255, and outperforms our method in this setting. However, over the variety of kernels and noise levels, and in particular for motion blur, our method generally outperforms RED.

Figure 6: Deblurring with various methods of an image from CBSD10 degraded with the indicated blur kernel and input noise level ν = 0.01: (a) Observed (18.31 dB), (b) RED (28.43 dB), (c) IRCNN (30.61 dB), (d) MMO (30.23 dB), (e) DPIR (30.85 dB), (f) GS-PnP (30.74 dB). In (g) and (h), we show the evolution of $F(x_k)$ and $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2/\|x_0\|^2$ (log scale) along our algorithm. Note that GS-PnP and DPIR both recover fine textures, while the other methods tend to smooth out details.

J.2 SUPER-RESOLUTION

We also present additional super-resolution experiments. We provide a full PSNR comparison on the Set3C dataset in Table 7 and show additional visual comparisons between methods in Figure 7 and Figure 8. Details and comments are given in the corresponding captions.
First kernel group                    s = 2                          s = 3
  Method    ν = 0.01  ν = 0.03  ν = 0.05   ν = 0.01  ν = 0.03  ν = 0.05   Avg
  Bicubic   21.92     21.54     20.90      19.76     19.53     19.11      20.46
  RED       28.22     25.62     23.61      24.91     23.38     21.82      24.59
  IRCNN     28.35     26.40     25.27      25.61     24.45     23.37      25.58
  DPIR      29.08     27.27     26.21      26.55     25.33     24.41      26.48
  GS-PnP    29.24     28.03     26.65      25.90     25.56     24.60      27.00

Second kernel group
  Bicubic   19.82     19.58     19.16      18.95     18.76     18.40      19.11
  RED       24.72     22.55     21.10      22.82     21.64     20.19      22.17
  IRCNN     25.10     23.44     22.52      24.25     22.60     21.58      23.25
  DPIR      26.22     24.52     23.56      25.34     23.57     22.50      24.29
  GS-PnP    25.45     24.84     23.80      24.53     23.73     22.71      24.18

Table 7: PSNR (dB) comparison of image super-resolution methods on Set3C with various scales s, blur kernels k (shown as images in the original table) and noise levels ν. Similarly to Table 3, for isotropic and anisotropic kernels, the proposed method outperforms all competing methods at noise levels 0.03 and 0.05 and follows DPIR at the lower noise level 0.01.

First kernel group                    s = 2                          s = 3
  Method    ν = 0.01  ν = 0.03  ν = 0.05   ν = 0.01  ν = 0.03  ν = 0.05   Avg
  IRCNN     26.97     25.86     25.45      25.60     24.72     24.38      25.50
  DPIR      27.79     26.58     25.83      26.05     25.27     24.66      26.03
  GS-PnP    27.88     26.81     26.01      25.97     25.35     24.74      26.13

Second kernel group
  IRCNN     25.41     24.52     24.18      24.94     24.04     23.61      24.45
  DPIR      26.08     24.99     24.39      25.53     24.46     23.80      24.88
  GS-PnP    25.98     25.07     24.53      25.47     24.56     23.92      24.92

Table 8: PSNR (dB) performance of the fastest methods (IRCNN, DPIR and GS-PnP) for image super-resolution on the full CBSD68 dataset with various blur kernels k and noise levels ν, in the same conditions as Table 3. Once again, on CBSD10 (Table 3) and on CBSD68 (Table 8), we observed very similar performance gaps between the compared methods, which again confirms that CBSD10 is large enough to accurately compare the PnP methods.

Figure 7: Super-resolution with various methods on a CBSD10 image degraded with the indicated blur kernel, s = 2 and input noise level ν = 0.05: (b) Observed, (c) RED (21.28 dB), (d) IRCNN (23.15 dB), (e) DPIR (23.33 dB), (f) GS-PnP (23.47 dB). In (g) and (h), we show the evolution of $F(x_k)$ and $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2/\|x_0\|^2$ (log scale) along our algorithm. One can notice that the proposed GS-PnP method recovers more structure in the zoomed area than the competing methods.

Figure 8: Super-resolution with various methods on a CBSD10 image degraded with the indicated blur kernel, s = 3 and input noise level ν = 0.01: (b) Observed, (c) RED (25.40 dB), (d) IRCNN (25.42 dB), (e) DPIR (25.41 dB), (f) GS-PnP (25.46 dB). In (g) and (h), we show the evolution of $F(x_k)$ and $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2/\|x_0\|^2$ (log scale) along our algorithm.

J.3 INPAINTING (WITH NON-DIFFERENTIABLE DATA-FIDELITY TERM)

We now apply our PnP scheme to image inpainting with the degradation model
$$y = Ax, \qquad (43)$$
where $A$ is a diagonal matrix with entries in $\{0, 1\}$. For inpainting, no noise is added to the degraded image. In this context, the data-fidelity term is the indicator function of $A^{-1}(\{y\}) = \{x \mid Ax = y\}$, i.e. $f(x) = \imath_{A^{-1}(\{y\})}(x)$ (which, by definition, equals 0 on $A^{-1}(\{y\})$ and $+\infty$ elsewhere). Despite being non-differentiable, $f$ still verifies the assumptions of Theorems 1 and 2, and convergence is theoretically ensured. The proximal map becomes the orthogonal projection $\Pi_{A^{-1}(\{y\})}$:
$$\mathrm{Prox}_{\tau f}(x) = \Pi_{A^{-1}(\{y\})}(x) = Ay + (\mathrm{Id} - A)x. \qquad (44)$$
In our experiments, the diagonal of $A$ is filled with Bernoulli random variables with parameter p = 0.5. We run our PnP algorithm with σ = 10/255. Given the form of $f$, we do not use the backtracking strategy and keep a fixed stepsize. Even though we do not know the Lipschitz constant of $\nabla g_\sigma$ exactly, we observed in Figure 5 that, for small noise, it was almost always estimated as slightly larger than 1. We thus choose λτ = 1 and empirically confirm convergence with this choice (see Figure 9). The algorithm is initialized with $x_0$ equal to the observed image $y$ in which the masked pixels are filled with the constant value 0.5, and it terminates when the number of iterations exceeds K = 100. We found it useful to run the first 10 iterations of the algorithm at the larger noise level σ = 50/255. As $y$ is noise-free, we found it preferable not to run the last extra gradient pass of Algorithm 1. We show inpainting results on Set3C images in Figure 9. Our PnP algorithm restores the input images with high accuracy, including their small details. Furthermore, convergence of the residual at rate $O(\frac{1}{k})$ is empirically confirmed.

Figure 9: Inpainting results on Set3C with pixels randomly masked with probability p = 0.5 (GS-PnP PSNR: 31.65 dB, 33.65 dB and 33.71 dB for the three images). In the last column, we show the evolution of $\gamma_k = \min_{0 \le i \le k} \|x_{i+1} - x_i\|^2/\|x_0\|^2$ (log scale) along the iterations.
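The projection (44) and the initialization above are one-liners. The following sketch (assuming PyTorch tensors, with `mask` the diagonal of A stored with the image shape, 1 for observed pixels and 0 for masked ones) makes the data-fidelity step of this inpainting experiment explicit.

```python
import torch

def prox_inpainting(x: torch.Tensor, y: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Prox_{tau f}(x) = A y + (Id - A) x: reset observed pixels to y, keep the rest (independent of tau)."""
    return mask * y + (1.0 - mask) * x

def init_inpainting(y: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x_0: observed pixels taken from y, masked pixels filled with the constant value 0.5."""
    return mask * y + (1.0 - mask) * 0.5
```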
J.4 PSNR CONVERGENCE CURVES

We plot in Figure 10 the evolution of the PSNR along the iterations of the PnP algorithm during the experiments of Figure 1 and Figure 2. This illustrates that the minimization of $F$ coincides with the maximization of the PSNR, which supports the relevance of the optimized functional $F = f + \lambda g_\sigma$.

Figure 10: Evolution of the PSNR along the iterations of the algorithm during (a) the deblurring experiment of Figure 1 and (b) the super-resolution experiment of Figure 2. Note that the convergence in PSNR follows the convergence in function value (shown in Figures 1(g) and 2(g)).

J.5 INFLUENCE OF THE PARAMETERS

In this section, we study in more depth the influence of the parameters involved in the GS-PnP algorithm. Three parameters are involved: the stepsize τ, the denoiser level σ and the regularization parameter λ. The stepsize τ is automatically tuned by backtracking and is not tweaked heuristically, contrary to other competing methods based on PnP-HQS. The first regularization parameter, σ, is linked to the denoiser. The second regularization parameter, λ, is introduced so as to target the objective function $f + \lambda g_\sigma$, which is the main purpose of our method. This is a classical formulation of inverse problems, and the trade-off parameter λ is usually tuned manually. Thus, like PnP-HQS, we have two parameters that we are free to tune manually. One additional motivation for keeping both λ and σ as regularization parameters is the possibility of using our PnP algorithm with noise-blind denoisers like DnCNN that do not depend on σ. In practice, in our experiments, we first roughly set σ proportionally to the input noise level ν and then tuned λ more precisely. Note that, for each inverse problem, our parameters λ and σ are fixed for a large variety of kernels, images and noise levels ν; the parameters are not optimized for each image. Figures 11 and 12 respectively plot the average PSNR when deblurring the CBSD10 images for different values of λν and σ/ν, with fixed ν = 0.03. Both parameters control the strength of the regularization.
We observe that λν and σ have a similar influence on the output: for small λν or small σ, the regularization brought by the denoising step is not sufficient to counteract the noise amplification caused by the proximal steps with large τ (recall that, as τ → ∞, $\mathrm{Prox}_{\tau f}$ tends to the pseudo-inverse of A). Conversely, as expected, increasing λ or σ tends to over-smooth the output. On a single image, we also vary the main parameters of both GS-PnP and RED (Figure 13). For a fair comparison, the PSNR is computed on the luminance channel only. This experiment confirms that, when the parameters of both methods are manually optimized, the PSNR results obtained with GS-PnP and RED remain close, as already observed in Table 6.

Figure 11: Influence of the parameter λν for deblurring. Top: average PSNR when deblurring the images of CBSD10, blurred with motion blurs or static blurs, for different values of λν; the other parameters remain unchanged. Bottom: visual results when deblurring starfish with various λν (in the same conditions as Figure 1).

Figure 12: Influence of the parameter σ for deblurring. Top: average PSNR when deblurring the images of CBSD10, blurred with the 10 kernels, for different values of σ/ν, with ν = 0.03; the other parameters remain unchanged. Bottom: visual results when deblurring starfish with various σ/ν, with ν = 0.03 (in the same conditions as Figure 1).

Figure 13: Influence of the parameters σ and λ for RED and GS-PnP when deblurring the single image starfish degraded with a uniform kernel and ν = 7.65/255. For a fair comparison, as in Table 6, the PSNR is calculated on the Y channel only. Recall that the results of Table 6 were obtained with σ = 3.25, λ = 0.02 for RED and σ = 2ν, λ = 0.075 for GS-PnP.

J.6 INFLUENCE OF THE INITIALIZATION

In Figure 14, we examine the robustness of the method to the initialization. As can be seen in this experiment, the output image does not change much, even for relatively large perturbations of the initialization. We thus observe robustness to the initialization, both in terms of visual aspect and of PSNR. We also observed that initializing with a uniform image does not change the output of the algorithm. We suggest that this robustness comes from the first proximal steps on the data-fidelity term (with a large τ), which prevent the algorithm from getting stuck in a poor local minimum. Note that the use of a large τ at the beginning of the algorithm is made possible by the backtracking procedure.

Figure 14: Influence of the initialization $z_0$ on the deblurring result, for $\sigma_{\mathrm{init}} \in \{20/255, 40/255, 60/255\}$. Instead of initializing with the blurred image $z_0 = y$ as done in Section 5.2.1, we set $z_0 = y + \xi_{\sigma_{\mathrm{init}}}$ with $\xi_{\sigma_{\mathrm{init}}}$ an AWGN of standard deviation $\sigma_{\mathrm{init}}$. By increasing the noise level $\sigma_{\mathrm{init}}$, we investigate the robustness of the result to changes in the initialization of the algorithm. Top: PSNR values, along with the values of $\sigma_{\mathrm{init}}$. Bottom: corresponding visual results on starfish with various $\sigma_{\mathrm{init}}$ (in the same conditions as Figure 1). The algorithm is robust to noisy initializations up to a relatively large value of $\sigma_{\mathrm{init}}$.

J.7 CONVERGENCE OF DPIR (ZHANG ET AL., 2021)

In this section, we illustrate that, contrary to our method, the DPIR algorithm is not guaranteed to converge and can even easily diverge.
In Figure 15, we plot the convergence curves of both DPIR and our GS-PnP when deblurring the starfish image degraded with a motion kernel and ν = 0.01. In the original DPIR paper (Zhang et al., 2021), only 8 iterations are used, with decreasing τ and σ: more precisely, σ decreases uniformly in log-scale from 49 to the input noise level ν, and τ is set proportional to σ². In order to study the asymptotic behaviour of the method, we propose two strategies to run DPIR for 1000 iterations: (i) decrease σ from 49 to ν over 1000 iterations instead of 8 (Figure 15, first row); (ii) decrease σ from 49 to ν in 8 iterations, and then keep the last values of σ and τ for the remaining iterations (Figure 15, second row). As illustrated by the plot of $\sum_{i \le k} \|x_{i+1} - x_i\|^2$ (third column), DPIR fails to converge with both strategies, even if the residual $\|x_{k+1} - x_k\|^2$ tends to decrease with the second strategy. This divergence also comes with a loss of restoration performance in terms of PSNR (first column). On the other hand, as theoretically shown in this paper, the residual $\|x_{k+1} - x_k\|^2$ of GS-PnP tends to 0 (it reaches $10^{-13}$ before the activation of backtracking, versus $10^{-4}$ for DPIR) and its series converges.

Figure 15: Convergence of the DPIR algorithm versus convergence of GS-PnP when deblurring the starfish image. The columns show the PSNR, the residual $\|x_{k+1} - x_k\|^2/\|x_0\|^2$ (log scale) and the partial sums $\sum_{i \le k} \|x_{i+1} - x_i\|^2/\|x_0\|^2$. The first two rows display the results obtained with DPIR with the two strategies used for decreasing σ: in the first row, σ is decreased from 49 to ν over 1000 iterations; in the second row, σ is decreased in 8 iterations and then kept fixed for the remaining iterations.
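For reference, the two σ/τ schedules compared in Figure 15 can be written in a few lines. The sketch below is only an illustration of the description above, not DPIR's released code; in particular, the proportionality constant linking τ to σ², and the interpretation of "49" on the /255 intensity scale (to match ν given in [0, 1]), are assumptions.

```python
import numpy as np

def dpir_schedule(nu: float, n_iter: int, n_decay: int, tau_scale: float = 1.0):
    """sigma decreases log-uniformly from 49/255 to nu over n_decay iterations,
    then stays fixed until n_iter; tau is taken proportional to sigma**2."""
    sigmas = np.logspace(np.log10(49.0 / 255.0), np.log10(nu), n_decay)
    sigmas = np.concatenate([sigmas, np.full(n_iter - n_decay, sigmas[-1])])
    taus = tau_scale * sigmas ** 2
    return sigmas, taus

# Strategy (i): decay over all 1000 iterations; strategy (ii): decay in 8 steps, then freeze.
sig_i, tau_i = dpir_schedule(nu=0.01, n_iter=1000, n_decay=1000)
sig_ii, tau_ii = dpir_schedule(nu=0.01, n_iter=1000, n_decay=8)
```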