Published as a conference paper at ICLR 2020

SPARSE CODING WITH GATED LEARNED ISTA

Kailun Wu
Department of Automation, Tsinghua University, Beijing, P.R. China
wukl14@mails.tsinghua.edu.cn, kailun.wukailun@alibaba-inc.com

Yiwen Guo
Bytedance AI Lab, Beijing, P.R. China
guoyiwen.ai@bytedance.com

Ziang Li
Department of Automation, Tsinghua University, Beijing, P.R. China
liza19@mails.tsinghua.edu.cn

Changshui Zhang
Department of Automation, Tsinghua University, Beijing, P.R. China
zcs@mail.tsinghua.edu.cn

ABSTRACT

In this paper, we study the learned iterative shrinkage thresholding algorithm (LISTA) for solving sparse coding problems. Following assumptions made by prior works, we first discover that the code components in its estimations may be lower than expected, i.e., require gains, and, to address this problem, a gated mechanism amenable to theoretical analysis is then introduced. The specific design of the gates is inspired by convergence analyses of the mechanism, and hence its effectiveness can be formally guaranteed. In addition to the gain gates, we further introduce overshoot gates for compensating insufficient step sizes in LISTA. Extensive empirical results confirm our theoretical findings and verify the effectiveness of our method.

1 INTRODUCTION

Sparse coding serves as the foundation of many machine learning applications, e.g., direction-of-arrival estimation (Xu et al., 2012), signal denoising (Elad & Aharon, 2006), and super-resolution imaging (Yang et al., 2010). In general, it aims to recover an inherently sparse vector $x_s \in \mathbb{R}^n$ from an observation $y \in \mathbb{R}^m$ corrupted by a noise vector $\varepsilon \in \mathbb{R}^m$. That is,
$$ y = Ax_s + \varepsilon, \qquad (1) $$
in which $A \in \mathbb{R}^{m \times n}$ is an over-complete basis matrix. Recovering $x_s$, however, is a challenging task, in which the main difficulties are to incorporate the sparsity constraint, which is non-convex, and to further determine the indices of its non-zero elements, i.e., the support of the vector. A reasonable solution to the problem is to use convex functions as surrogates to relax the constraint of sparsity, among which the most classical one is probably the $\ell_1$-norm penalty. Such a problem is carefully studied in Lasso (Tibshirani, 1996), and it can be solved via least angle regression (Efron et al., 2004), the iterative shrinkage and thresholding algorithm (ISTA) (Daubechies et al., 2004), etc. Despite their simplicity, these conventional solvers suffer from critical shortcomings. Taking ISTA as an example, we know that 1) it converges very slowly, with only a sublinear rate (Beck & Teboulle, 2009), and 2) the correlation between any two columns of $A$ should be relatively low. In recent years, deep learning (LeCun et al., 2015) methods have achieved remarkable successes.

Kailun Wu and Yiwen Guo contributed equally to this paper. Kailun Wu, Ziang Li, and Changshui Zhang are with the Institute for Artificial Intelligence (THUAI), the State Key Lab of Intelligent Technologies and Systems, the Beijing National Research Center for Information Science and Technology (BNRist), and the Department of Automation, Tsinghua University. Kailun Wu did this work when he was a Ph.D. student at Tsinghua University; he now works at Alibaba Group. This work is (jointly or partly) funded by the NSFC (Grant No. 61876095) and the Beijing Academy of Artificial Intelligence (BAAI).
Deep neural networks (DNNs) have been proven both effective and efficient in dealing with many tasks, including image classification (He et al., 2016), object detection (Girshick, 2015), speech recognition (Hinton et al., 2012), and also sparse coding (Gregor & LeCun, 2010; Wang et al., 2016; Borgerding et al., 2017; He et al., 2017; Zhang & Ghanem, 2018; Chen et al., 2018; Liu et al., 2019; Sulam et al., 2019). The core idea behind deep learning-based sparse coding is to train DNNs to approximate the optimal sparse code. For instance, the initial work of Gregor and LeCun (2010) takes inspiration from ISTA and develops an approximator named learned ISTA (LISTA), which is structurally similar to a recurrent neural network (RNN). It has been demonstrated both empirically and theoretically that LISTA is superior to ISTA (Wang et al., 2016; Moreau & Bruna, 2017; Giryes et al., 2018; Chen et al., 2018). Nevertheless, it is also uncontroversial that there exists much room for further enhancing it.

In this paper, we delve deeply into the foundation of (L)ISTA and discover possible weaknesses of LISTA. First and foremost, we know from prior art (Chen et al., 2018; Liu et al., 2019) that LISTA tends to learn large enough biases to achieve "no false positive" in the support of generated codes and further ensure linear convergence, and we prove that this tendency, however, also makes the magnitude of the code components lower than that of the ground-truth. That said, there probably exists a requirement of gains in the code estimations. Second, regarding the optimization procedure of ISTA as minimizing an upper bound of its objective function at each step, we conjecture that the element-wise update of (L)ISTA normally lags behind the optimal solution, which suggests that it requires overshoots to reach the optimum, just like what has been suggested in fast ISTA (FISTA) (Beck & Teboulle, 2009) and learned FISTA (LFISTA) (Moreau & Bruna, 2017). In this paper, our main contributions are summarized as follows:

- We discover weaknesses of LISTA by theoretically analyzing its optimization procedure, and to mitigate them we introduce gain gates and overshoot gates, akin to the update gate and reset gate mechanisms in the gated recurrent unit (GRU) (Cho et al., 2014).
- We provide convergence analyses for LISTA (with or without gates), which further give rise to conditions under which the performance of our method with gain gates can be guaranteed. A practical case is considered, where the assumption of "no false positive" is relaxed. Insightful expressions for the gates are presented.
- In comparison with state-of-the-art sparse coding networks (not limited to previous extensions of LISTA), our method achieves superior performance. It also applies to variants of LISTA, e.g., LFISTA (Moreau & Bruna, 2017) and ALISTA (Liu et al., 2019).

Notations: In this paper, unless otherwise clarified, vectors and matrices are denoted by lowercase and uppercase characters, respectively. For vectors/matrices originally introduced without any subscript, adding a subscript (e.g., $i$) indicates the element/column at the corresponding position. For instance, for $x \in \mathbb{R}^n$, $x_i$ represents the $i$-th element of the vector, and $W_{:,i}$ and $W_{i,:}$ denote the $i$-th column and row of a matrix $W$, respectively. For vectors introduced with subscripts already, e.g., $x_s$, we use $(x_s)_i$ to denote the $i$-th element. The operator $\odot$ is used to indicate element-wise multiplication of two vectors.
The support of a vector is denoted as $\mathrm{supp}(x) := \{i \mid x_i \neq 0\}$. We use $\sup_{x_s}$ as the simplified form of $\sup_{x_s \in \mathcal{X}(B,s,0)}$; see Assumption 1 for the definition of $\mathcal{X}(B, s, 0)$.

2 BACKGROUND

In general, sparse coding solves a problem that can be formulated as
$$ \min_x\ f(x, y) + \lambda r(x), \qquad (2) $$
in which $f(x, y)$ calculates the residual of approximating $y$ using a linear combination of column-wise features in $A$. The function $f(x, y)$ is convex with respect to $x$ in general. In particular, if $\varepsilon$ is a Gaussian vector, then it should be $f(x, y) = \frac{1}{2}\|Ax - y\|_2^2$. The term $\lambda r(x)$ serves as a regularizer for sparsity, and we have $r(x) = \|x\|_1$ in Lasso. As mentioned, a variety of algorithms can be applied to solve the problem, and our focus in this paper is (L)ISTA.

We first revisit the optimization procedure of ISTA, which is the foundation of LISTA as well. Given $y$, let us introduce a scalar $\gamma > 0$ that fulfills $\gamma I - \nabla_x^2 f(x, y) \succeq 0,\ \forall x$; then ISTA can be considered as optimizing an upper bound of the objective function obtained via Taylor expansion. To be more specific, for any presumed $x^{(t)}$, we have
$$ f(x, y) + \lambda r(x) \leq f(x^{(t)}, y) + (x - x^{(t)})^\top \nabla_x f(x^{(t)}) + \frac{\gamma}{2}\|x - x^{(t)}\|^2 + \lambda r(x). \qquad (3) $$
By substituting $r(x)$ with $\|x\|_1$ and optimizing the bound in an element-wise manner, we can easily get the one-step update rule that zeros the gradient based on $x^{(t)}$. It is $x^{(0)} = 0$ and
$$ x^{(t+1)} = s_{\lambda/\gamma}\big(x^{(t)} - \nabla_x f(x^{(t)})/\gamma\big), \quad t \geq 0, \qquad (4) $$
in which $s_b(x) := \mathrm{sign}(x)(|x| - b)_+$ is a shrinking function and $(\cdot)_+$ is a rectified linear unit (ReLU) calculating $\max\{0, \cdot\}$. For Gaussian noise, the formulation reduces to
$$ x^{(t+1)} = s_{\lambda/\gamma}\big(x^{(t)} + A^\top(y - Ax^{(t)})/\gamma\big). \qquad (5) $$
The updates shown in Eq. (4) and (5) can be performed iteratively until convergence. However, the convergence of ISTA (along with some other conventional solvers) is known to be slow, and it has been shown that DNNs can be utilized to accelerate the procedure. Many researchers have explored this idea since the initial work of Gregor and LeCun (i.e., LISTA). For LISTA, they design deep architectures following the main procedure of ISTA yet learn parameters in an end-to-end manner from data (Gregor & LeCun, 2010; Hershey et al., 2014). The inference process of LISTA is similar to that of an RNN and can be formulated as $x^{(0)} = 0$ and
$$ x^{(t+1)} = s_{b^{(t)}}\big(W^{(t)}x^{(t)} + U^{(t)}y\big), \quad t = 0, \ldots, d-1, \qquad (6) $$
where $\Theta = \{U^{(t)}, W^{(t)}, b^{(t)}\}_{t=0,1,\ldots,d-1}$ is the set of learnable parameters. Some works (Xin et al., 2016; Chen et al., 2018) have proved that $W^{(t)}$ and $U^{(t)}$ should satisfy the constraint $W^{(t)} = I - U^{(t)}A$, such that
$$ x^{(t+1)} = s_{b^{(t)}}\big(x^{(t)} + U^{(t)}(y - Ax^{(t)})\big), \quad t = 0, \ldots, d-1. \qquad (7) $$
The parameters in $\Theta$ are normally learned from a set of training samples by minimizing the difference between the final code estimations and the ground-truth. In this paper, our main assumption for theoretical analyses follows those of prior works (Chen et al., 2018; Liu et al., 2019) in a noiseless case, and noisy cases will be considered in the experiments.

Assumption 1. The sparse vector $x_s$ and noise vector $\varepsilon$ are sampled from a set $\mathcal{X}(B, s, 0)$ fulfilling:
$$ \mathcal{X}(B, s, 0) := \{(x, \varepsilon) \mid |x_i| \leq B\ \forall i,\ \varepsilon = 0,\ \|x\|_0 \leq s\}. $$
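To make the ISTA update in Eq. (4)-(5) and the unrolled LISTA recurrence of Eq. (6)-(7) concrete, below is a minimal NumPy sketch, assuming $f(x, y) = \frac{1}{2}\|Ax - y\|_2^2$ and the coupled form $W^{(t)} = I - U^{(t)}A$. The function names are ours for illustration, not taken from the paper's released code.

```python
import numpy as np

def soft_threshold(x, b):
    """Shrinkage s_b(x) = sign(x) * max(|x| - b, 0), cf. Eq. (4)."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

def ista(A, y, lam, gamma, num_iters=100):
    """Plain ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1, Eq. (5)."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = soft_threshold(x + A.T @ (y - A @ x) / gamma, lam / gamma)
    return x

def lista_forward(A, y, U_list, b_list):
    """Unrolled LISTA with the coupling W^(t) = I - U^(t) A, Eq. (7).
    U_list[t] is a learned n x m matrix; b_list[t] is the learned threshold."""
    x = np.zeros(A.shape[1])
    for U, b in zip(U_list, b_list):
        x = soft_threshold(x + U @ (y - A @ x), b)
    return x
```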
3 SPARSE CODING WITH GAIN GATES AND OVERSHOOT GATES

In this section, we will introduce the advocated gain gates and overshoot gates. Along with thorough discussions of the motivations, their formulations are provided in Sections 3.1 and 3.2, respectively. Figure 1 summarizes the inference process of the standard LISTA and two evolved versions with our gates incorporated. Proofs of all our theoretical results are deferred to the appendix.

Figure 1: The inference process of the standard LISTA and evolved versions with our gates.

3.1 SPARSE CODING WITH GAIN GATES

Recent works have shown linear convergence of LISTA (Chen et al., 2018; Liu et al., 2019). In order to guarantee the convergence, it is also demonstrated that the value of the bias terms should be large enough to eliminate all false positives in the support of the generated codes. However, this may lead to an issue that the magnitude of the generated code components in LISTA must be smaller than or at most equal to those of the ground-truth. Our result in Proposition 1 makes this formal. For clarity of the result, we would like to introduce the following definition first.

Definition 1. (Liu et al., 2019) Given a matrix $A \in \mathbb{R}^{m \times n}$, its generalized mutual coherence is:
$$ \mu(A) := \inf_{W \in \mathbb{R}^{n \times m},\ W_{i,:}A_{:,i}=1,\ 1 \leq i \leq n}\ \max_{i \neq j,\ 1 \leq i,j \leq n} |W_{i,:}A_{:,j}|. \qquad (8) $$
We let $\mathcal{W}(A)$ denote the set of all matrices that achieve the generalized mutual coherence $\mu(A)$, which means:
$$ \mathcal{W}(A) := \Big\{ W \ \Big|\ \max_{i \neq j,\ 1 \leq i,j \leq n} |W_{i,:}A_{:,j}| = \mu(A),\ W_{i,:}A_{:,i} = 1,\ \forall i \Big\}. \qquad (9) $$

Proposition 1. (Requirement of gains). With $U^{(t)} \in \mathcal{W}(A)$ and $W^{(t)} = I - U^{(t)}A$, if $b^{(t)} = \mu(A)\sup_{x_s}\|x^{(t)} - x_s\|_1$ is achieved in LISTA to guarantee "no false positive" (i.e., $\mathrm{supp}(x^{(t)}) \subseteq \mathrm{supp}(x_s)$) and further linear convergence (i.e., $\|x^{(t)} - x_s\|_2 \leq sB\exp(ct)$, in which $c = \log((2s-1)\mu(A))$), then we have for the estimation $|x^{(t)}_i| \leq |(x_s)_i|$ and $x^{(t)}_i(x_s)_i \geq 0$, $\forall i \in \mathrm{supp}(x_s)$.

Given Proposition 1 as evidence of a potential weakness of LISTA, we believe that if the code components can be enlarged appropriately, then the estimation at each step would be closer to $x_s$, and the convergence of LISTA would be further improved, which inspires us to design a gate to enlarge the generated code components. Such a gate is named a gain gate, and it acts on the input to the current estimation, akin to a reset gate in GRU (Cho et al., 2014):
$$ x^{(t+1)} = s_{b^{(t)}}\big(W^{(t)}(g_t(x^{(t)}, y|\Lambda^{(t)}_g) \odot x^{(t)}) + U^{(t)}y\big), \qquad (10) $$
in which the gate function $g_t(\cdot, \cdot|\Lambda^{(t)}_g)$ outputs an $n$-dimensional vector, and $\Lambda^{(t)}_g$ is the set of its learnable parameters.

In the original implementation of LISTA, the output of each layer is obtained by calculating Eq. (6) iteratively. It has been proven that the estimation $x^{(t)}$ ultimately converges to the ground-truth $x_s$ (as $t \to \infty$) only if the condition $(W^{(t)} - (I - U^{(t)}A)) \to 0$ holds. That said, it is suggested that $U^{(t)}$ and $W^{(t)}$ are entangled in the end. Yet, with our gated mechanism, the update rule in neural networks has been modified into Eq. (10), making it unclear whether the convergence is guaranteed similarly or not. To figure this out, we perform theoretical analyses in depth, which will further provide guidance for the gate design. We are going to explore: whether the learnable matrices are still entangled as in LISTA, and, to encourage fast convergence, what properties the gate function should satisfy. Theorems 1 and 2 give some answers to these questions, and they are based on the same assumptions as for Proposition 1.

Theorem 1. If the principal minors of $W^{(t)}$ of order $s$ have full rank, then, for a gate function bounded from both above and below, $x_s$ is the fixed point of Eq. (10) only if
$$ \mathrm{diag}(g_t(x^{(t)}, y|\Lambda^{(t)}_g)) \to D \quad \text{and} \quad W^{(t)}D - (I - U^{(t)}A) \to 0, \quad \text{as } t \to \infty, \qquad (11) $$
in which $D$ is an $n \times n$ constant diagonal matrix and the function $\mathrm{diag}(\cdot)$ creates a diagonal matrix with the elements of its input on the main diagonal.
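As a companion to Eq. (10), the following is a minimal sketch of one gain-gated layer, again assuming the coupling $W^{(t)} = I - U^{(t)}A$. The `gate` argument stands in for any $g_t(\cdot, \cdot|\Lambda^{(t)}_g)$ and is an illustrative abstraction rather than the paper's implementation.

```python
import numpy as np

def soft_threshold(x, b):
    # Shrinkage s_b(x) = sign(x) * max(|x| - b, 0)
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

def gated_lista_step(x, y, A, U, b, gate):
    """One layer of Eq. (10): x <- s_b(W (g ⊙ x) + U y) with W = I - U A,
    which equals s_b(g ⊙ x + U (y - A (g ⊙ x)))."""
    g = gate(x, y)          # n-dimensional gain vector, element-wise >= 1
    gx = g * x
    return soft_threshold(gx + U @ (y - A @ gx), b)
```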
From Theorem 1 we can equivalently have $(\tilde{W}^{(t)} - (I - U^{(t)}A)) \to 0$ by defining $\tilde{W}^{(t)} := W^{(t)}D$, which means the learnable matrices are similarly entangled as in the standard LISTA. Besides, we know that as the number of layers increases, each introduced gain gate should ultimately converge to a constant (diagonal) matrix $D$ to guarantee performance. Then, if $W^{(t)} \to I - U^{(t)}A$, the gain gate function converges to an identity mapping as $t \to \infty$, and vice versa. This inspires us to split the gate function into an identity part and a residual part, and we thus advocate, for each index $i$ of the vector, that the $i$-th element of the gain gate be
$$ g_t(x^{(t)}, y|\Lambda^{(t)}_g)_i = 1 + \kappa_t(x^{(t)}, y|\Lambda^{(t)}_g)_i \quad \text{and} \quad \kappa_t(x^{(t)}, y|\Lambda^{(t)}_g)_i \geq 0, \qquad (12) $$
in which $\kappa_t(x^{(t)}, y|\Lambda^{(t)}_g)_i$ is the $i$-th element of $\kappa_t(x^{(t)}, y|\Lambda^{(t)}_g)$, and it should decrease as $t$ increases, in order to guarantee convergence in Eq. (11). Let us further study the convergence rate of LISTA equipped with such gain gates. For clarity, we introduce another condition on the function before moving to more details:
$$ \kappa_t(x^{(t)}, y|\Lambda^{(t)}_g)_i < 2b^{(t-1)}/|x^{(t)}_i|. \qquad (13) $$
We present theoretical results as follows on the basis of Proposition 1, i.e., we still have $U^{(t)} \in \mathcal{W}(A)$, $W^{(t)} = I - U^{(t)}A$, and Assumption 1, but the requirement for $b^{(t)}$ is different.

Theorem 2. If $b^{(t)} = \mu(A)\sup_{x_s}\|x_s - x^{(t)} \odot g_t(x^{(t)}, y|\Lambda^{(t)}_g)\|_1$ is achieved, then, following the update rule in Eq. (10), if the conditions in Eq. (12) and (13) hold for the gate function, there will be
$$ \|x^{(t)} - x_s\|_2 \leq sB\exp\Big(\sum_{i=1}^{t-1} c_i + c\Big), \qquad (14) $$
in which $c = \log((2s-1)\mu(A))$, $c_i = c$ if $i \leq \log\big(\tfrac{sB}{\|x_s\|_1}\big)/\log\big(\tfrac{1}{(2s-1)\mu(A)}\big)$, and $c_i < c$ otherwise.

Theorem 2 presents an upper bound of $\|x^{(t)} - x_s\|_2$ for LISTA with gain gates, and it shows that, so long as gates satisfying the conditions in Eq. (12) and (13) are introduced, the convergence factor $c + \sum_i c_i$ of our gated LISTA is smaller in comparison with that of the standard LISTA (which is $ct$; see Proposition 1 and Chen et al.'s work (2018)). By consolidating all these theoretical cues, we further give principled expressions for the gate function. One may expect to endow the gates with some learning capacity, thus we let
$$ g_t(x^{(t)}, y|\Lambda^{(t)}_g) = 1 + \kappa_t(x^{(t)}, y|\Lambda^{(t)}_g) = 1 + \mu_t b^{(t-1)} f_t(x^{(t)}|\nu_t), \qquad (15) $$
in which $\mu_t \in \mathbb{R}$ is a parameter to be learned, $b^{(t-1)}$ is the threshold parameter of the $(t-1)$-th layer, and $f_t(x^{(t)}|\nu_t)$ is a newly introduced function constrained not to be greater than $2/|x^{(t)}|$. We are going to evaluate different choices for the function $f_t(x^{(t)})$ in experiments, e.g.,

the piece-wise linear function: $f_t(x^{(t)}|\nu_t) = \mathrm{ReLU}(1 - \mathrm{ReLU}(\nu_t|x^{(t)}|))$,
the inverse proportional function: $f_t(x^{(t)}|\nu_t) = 1/(\nu_t|x^{(t)}| + \epsilon)$,
the exponential function: $f_t(x^{(t)}|\nu_t) = \exp(-\nu_t|x^{(t)}|)$,

in which $\nu_t \in \mathbb{R}$ is a parameter to be learned, and $\epsilon$ is a tiny positive scalar introduced to avoid division by zero. All the learnable parameters in a gain gate are thus collected as $\Lambda^{(t)}_g = \{\mu_t, \nu_t\}$.
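The three candidate choices of $f_t$ above can be written compactly as follows. This is a sketch of Eq. (15) with scalar learnable parameters $\mu_t$ and $\nu_t$; the keyword names (`piecewise`, `inverse`, `exponential`) are ours for illustration.

```python
import numpy as np

def gain_gate(x, b_prev, mu, nu, kind="piecewise", eps=1e-6):
    """g_t(x) = 1 + mu * b_prev * f_t(x | nu), Eq. (15), for the three
    candidate f_t functions discussed above."""
    ax = np.abs(x)
    if kind == "piecewise":          # ReLU-based piece-wise linear f_t
        f = np.maximum(1.0 - np.maximum(nu * ax, 0.0), 0.0)
    elif kind == "inverse":          # inverse proportional f_t
        f = 1.0 / (nu * ax + eps)
    elif kind == "exponential":      # exponential f_t
        f = np.exp(-nu * ax)
    else:
        raise ValueError(kind)
    return 1.0 + mu * b_prev * f
```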
3.1.1 NO FALSE POSITIVE?

Our previous theoretical results show that the performance of LISTA can be improved by using a gain gate, as long as the gate function satisfies the conditions in Eq. (12) and (13) and no false positive is encountered. However, this is not always true in practice. Our experimental results also show that when the inverse proportional function is adopted for the gain gates in lower layers of LISTA, the performance of our gated LISTA may even degrade. We conjecture that such contradiction to the theoretical results may be owing to impractical assumptions.

In this subsection, we try to relax the assumption about "no false positive", and we further find that a tighter bound can be achieved with a more reasonable assumption instead. Through the theoretical analyses that follow, we also demonstrate that the inverse proportional gain function is better adopted only in higher layers. For clarity of the results, we would like to introduce the following definition first.

Definition 2. Given a model with $\Theta$, in which $b^{(t)} = \Gamma\mu(A)\sup_{x_s}\|x^{(t)} \odot g_t(x^{(t)}, y|\Lambda^{(t)}_g) - x_s\|_1$, we introduce $\omega_{t+1}(k_{t+1}|\Theta)$ to characterize its relationship with the false positive rate, which is
$$ \omega_{t+1}(k_{t+1}|\Theta) = \sup_{x_s,\ |\mathrm{supp}(\check{x}^{(t+1)}) \cup \mathrm{supp}(x_s)| \leq |\mathrm{supp}(x_s)| + k_{t+1}} \Gamma, $$
in which $\check{x}^{(t+1)} := s_{b^{(t)}}\big(W^{(t)}(x^{(t)} \odot g_t(x^{(t)}, y|\Lambda^{(t)}_g) - x_s)\big)$, and $k_{t+1} \geq 0$ is the desired maximal number of false positives of $x^{(t+1)}$.

The above definition applies to both the standard LISTA and LISTA with gain gates (we can let the gate function be an identity function to obtain a standard LISTA). We first analyze the convergence of LISTA without gates. We present theoretical results as follows on the basis of similar assumptions (including Assumption 1, $U^{(t)} \in \mathcal{W}(A)$, and $W^{(t)} = I - U^{(t)}A$), but with a requirement for $b^{(t)}$ different from Proposition 1.

Theorem 3. If $b^{(t)} = \omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|x^{(t)} - x_s\|_1$ is achieved, and there exists $0 < k^{(t)}_0 < s$ such that $\omega_t(k^{(t)}_0|\Theta) < 1 - 1/(s - k^{(t)}_0)$, then there exist false positives with $0 < k_t < s$ and
$$ \|x^{(t)} - x_s\|_2 \leq sB\exp\Big(\sum_{i=1}^{t} c'_i\Big), $$
in which $c'_i < \log((2s-1)\mu(A))$.

It can be seen that when we relax the assumption about "no false positive" and further reduce the value of the threshold $b^{(t)}$, the error bound of LISTA becomes even lower. Obviously, the previous bound of LISTA with gain gates in Theorem 2 is not necessarily lower than the tighter bound of a standard LISTA in Theorem 3, which well explains the contradiction between theoretical and empirical results. Here we re-derive the error bound of our gated LISTA with the inverse proportional function in the following theorem. Note that we still have $U^{(t)} \in \mathcal{W}(A)$, $W^{(t)} = I - U^{(t)}A$, and Assumption 1.

Theorem 4. Suppose that $\min_{i \in \mathrm{supp}(x_s)}|(x_s)_i| \geq \sigma > 0$. If $b^{(t)} = \omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|x^{(t)} \odot g_t(x^{(t)}, y|\Lambda^{(t)}_g) - x_s\|_1$ is achieved and there exists $0 < k^{(t)}_0 < s$ such that $\omega_t(k^{(t)}_0|\Theta) < 1 - 1/(s - k^{(t)}_0)$, then
$$ \|x^{(t)} - x_s\|_2 \leq sB\exp\Big(\sum_{i=1}^{t-1} c''_i + c''_t\Big), $$
in which $c''_t < \log((2s-1)\mu(A))$. Let $t_0 = \log\big(\tfrac{sB}{\sigma}\big)/\log\big(\tfrac{1}{(2s-1)\mu(A)}\big)$. If the scaling factor $\mu_i$ of the gate has $\mu_i = 0$ for $i \leq t_0$ and $0 < k_i < s$, then $c''_i = c'_i$, and if $1 - \omega_i(s|\Theta) < \mu_i \leq 1$ for $i > t_0$ and $k_i = 0$, then $c''_i < c'_i$.

We can conclude from Theorem 4 that: a) a gain gate expressed by the inverse proportional function should be applied to deeper layers in LISTA rather than lower layers, and b) when using the function, there indeed exists no false positive (i.e., $k_i = 0$) in deeper layers. We follow such guidelines in the implementation of our gated LISTA. In addition, we observe that, unlike the inverse proportional function, the other considered functions show consistent performance gains on both lower and higher layers, hence we attempt to utilize them on lower layers in alliance with the inverse-proportional-function-powered gain gates on the other layers. In practice, we choose the ReLU-based piece-wise linear function, and it is uniformly applied to the first 10 layers. We will empirically compare different choices of the gain gate functions in Section 4.1.
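The layer-wise "alliance" of gate functions described above can be sketched as follows, assuming the ReLU-based piece-wise linear gate on the first 10 layers and the inverse proportional gate on the deeper ones; it reuses `soft_threshold` and `gain_gate` from the earlier sketches, and everything except the split at 10 layers (argument names, the lack of gain at layer 0 since $x^{(0)} = 0$) is our illustrative choice.

```python
import numpy as np

def gate_schedule(d, first_piecewise=10):
    """Gate kind per layer: piece-wise linear first, inverse proportional after."""
    return ["piecewise" if t < first_piecewise else "inverse" for t in range(d)]

def glista_forward(A, y, U_list, b_list, mus, nus, first_piecewise=10):
    """Unrolled gain-gated LISTA (Eq. (10) with Eq. (15)) using the
    layer-wise alliance of gain gate functions described above."""
    x = np.zeros(A.shape[1])
    kinds = gate_schedule(len(U_list), first_piecewise)
    for t, (U, b) in enumerate(zip(U_list, b_list)):
        if t == 0:
            g = np.ones_like(x)      # no gain on the initial estimation x^(0) = 0
        else:
            g = gain_gate(x, b_list[t - 1], mus[t], nus[t], kind=kinds[t])
        gx = g * x
        x = soft_threshold(gx + U @ (y - A @ gx), b)
    return x
```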
3.2 SPARSE CODING WITH OVERSHOOT GATES

Unlike the gain gates, which are incorporated before performing the estimation at each step, the overshoot gates act more like adjustments to the outputs, which can be viewed as learnable boosts:
$$ \tilde{x}^{(t+1)} = s_{b^{(t)}}\big(W^{(t)}x^{(t)} + U^{(t)}y\big), \qquad x^{(t+1)} = o_t(x^{(t)}, y|\Lambda^{(t)}_o) \odot \tilde{x}^{(t+1)} + (1 - o_t(x^{(t)}, y|\Lambda^{(t)}_o)) \odot x^{(t)}. \qquad (17) $$
The gate function $o_t(\cdot, \cdot|\Lambda^{(t)}_o) : \{\mathbb{R}^n, \mathbb{R}^m\} \to \mathbb{R}^n$ outputs an $n$-dimensional vector and $\Lambda^{(t)}_o$ collects all the trainable parameters of the function, akin to a dedicated GRU update gate (Cho et al., 2014). Our motivation comes from analyses of ISTA, whose update can be viewed as $x^{(t)} + \eta(x^{(t+1)} - x^{(t)})$, in which $\eta = 1$ is a constant step size. We argue that $\eta = 1$ may not be the most suitable choice, and the following proposition makes this formal. It theoretically analyzes the update rule of ISTA and $\eta^* := \arg\min_\eta f(\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}, y) + \lambda\|\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}\|_1$.

Proposition 2. (Requirement of overshoots). For $\min_x f(x, y) + \lambda\|x\|_1$, in which $f(x, y)$ is convex with respect to $x$ and $\gamma I - \nabla^2_x f(x) \succeq 0$ holds for all $x$, if the update rule in Eq. (4) is adopted, then we have $\eta^* \geq 1$. In addition, if $\mathrm{supp}(x^{(t)}) \subseteq \mathrm{supp}(x^{(t+1)})$, then we further have $\eta^* > 1$.

See also Figure 2 for an illustration of the issue with $\eta = 1$ as concerned. Since the optimization procedure of ISTA inspires the network architecture of LISTA, the theoretical result in Proposition 2, which requires a boost in $\eta$ for superior performance, also inspires us to design specific overshoot gates for LISTA. Having noticed that the essential principle we have obtained is to let $\eta \geq 1$ (or $\eta > 1$), we may expect the output of the gate function to be greater than or at least equal to 1. To achieve this goal, we can try different expressions for it, e.g.,

the sigmoid-based function: $o_t(x^{(t)}, y|\Lambda^{(t)}_o) = 1 + a_o\,\sigma(W_o x^{(t)} + U_o y)$,
the inverse-proportional-based function: $o_t(x^{(t)}, y|\Lambda^{(t)}_o) = 1 + \dfrac{a_o}{|\tilde{x}^{(t+1)} - x^{(t)}| + \epsilon}$,

with $\sigma(\cdot)$ being the sigmoid function, $\Lambda^{(t)}_o = \{a_o, W_o, U_o\}$ and $\Lambda^{(t)}_o = \{a_o\}$ for the two types of functions respectively, and $\epsilon$ being a tiny positive constant introduced to avoid division by zero. The principle of our overshoot gate is similar to that of some momentum-based methods, e.g., FISTA (Beck & Teboulle, 2009) and LFISTA (Moreau & Bruna, 2017). The fundamental difference between these methods and ours is that (L)FISTA considers the scaling factor in a momentum term to be independent of the current inputs (including the previous estimation and $y$), i.e., time- or at least input-invariant, while the output of the overshoot gate is a function of both the previous estimation and $y$, hence being time-and-input-varying. The design of our overshoot gate may endow the sparse coding network with higher capability to learn from its inputs. Experimental comparisons in Section 7 in the appendix confirm the superiority of our method. We also note that our convergence analyses in Section 3.1 generalize to $\eta > 1$ cases with a constant $\eta$, i.e., linear convergence can still be guaranteed, but the asymptotic behavior with learnable and adaptive overshoots should be further explored in future studies.

Figure 2: The derivative function (illustrated in blue) of $f(x, y) + \lambda r(x)$, in which $r(x) = \|x\|_1$, is monotonic owing to the convexity of $f(x, y)$ and $r(x)$, and its output should be consistently smaller in absolute value than the derivative (illustrated in orange) of the upper bound. Let $x^*$ be the optimal solution to the problem; then we know from the figure that the estimation with a standard ISTA update (i.e., $\eta = 1$) normally lags behind it.
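Below is a minimal sketch of one LISTA layer followed by the overshoot gate of Eq. (17), covering both expressions above; `a_o`, `Wo`, and `Uo` mirror the learnable quantities in $\Lambda^{(t)}_o$, while the function layout and default arguments are ours for illustration.

```python
import numpy as np

def soft_threshold(x, b):
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

def overshoot_step(x, y, A, U, b, a_o, Wo=None, Uo=None, kind="inverse", eps=1e-6):
    """x_tilde = s_b(W x + U y) with W = I - U A, then Eq. (17):
    x_new = o ⊙ x_tilde + (1 - o) ⊙ x, with o >= 1 element-wise."""
    x_tilde = soft_threshold(x + U @ (y - A @ x), b)
    if kind == "sigmoid":                       # sigmoid-based overshoot gate
        o = 1.0 + a_o / (1.0 + np.exp(-(Wo @ x + Uo @ y)))
    else:                                       # inverse-proportional-based gate
        o = 1.0 + a_o / (np.abs(x_tilde - x) + eps)
    return o * x_tilde + (1.0 - o) * x
```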
4 EXPERIMENTS

In this section, we perform experiments to confirm our theoretical results and to evaluate the performance of our gated sparse coding networks. Validation of our theoretical results is performed on synthetic data, and the performance of our method in sparse coding is tested on both synthetic and real data. We set $m = 250$, $n = 500$, and sample the elements of the dictionary matrix $A$ randomly from a standard Gaussian distribution in simulations. The positions of the non-zero elements of the sparse vector $x_s$ are determined by Bernoulli sampling with a probability of 0.1 (which means approximately 90% of the elements are set to zero). Different noise levels and condition numbers are considered in the sparse coding simulations. We randomly synthesize in-stream $x_s$ and $\varepsilon$ to obtain $y$ for training, and we use two extra sets consisting of 1000 samples each as the validation and test sets, just like in prior works (Chen et al., 2018; Liu et al., 2019; Borgerding et al., 2017).¹

For the proposed gated LISTA and other deep learning-based methods, we set $d = 16$ and let $\{b^{(t)}\}$ not be shared between different layers under all circumstances. The weight matrices $\{W^{(t)}, U^{(t)}\}$ are not shared either in our method, and the coupled constraints $W^{(t)} = I - U^{(t)}A$, $\forall t$, are imposed. For all gates, $\nu_t$ is initialized as 1.0, and we let the initial value of $\mu_t$ in the inverse-proportional-function-powered gain gate be 1.0 too, since Eq. (12) and (13) indicate $0 \leq \mu_t \leq 2$. Other learnable parameters in our gates are uniformly initialized as 5.0 according to the suggested ranges of the gates. The training batch size is 64. We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The hyper-parameters are tuned on the validation set and fixed for all our experiments in the sequel. Our training follows that of Chen et al. (2018). That said, the sparse coding network is trained progressively to update more layers, and we cut the learning rate for the currently optimized layers when no decrease in the validation loss can be observed for 4000 iterations, with a base learning rate of 0.0005. Training on the current layers stops when the validation loss does not decrease any more with the learning rate cut to 0.00001. More details are explained in Section 8 in the appendix. Our training objective for a network with $d$ intermediate update steps is
$$ \min_{\Theta}\ \mathbb{E}\,\|x^{(d)} - x_s\|_2^2, \qquad (19) $$
in which $\Theta = \{W^{(t)}, U^{(t)}, b^{(t)}\}_{t=0,\ldots,d-1} \cup \Lambda^{(0)} \cup \ldots \cup \Lambda^{(d-1)}$ is the set of all learnable parameters in the sparse coding network that generates $x^{(d)}$ given $y$. Note that, in comparison with the parameter set of a standard LISTA, it also contains the parameters in the gate functions. In practice, we are given a set of training samples and opt to minimize an empirical loss instead of the one in Eq. (19).

¹The core code of this paper can be found on GitHub at https://github.com/wukailun/GLISTA/.

Figure 3: Average results confirming our Proposition 1 and Theorem 2: (a) FPR vs. index of layers, (b) ratio of lower components, and (c) NMSE vs. number of layers. Intermediate outputs of a single network over five runs are reported in (a) and (b), while networks with varying depth are evaluated in (c), over five runs as well. The curves are plotted from layer index 1 in (a) and (b), since the two metrics (i.e., false positive rate and ratio of generated non-zero code components that require gains) do not make much sense for the initial code estimation (i.e., 0).

Figure 4: Average results over the whole test set confirm our Theorem 1, in which intermediate outputs of a single network are reported. It can be seen that: (a) the gate output converges to 1, and (b) LISTA with our gain gates converges as expected. The curves are plotted from layer index 1 in (a) since no gain is imposed on the initial code estimation (i.e., 0).
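For completeness, here is a sketch of the synthetic problem generation described above ($m = 250$, $n = 500$, Gaussian dictionary, Bernoulli(0.1) supports, Gaussian noise at a chosen SNR). The column normalization of $A$ and the exact noise-scaling convention are assumptions of this sketch rather than details stated in the text.

```python
import numpy as np

def sample_problem(m=250, n=500, p_nonzero=0.1, snr_db=None, rng=None):
    """Draw (A, x_s, y) with y = A x_s + eps for the simulation setting above."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # column-normalized dictionary (an assumption)
    support = rng.random(n) < p_nonzero              # roughly 10% non-zero positions
    x_s = rng.standard_normal(n) * support
    y = A @ x_s
    if snr_db is not None:                           # add Gaussian noise at the requested SNR
        noise = rng.standard_normal(m)
        noise *= np.linalg.norm(y) / (np.linalg.norm(noise) * 10.0 ** (snr_db / 20.0))
        y = y + noise
    return A, x_s, y
```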
Our evaluation metric for sparse coding is the normalized MSE (NMSE) (Chen et al., 2018):
$$ \mathrm{NMSE}(x, x_s) = 10\log_{10}\big(\|x - x_s\|_2^2 / \|x_s\|_2^2\big). \qquad (20) $$

4.1 SIMULATION EXPERIMENTS

4.1.1 VALIDATION OF THEORETICAL RESULTS

Validation of Proposition 1: We first confirm Proposition 1. In order to ensure that LISTA fulfills the assumption about "no false positive", we introduce an auxiliary loss into the learning objective:
$$ \sum_{j \notin \mathrm{supp}(x_s)} |x^{(t)}_j|. \qquad (21) $$
We formally introduce the false positive rate (FPR) as
$$ \mathrm{FPR} = \frac{|\mathrm{supp}(x^{(t)}) \cup \mathrm{supp}(x_s)| - |\mathrm{supp}(x_s)|}{|\mathrm{supp}(x^{(t)})|} $$
and try to approach "no false positive" (i.e., LISTA-nfp) by setting the weight of Eq. (21) to $\lambda = 5.0$ in the experiment.² Check Figure 3 for an illustrative comparison between different models: we see that LISTA-nfp achieves almost no false positive in practice in Figure 3(a), but its convergence is slower, as demonstrated in Figure 3(c), which is consistent with our result in Theorem 3. In addition, we also see in Figure 3(b) that without "false positive", the code components in LISTA estimations are almost always less than those of the ground-truth, which confirms our Proposition 1.

Validation of Theorem 1: We aim to calculate $\|W^{(t)}D - (I - U^{(t)}A)\|_2$ using a gated LISTA with the introduced ReLU-based piece-wise linear gain gate function.³ To accomplish this task, we need to first evaluate the output of our gate function, which is expected to converge to 1 as shown in the theorem. We show that such a trend indeed exists in Figure 4(a). Consequently, the matrix $D$ is supposed to be an identity matrix in the end, and we can calculate $\|W^{(t)} - (I - U^{(t)}A)\|_2$ as a surrogate. In Figure 4(b), it converges to zero in the end, and the results confirm the theorem.

Validation of Theorem 2: We apply three kinds of gated LISTA, with an alliance of gain gate functions (i.e., what has been introduced in Section 3.1.1), the exponential function, and the inverse proportional function respectively, to verify our theoretical results. They are named GLISTA (which is the abbreviation of gated LISTA), GLISTA-exp, and GLISTA-inv, respectively. From Figure 3(c), we see that when the models with such gain gates have "no false positive", all of them are superior to the standard LISTA without false positive as well, which is consistent with the conclusion of Theorem 2. In addition, from Figure 3(a), we can also see that there actually exist false positives in lower layers of GLISTA, but even without the auxiliary loss term, the evaluated FPR of our GLISTA and its variants approaches zero in higher layers, which is in good agreement with Theorem 4.

Figure 5: Comparison of different (a) overshoot gate functions, (b) gain gate functions, and (c) their combination, over five runs. The experiment is performed with SNR = 40 dB.

²The FPR here is slightly different from the general false positive rate in that it is calculated only on the obtained non-zero code components.
³Other functions can be adopted and the same results can be obtained.
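For reference, the NMSE of Eq. (20) and the FPR used in the validation above (the fraction of recovered non-zero components lying outside $\mathrm{supp}(x_s)$, per footnote 2) can be computed as follows; the `tol` threshold for declaring a component non-zero is our addition.

```python
import numpy as np

def nmse_db(x, x_s):
    """Normalized MSE in dB, Eq. (20)."""
    return 10.0 * np.log10(np.sum((x - x_s) ** 2) / np.sum(x_s ** 2))

def false_positive_rate(x, x_s, tol=0.0):
    """Fraction of the recovered non-zero components that fall outside supp(x_s)."""
    est = np.abs(x) > tol
    true = np.abs(x_s) > 0
    return float(np.sum(est & ~true)) / max(int(np.sum(est)), 1)
```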
4.1.2 COMPARISON WITH COMPETITORS

Empirical analyses for the gate functions: It should be interesting to compare the performance of our gates with different expressions. We test LISTA with the different overshoot gate functions introduced in Section 3.2 in Figure 5(a). Both of them are incorporated with their learnable parameters shared among layers. It can be seen from Figure 5(a) that the acceleration in convergence and the gain in final performance are obvious, just as expected. For LISTA with gain gates, one can check Figure 5(b). It can be seen that the performance degrades a lot if either the bias term or the $\mu_t$ term is removed. We also try different $f_t(\cdot)$ functions, including the ReLU-based piece-wise linear one and some possibly more nonlinear ones as mentioned in Section 3.1. We confirm that gate functions whose outputs are relatively closer to the boundary condition may perform better. Yet, it is worth noting that when the outputs of the inverse proportional function reach that boundary condition and it is applied uniformly to all layers, the performance degrades (see LISTA-inv-$\epsilon$ in Figure 5(b)). These results suggest an alliance of gain gate functions in practice. We further test a combination of gain gates and overshoot gates, even though the mechanism with gain gates alone is already good enough. See Figure 5(c): when overshoots are further incorporated, the convergence at lower layers becomes faster while the overall convergence is not affected much, leading to similar final performance when the model is very deep and superior performance when the model is relatively shallow.

Comparison with other state-of-the-art methods: We consider four state-of-the-art methods: LISTA with support selection (namely LISTA-C-S and LISTA-S, with and without the coupled constraint) (Chen et al., 2018), analytic LISTA with support selection (ALISTA-S) (Liu et al., 2019), and learned AMP (LAMP) (Borgerding et al., 2017), and their official implementations are directly used. The hyper-parameters are set following the papers (Borgerding et al., 2017; Chen et al., 2018). We compare our GLISTA with these competitive methods under different levels of noise (including signal-to-noise ratios (SNRs) equal to 40 dB, 20 dB, and 10 dB) and different condition numbers (including 3, 30, and 100, with SNR = 40 dB). See Figure 6 for comparisons between LISTA, LAMP, LISTA-S, LISTA-C-S, ALISTA-S, and our GLISTA in some of the settings. Obviously, the introduced gates facilitate LISTA significantly, and the concerned NMSE diminishes the fastest using GLISTA. See our appendix for comparisons of final performance after multiple runs and the results in other settings (i.e., SNR: 20 dB, 40 dB, and condition number: 3). We know from these results that using the gain gates alone can already outperform existing state-of-the-art methods, while incorporating the overshoot gates additionally may further boost the performance, as testified.

Applying our method to variants of LISTA: We also try adopting the introduced gates into some variants of LISTA to verify their generalization ability. Specifically, we incorporate the gain gates into LFISTA (Moreau & Bruna, 2017) and ALISTA (Liu et al., 2019) to obtain GLFISTA and AGLISTA, respectively. Since ALISTA is suggested to be implemented with support set selection in the original paper, i.e., ALISTA-S, we also compare with it.

Figure 6: Comparison of sparse coding methods in different settings over five runs: (a) SNR = 10 dB, (b) condition number = 30, (c) condition number = 100. Our GLISTA consistently outperforms the competitors in almost all test cases with different numbers of layers.
The experiment is performed under different levels of noise (40 dB, 20 dB, and 10 dB). As can be seen in Table 1, in which average results along with their standard deviations calculated over five runs are reported, models with our gain gates perform significantly better, which verifies that our method generalizes well.

Table 1: Comparison of LISTA and its variants (with and without gates) under different noise levels (NMSE, mean ± standard deviation over five runs).

SNR (dB) | LISTA        | GLISTA       | LFISTA       | GLFISTA      | ALISTA       | AGLISTA      | ALISTA-S
40       | -38.72 ± 0.09 | -45.22 ± 0.02 | -37.84 ± 0.32 | -38.30 ± 0.10 | -37.86 ± 0.35 | -42.30 ± 0.13 | -41.86 ± 0.04
20       | -18.65 ± 0.09 | -23.08 ± 0.03 | -20.90 ± 0.02 | -22.00 ± 0.07 | -17.38 ± 0.05 | -20.13 ± 0.03 | -20.00 ± 0.05
10       |  -9.42 ± 0.08 | -11.41 ± 0.02 | -10.67 ± 0.04 | -11.20 ± 0.01 |  -8.39 ± 0.04 |  -9.13 ± 0.02 |  -9.04 ± 0.02

4.2 PHOTOMETRIC STEREO ANALYSIS

We now test on a more practical task, i.e., photometric stereo analysis, using sparse coding. For a 3D object with a Lambertian surface, if there are $q$ different light conditions, a camera or some other kind of sensor can obtain $q$ different observations, all with noise caused by shadows and specularities. The observations can be represented as a vector $o \in \mathbb{R}^q$ for estimating the normal vector $v \in \mathbb{R}^3$ at any position on the surface. It is generally formulated as $o = \rho Lv + e$, in which $L \in \mathbb{R}^{q \times 3}$ represents the normalized light directions ($q$ directions), $e \in \mathbb{R}^q$ is a noise vector which is often sparse, and $\rho \in \mathbb{R}$ represents the albedo reflectivity. Our task is to obtain $v$ from $o$ and $L$, which is also known. The estimation of $e$ can be considered as a sparse coding problem, and one can use $L^{\dagger}(o - e)$ to recover $v$ given the estimation. More detailed descriptions of the task can be found in Xin et al.'s paper (2016). In the sparse coding problem, we have $Q \in \mathbb{R}^{(q-3) \times q}$ (the orthogonal complement of $L$) as the dictionary matrix (i.e., $A$ in Eq. (1)), $e$ as the sparse code to be estimated, and $Qo$ as the observation (i.e., $y$ in Eq. (1)). We mainly follow the settings in Xin et al.'s work, e.g., the vectors of $L$ are randomly selected from the hemispherical surface, except that we test with $q = 15, 25, 35$ and let 40% of the elements of $e$ be non-zero. We use GLISTA here to estimate $e$, and the final result for $v$ is calculated as $L^{\dagger}(o - e^*)$, where $L^{\dagger} \in \mathbb{R}^{3 \times q}$ is the pseudo-inverse of $L$ and $e^*$ is the estimation. Our method is compared with LISTA and two traditional methods where no explicit training is introduced, i.e., the original least squares (LS) and least $\ell_1$, in Table 2. Our evaluation metric is the mean (± standard deviation) error in degrees, and it is calculated using the bunny picture (Xin et al., 2016).

Table 2: Mean (± standard deviation) error in degrees with different numbers of observations over five runs.

q  | LS   | L1   | LISTA            | GLISTA
35 | 5.37 | 1.39 | 0.0237 ± 0.0026  | 0.00210 ± 0.00044
25 | 5.60 | 2.03 | 0.0429 ± 0.0068  | 0.00524 ± 0.00024
15 | 6.09 | 4.25 | 0.371 ± 0.046    | 0.0255 ± 0.0054

5 CONCLUSION

In this paper, we study LISTA for solving sparse coding problems. We discover its potential weaknesses and introduce gated mechanisms to address them accordingly. In particular, we theoretically prove that LISTA with gain gates can achieve faster convergence than the standard LISTA. We also discover that LISTA (with or without gates) can obtain lower reconstruction errors under a weaker assumption on false positives in its code estimations.
It helps us improve the convergence analyses to achieve more solid theoretical results, which have been perfectly confirmed in simulation experiments. The effectiveness of our introduced gates is verified in a variety of sparse coding experiments and the state-of-the-art performance is achieved. In the future, we aim to extend the method to convolutional neural networks to deal with more complex tasks. Published as a conference paper at ICLR 2020 Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. Mark Borgerding, Philip Schniter, and Sundeep Rangan. Amp-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing, 65(16):4293 4308, 2017. Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pp. 9061 9071, 2018. Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, 2014. Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 57(11):1413 1457, 2004. Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of statistics, 32(2):407 499, 2004. Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing, 15(12):3736 3745, 2006. Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440 1448, 2015. Raja Giryes, Yonina C Eldar, Alex M Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 66(7):1676 1690, 2018. Karol Gregor and Yann Le Cun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 399 406. Omnipress, 2010. Hao He, Bo Xin, Satoshi Ikehata, and David Wipf. From bayesian sparsity to gated recurrent nets. In Advances in Neural Information Processing Systems, pp. 5554 5564, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. John R Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. ar Xiv preprint ar Xiv:1409.2574, 2014. Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015. Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Alista: Analytic weights are as good as learned weights in lista. In Proceedings of the International Conference on Learning Representations, 2019. Thomas Moreau and Joan Bruna. 
Understanding trainable sparse coding via factorization. In Proceedings of the International Conference on Learning Representations, 2017. Jeremias Sulam, Aviad Aberdam, Amir Beck, and Michael Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 2019. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267 288, 1996. Zhangyang Wang, Qing Ling, and Thomas S Huang. Learning deep l0 encoders. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems, pp. 4340 4348, 2016. Published as a conference paper at ICLR 2020 Xu Xu, Xiaohan Wei, and Zhongfu Ye. Doa estimation based on sparse signal recovery utilizing weighted l1-norm penalty. IEEE signal processing letters, 19(3):155 158, 2012. Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861 2873, 2010. Jian Zhang and Bernard Ghanem. Ista-net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1828 1837, 2018. 6 PROOF OF THEOREMS AND PROPOSITIONS Before we delve deeply into the proof, we first give some importance notations. We define S as the support of the vector xs, i.e. S = supp(xs), and let |S| denote the number of elements in the set S. For a vector that shares the same size with xs, say z, we denote by z S R|S| a vector that keeps the elements with indices of z in S and removes the others. If the vectors have been introduced with subscripts already, e.g. xs, we use (xs)S to denote vectors obtained in such a manner. For a square matrix with the same number of row and column as the size of xs, say M, M(S, S) is its principal minor with the index set formed by removing rows and columns whose indices are not in S. Assume a vector x with no zero elements, sign( ) is defined as (sign(x))i = xi/|xi|, i.e. (sign(x))i = 1 when xi > 0, and (sign(x))i = 1 when xi < 0. 6.1 PROOF OF PROPOSITION 1 Recall that the update rule of LISTA is x(0) = 0 and x(t+1) = sb(t)(W (t)x(t) + U (t)y), t = 0, , d 1. (22) Proof. Recall the definition of S is S = supp(xs). For the shrinking function z = sb(t)(x) = sign(x)(|x| b(t))+ = x b(t)h(z), where h(z) = 1 if z > 0, h(z) = 1 if z < 0, and h(z) [ 1, 1] if z = 0. We use Mathematical Induction to prove supp(x(t)) S, t = 0, 1, . . . , d 1. We assume supp(x(t)) S. From the calculation of x(t+1) i , as W (t) = I U (t)A there is x(t+1) i = sb(t)((W (t)x(t) + U (t)y)i) = sb(t)((W (t)x(t) + U (t)Axs)i) = sb(t)(((I U (t)A)(x(t) xs))i + (xs)i) = ((I U (t)A)(x(t) xs))i + (xs)i b(t)h(x(t+1) i ). For i / S, (xs)i = 0. Let s assume x(t+1) i = 0, then h(x(t+1) i ) = sign(x(t+1) i ). Multiply the two sides of the Eq. (23) by sign(x(t+1) i ), as the b(t) = µ(A) supxs x(t) xs 1, there will be |x(t+1) i | = ((I U (t)A)(x(t) xs))isign(x(t+1) i ) b(t) = ((I U (t)A)(x(t) xs))isign(x(t+1) i ) µ(A) sup xs x(t) xs 1 µ(A) x(t) xs 1 µ(A) sup xs x(t) xs 1 0, the inequality holds for a b a b 1 and I U (t)A µ(A) (because of U (t) W(A)). From Eq. (24), we know |x(t+1) i | 0 actually is |x(t+1) i | = 0 which is in conflict with x(t+1) i = 0. 
Therefore, the x(t+1) i = 0, when i / S, i.e., supp(x(t+1)) S. As x(0) = 0 S, the supp(x(t)) S, t. The no false positive property has been proved. Published as a conference paper at ICLR 2020 According to Eq. (23), as support set of xs and x(t) are the subsets of S, there is x(t+1) i (xs)i = ((I U (t)A)(x(t) xs))i b(t)h(x(t+1) i ) j S (I U (t)A)ij(x(t) j (xs)j) b(t)h(x(t+1) i ) |x(t+1) i (xs)i| | X j S (I U (t)A)ij(x(t) j (xs)j)| + b(t). As supp(x(t+1)) S, accumulate all |x(t+1) i (xs)i| in Eq. (25) with i S, there is x(t+1) xs 1 X j S (I U (t)A)ij(x(t) j (xs)j) + |S|b(t) j S,i =j |(I U (t)A)ij||x(t) j (xs)j| + |S|b(t) (|S| 1)µ(A) x(t) xs 1 + |S|b(t). The second equation is because of U (t) W(A), so that |Wi,:A:,j| µ(A) when i = j and |Wi,:A:,j| = 1 when i = j. Substitute b(t) = µ(A) supxs x(t) xs 1 into Eq. (26), and take the supremum of Eq. (26). As |S| s there is sup xs x(t+1) xs 1 (s 1)µ(A) sup xs x(t) xs 1 + sµ(A) sup xs x(t) xs 1 (2s 1)µ(A) sup xs x(t) xs 1 ((2s 1)µ(A))t+1 sup xs x(0) xs 1. Let c = log((2s 1)µ(A)), the l2 error bound of t-th layer in LISTA should be calculated as x(t) xs 2 x(t) xs 1 sup xs x(t) xs 1 ((2s 1)µ(A))t sup xs x(0) xs 1 = exp(ct) sup xs x(0) xs 1 s B exp(ct), where the last inequality is deduced since (xs)i B, and xs 0 s. The linear convergence has been proved. Refer to the Eq. (25), as x(t+1) i = 0 when i / S, |x(t+1) i | |(xs)i| and x(t+1)(xs)i 0 hold certainly. We only concentrate on i S x(t+1) i (xs)i = ((I U (t)A)(x(t) xs))i b(t)h(x(t+1) i ). (29) If x(t+1) i = 0, there must be |x(t+1) i | = 0 (xs)i, and x(t+1) i (xs)i = 0. If x(t+1) i > 0, the according to Eq. (29), x(t+1) i (xs)i = ((I U (t)A)(x(t) xs))i b(t) = ((I U (t)A)(x(t) xs))i µ(A) supxs x(t) xs 1 0, i.e., 0 < x(t+1) i (xs)i, |x(t+1) i | |(xs)i|, and x(t+1) i (xs)i > 0. If x(t+1) i < 0, the according to Eq. (29), x(t+1) i (xs)i = ((I U (t)A)(x(t) xs))i + b(t) = ((I U (t)A)(x(t) xs))i+µ(A) supxs x(t) xs 1 0, i.e., 0 > x(t+1) i (xs)i, |x(t+1) i | |(xs)i|, and x(t+1) i (xs)i > 0. In conclusion, we can obtain |x(t+1) i | |(xs)i| and x(t+1)(xs)i 0 for all the situations. Published as a conference paper at ICLR 2020 6.2 PROOF OF THEOREM 1 Recall that the update rule of LISTA with gain gates is x(0) = 0 and x(t+1) = sb(t)(W (t)(gt(x(t), y|Λ(t) g ) x(t)) + U (t)y). (30) Proof. We assume that b(t) is a vector, i.e., b(t) Rn, in our proof to make it more general. According to definition of the shrinking function sb(t)( ) and y = Axs, Eq. (10) is x(t+1) = sb(t)(W (t)(gt(x(t), y|Λ(t) g ) x(t)) + U (t)y) = W (t)gt(x(t), y|Λ(t) g ) x(t) + U (t)y b(t) h(x(t+1)) = W (t)diag(gt(x(t), y|Λ(t) g ))x(t) + U (t)Axs b(t) h(x(t+1)) = W (t)diag(gt(x(t), y|Λ(t) g ))x(t) + U (t)Axs b(t) h(x(t+1)). Concentrate on the situation of t . Define gκ(xs) = gt(xs, y|Λ(t) g ) when t . In the main body of Theorem 1, xs satisfying xs 0 s is the fixed point of Eq. (10) when t . Eq. (31) is xs = W (t)diag(gκ(xs))xs + U (t)Axs b(t) h(xs). (32) The equation group of the indices in S in Eq. (32) is (xs)S = (((W (t)diag(gκ(xs)) + U (t)A)xs)S b(t) S h((xs)S) = (W (t)(S, S)diag(gκ((xs)S)) + (U (t)A)(S, S))(xs)S b(t) S h((xs)S). (33) Let (xs)S 0 but (xs)S = 0 so that h((xs)S) = sign((xs)S). As W (t), U (t), A and gκ(xs) = gt(xs, y|Λ(t) g ) are bounded, the right hand side of Eq. (33) is also tend to 0, which is b(t) S 0, as t . (34) As the S can be selected arbitrarily as long as |S| s, b(t) also satisfies b(t) 0, as t . (35) Substitute the b(t) S of Eq. (34) into Eq. 
(33), (xs)S is (xs)S = (W (t)(S, S)diag(gκ((xs)S))S + (U (t)A)(S, S))(xs)S, (36) where the W (t)(S, S) is defined at start of this section. Eq. (36) is (I U (t)A)(S, S)(xs)S = W (t)(S, S)diag(gκ((xs)S))(xs)S, (I U (t)A)(S, S)(xs)S = W (t)(S, S)diag((xs)S)gκ((xs)S), gκ((xs)S) = diag(((xs)S) 1)(W (t)(S, S)) 1(I U (t)A)(S, S)(xs)S, diag(((xs)S) 1)M(xs)S = gκ((xs)S), where M = (W (t)(S, S)) 1(I U (t)A)(S, S). The i-th row and j-th column element in M is denoted as mij. From Eq. (37), (gκ(xs))S is (gκ(xs))S = ((xs)S) 1 1 ((xs)S) 1 2 ... ((xs)S) 1 |S| m11 m12 ... m1|S| m21 m22 ... m2|S| ... ... ... ... m|S|1 m|S|2 ... m|S||S| ((xs)S)1 ((xs)S)2 ... ((xs)S)|S| P|S| i=1 m1i((xs)S)i ((xs)S)1 P|S| i=1 m2i((xs)S)i ((xs)S)2 ... P|S| i=1 m|S|i((xs)S)i Published as a conference paper at ICLR 2020 where this equation should hold for all xs in Assumption 1. Assume (xs)S 0, for gκ(xs) is bounded, we can conclude that mij = 0, if i = j. From Eq. (38), the final form of gκ((xs)S) is formulated as gκ((xs)S) = m11 m22 ... m|S||S| From Eq. (38), we can conclude that gκ(xs)i is a constant if i S, as the S could be arbitrary subset of {1, , n} as long as |S| s. We could deduce that gκ(xs)i is constant i {1, . . . , n} and gκ(xs) must be constant vector, i.e. diag(gκ(xs)) = D, (39) where D is an n n constant diagonal matrix. The first part of conclusion of Theorem 1 has been proved. Substitute b(t) in Eq. (34) and diag(gκ(xs)) in Eq. (39) into Eq. (32), Eq. (32) is. xs = (W (t)D + U (t)A)xs, xs = Zxs, (40) where Z = W (t)D + U (t)A = [Z1, Z2, . . . , Zn] and the Zi is the i-th column of Z. Give a xs satisfying only the i-th element of xs is non-zero and all the other elements are equal to zero, i.e., xs = [0, 0, . . . , ω, . . . , 0]T = ωei, in which ei is basis vector with only the i-th element being 1 and ω = 0. Substitute the xs = ωei into Eq. (40), there is ωei = [Z1, Z2, . . . , Zn][0, 0, . . . , ω, . . . , 0]T , ωei = ωZi, ω(ei Zi) = 0. As the Eq. (41) should hold for ω = 0, we can deduce that Zi = ei. As the i is selected arbitrarily, Z = W (t)D + U (t)A = [Z1, Z2, . . . , Zn] = [e1, e2, . . . , en] = I. Thus we have completed the proof and get W (t)D = (I U (t)A) as t . (42) 6.3 PROOF OF THEOREM 2 Recall that the update rule of LISTA with gain gates is x(0) = 0 and x(t+1) = sb(t)(W (t)(gt(x(t), y|Λ(t) g ) x(t)) + U (t)y). (43) Proof. We simplify the gt(x(t), y|Λ(t) g ) as gt(x(t)), and κt(x(t), y|Λ(t) g ) as κt(x(t)). According to the definition of gain gate in Eq. (43), we have x(t+1) = sb(t)(W (t)(x(t) gt(x(t)) + U (t)y) = sb(t)(W (t)(x(t) gt(x(t)) + U (t)Axs) = sb(t)((I U (t)A)(x(t) g(x(t)) xs) + xs) = (I U (t)A)(x(t) g(x(t)) xs) + xs b(t)h(x(t+1)). Simplify the x(t) g(x(t)) xs as gx(t). For the i-th equation in Eq. (44), and i / S, give the value of b(t) = µ(A) supxs x(t) g(x(t)) xs 1, there is x(t+1) i = ((I U (t)A)( gx(t)))i b(t)h(x(t+1) i ) = ((I U (t)A)( gx(t)))i µ(A) sup xs gx(t) 1h(x(t+1) i ). (45) Published as a conference paper at ICLR 2020 With almost the same proof process in Proposition 1, we could deduce that supp(x(t+1)) xs, (46) which is the no false poistive property. Recall the Eq. (44) and substitute the 1 + κt+1(x(t+1)) = gt+1(x(t+1)) into it: x(t+1) = ((I U (t)A)( gx(t))) b(t)h(x(t+1)) + xs, x(t+1)(1 + κt+1(x(t+1))) = (I U (t)A)( gx(t)) b(t)h(x(t+1)) + xs + x(t+1) κt(x(t+1)), gx(t+1) = (I U (t)A)( gx(t)) b(t)h(x(t+1)) + x(t+1) κt(x(t+1)). We shall calculate the non-zero | gx(t+1) i | with the index i. The i could be seperated to two parts. 
One is i S but i / supp(x(t+1)), another one part is i S and i supp(x(t+1)). Two kinds of i are discussed respectively. For i S but i / supp(x(t+1)), there must be x(t+1) i = 0 and 1 h(x(t+1) i ) 1. Select the i-th equation in Eq. (47), there is | gx(t+1) i | = |((I U (t)A)( gx(t)))i b(t)h(x(t+1) i )| j S,j =i | gx(t) j | + |b(t)|. (48) For i S and i supp(x(t+1)), there must be x(t+1) = 0 and h(x(t+1)) = sign(x(t+1)). Select the i-th equation in Eq. (47), there is | gx(t+1) i | = ((I U (t)A)( gx(t)))i b(t)h(x(t+1) i ) + x(t+1) i κt+1(x(t+1) i ) j S,j =i | gx(t) j | b(t)sign(x(t+1) i ) + sign(x(t+1) i )(|x(t+1) i |κt+1(x(t+1) i )) j S,j =i | gx(t) j | + (|x(t+1) i |κt+1(x(t+1) i ) b(t))sign(x(t+1) i ). According to the condition in Eq. (12) and (13), the 0 < κt(x) |x| < 2b(t 1). Then, |κt(x) |x| b(t 1)| < b(t 1), there must η < 1, so that |κt(x) |x| b(t 1)| ηb(t 1) < b(t 1). Substituting it to Eq. (49), there is | gx(t+1) i | µ(A) X j S,j =i | gx(t) i | + (|x(t+1) i |κt+1(x(t+1) i ) b(t))sign(x(t+1) i ) j S,j =i | gx(t) i | + ||x(t+1) i |κt+1(x(t+1) i ) b(t)| j S,j =i | gx(t) i | + ηb(t). Accumulate all the | gx(t+1) i | with all i S, and define s(t) = |supp(x(t))| as the number of non-zeros elements in x(t) there is gx(t+1) 1 X j S,j=i | gx(t) i | + (s(t+1)η + (|S| s(t+1)))b(t) j S,j=i | gx(t) i | + (s(t+1)η + (|S| s(t+1)))b(t) (|S| 1)µ(A) gx(t) 1 + (s(t+1)η + (|S| s(t+1)))µ(A) sup xs gx(t) 1. Published as a conference paper at ICLR 2020 Take the supremum of Eq. (51), let s(t) denote the infimum of s(t) with all of the xs X(B, s, 0), and ct = log((2s 1 s(t) (1 η))µ(A)), there is sup xs gx(t+1) 1 (s 1)µ(A) sup xs gx(t) 1 + (s(t+1) η + (s s(t+1) ))µ(A) sup xs gx(t) 1 (2s 1 + s(t+1) (1 η)) sup xs gx(t) 1 exp(ct+1) sup xs gx(t) 1 i=1 ci) sup xs gx(0) 1 exp( i=1 ci)s B. Eq. (52) gives the upper bound of supxs gx(t+1) 1, next we shall deduce the relationship between x(t) xs 1 and it. For the last layer (t-th layer), from Eq. (44), we have x(t+1) i = ((I U (t)A)( gx(t)))i µ(A) sup xs gx(t) 1h(x(t+1) i ) + (xs)i, x(t+1) i (xs)i = ((I U (t)A)( gx(t)))i µ(A)h(x(t+1) i ) sup xs gx(t) 1, |x(t+1) i (xs)i| |((I U (t)A)( gx(t)))i| + µ(A) sup xs gx(t) 1. Using almost the same process in Eq. (26) and Eq. (27), we could deduce Eq. (53) that x(t+1) xs 1 (2s 1)µ(A) sup xs gx(t) 1, x(t) xs 1 (2s 1)µ(A) sup xs gx(t 1) 1 (2s 1)µ(A) exp( i=1 ci + c)s B, where the third inequality sign holds because of Eq. (52) and the last equation holds because of c = log((2s 1)µ(A)). As at least ct c satisfies, there will be supxs x(t) xs 1 exp(ct)s B. Set t0 = log( s B xs 1 )/ log( 1 (2s 1)µ(A)) . When i > t0, supxs x(i) xs 1 exp(ci)s B < xs 1, x(i) xs 1 < xs 1, then the s(i) = |supp(x(i))| > 0, ci = log(2s 1 + s(i) (1 η))µ(A) < log((2s 1)µ(A)) = c. In conclusion, from Eq. (54), there is x(t) xs 2 x(t) xs 1 exp( i=1 ci + c)s B, (55) where c = log((2s 1)µ(A)), ci = c when i t0, and ci < c when i > t0. 6.4 PROOF OF THEOREM 3 Proof. For the t-th layer of the LISTA, according to the Eq. (23), we have x(t+1) i = ((I U (t)A)(x(t) xs))i + (xs)i b(t)h(x(t+1) i ). (56) Published as a conference paper at ICLR 2020 As we have remove the no false positive assumption, supp(x(t) i ) S. Define S(t) as i S(t) satisfies i supp(x(t)) but i / S. In order to calculate all of non-zero x(t+1) i (xs)i with index i, we divide the i into two kinds, i S and i S(t+1). If i S, from Eq. (56), we can deduce the same formulation as Eq. (25): |x(t+1) i (xs)i| X j supp(x(t)) (I U (t)A)ij(x(t) j (xs)j) + b(t) j supp(x(t)) |(I U (t)A)ij(x(t) j (xs)j)| + b(t). 
(57) If i S(t+1), then the (xs)i = 0 but x(t+1) i , then h(x(t+1) i ) = sign(x(t+1) i ). From Eq. (56), there is x(t+1) i = ((I U (t)A)(x(t) xs))i b(t)h(x(t+1) i ) = ((I U (t)A)(x(t) xs))i b(t)sign(x(t+1) i ). (58) Multiply sign(x(t+1) i ) on Eq. (58), we have |x(t+1) i | = ((I U (t)A)(x(t) xs))isign(x(t+1) i ) b(t), |x(t+1) i | + b(t) = ((I U (t)A)(x(t) xs))isign(x(t+1) i ), (|x(t+1) i | + b(t))sign(x(t+1) i ) = ((I U (t)A)(x(t) xs))i, which means ((I U (t)A)(x(t) xs))i have the same sign with sign(x(t+1) i ) because |x(t+1) i | > 0 and b(t) > 0, i.e., ((I U (t)A)(x(t) xs))i = |((I U (t)A)(x(t) xs))i|sign(x(t+1) i ). (60) From the Eq. (58), substitute Eq. (60) into Eq. (58), there is x(t+1) i (xs)i = x(t+1) i = sign(x(t+1) i )(| X j supp(x(t)) (I U (t)A)ij(x(t) j (xs)j)| |x(t+1) i (xs)i| = |x(t+1) i | = x(t+1) i sign(x(t+1) i ) j supp(x(t)) (I U (t)A)ij(x(t) j (xs)j)| b(t) j supp(x(t)) |(I U (t)A)ij(x(t) j (xs)j)| b(t) Accumulate all the |x(t+1) i (xs)i| with i supp(x(t+1)) supp(xs) = S(t+1) + S, there is x(t+1) xs 1 X j supp(x(t)) |(I U (t)A)ij(x(t) j (xs)j)| + (|S| |S(t+1)|)|b(t)|, (|S(t+1)| + |S|)µ(A) x(t) xs 1 + (|S| |S(t+1)|)b(t). Substitute the b(t) = ωt+1(kt+1|Θ)µ(A) supxs x(t) xs 1 into Eq. (62), and take its supremum of right part: x(t+1) xs 1 (|S(t+1)| + |S|)µ(A) sup xs x(t) xs 1 + (|S| |S(t+1)|)ωt+1(kt+1|Θ)µ(A) sup xs x(t) xs 1 (|S(t+1)| + |S| + (|S| |S(t+1)|)ωt+1(kt+1|Θ)) µ(A) sup xs x(t) xs 1. Published as a conference paper at ICLR 2020 Take the supremum of left part of (63), there is sup xs x(t+1) xs 1 exp(c t+1) sup xs x(t) xs 1, (64) where c t+1 = supxs log((|S(t+1)| + |S| + (|S| |S(t+1)|)ωt+1(kt+1|Θ))µ(A)). According to the definition of ωt( ), b(t) = ωt+1(kt+1|Θ)µ(A) supxs x(t) xs 1, so that the number of false positive is less or equal than kt+1, i.e. |S(t+1)| kt+1. According to the previous proof, when b(t) = µ(A) supxs x(t) xs 1, the number of false positive satisfies kt+1 = 0. That means that ω(0|Θ) 1 and ω(k|Θ) 1 when k > 0 4. The c t+1 should be c t+1 = sup xs log((|S(t+1)| + |S| + (|S| |S(t+1)|)ωt+1(kt+1|Θ))µ(A)) = log((s + kt+1 + (s kt+1)ωt+1(kt+1|Θ))µ(A)). (65) As assumption of ωt+1( |Θ), kt+1 0 , s.t. 0 < kt+1 0 < s, and ωt+1(kt+1 0 |Θ) < 1 1/(s kt+1 0 ). Select the value of b(t) so that kt+1 = kt+1 0 , we substitute it to Eq. (65), there will be c t+1 = log((s + kt+1 + (s kt+1)ωt+1(kt+1|Θ))µ(A)) < log((s + kt+1 + (s kt+1)(1 1 s kt+1 ))µ(A)) = log((2s 1)µ(A)). Recall the Eq. (64), we have sup xs x(t+1) xs 1 exp(c t+1) sup xs x(t) xs 1 i=1 c i ) sup xs x(0) xs 1 i=1 c i )s B. The l2 error bound of the t-th layer of LISTA is x(t) xs 2 x(t) xs 1 sup xs x(t) xs 1 s B exp( i=1 c i ), (68) where c i < log((2s 1)µ(A)). 6.5 PROOF OF THEOREM 4 Proof. For the t-th layer given in Eq. (10), according to Eq. (47), x(t+1) = ((I U (t)A)( gx(t))) b(t)h(x(t+1)) + xs, (69) and gx(t+1) = (I U (t)A)( gx(t)) b(t)h(x(t+1)) + x(t+1) κt(x(t+1)). (70) As the no false positive is not fit for x(t), x(t) S. We still define S(t) as i S(t) satisfies i supp(x(t) i ) but i / S and define S(t) as i S(t) satisfies i S and i supp(x(t)). In order to calculate the non-zero gx(t+1) i , we divide the i into three situations: i S(t+1), i / S(t+1) but i S, and i S(t+1). 4According to the definition of ωt, ωt must be a monotonic decreasing function and ωt(k|Θ) < 1 when k > 0. Published as a conference paper at ICLR 2020 For i S(t+1), there must be x(t+1) i = 0 , and (xs)i = 0. Substitute the form of κt into i-th equation of Eq. 
6.5 PROOF OF THEOREM 4

Proof. For the $t$-th layer given in Eq. (10), according to Eq. (47),
$$x^{(t+1)} = (I - U^{(t)}A)\,gx^{(t)} - b^{(t)} h(x^{(t+1)}) + x_s, \quad (69)$$
and
$$gx^{(t+1)} = (I - U^{(t)}A)\,gx^{(t)} - b^{(t)} h(x^{(t+1)}) + x^{(t+1)} \odot \kappa_t(x^{(t+1)}). \quad (70)$$

As the no-false-positive property does not hold for $x^{(t)}$, $\mathrm{supp}(x^{(t)}) \not\subset S$ in general. We still define $\bar S^{(t)}$ so that $i \in \bar S^{(t)}$ if and only if $i \in \mathrm{supp}(x^{(t)})$ but $i \notin S$, and define $\hat S^{(t)}$ so that $i \in \hat S^{(t)}$ if and only if $i \in S$ and $i \in \mathrm{supp}(x^{(t)})$. In order to bound the non-zero $gx^{(t+1)}_i$, we divide the indices $i$ into three situations: $i \in \hat S^{(t+1)}$; $i \notin \hat S^{(t+1)}$ but $i \in S$; and $i \in \bar S^{(t+1)}$.

For $i \in \hat S^{(t+1)}$, there must be $x^{(t+1)}_i \ne 0$ and $(x_s)_i \ne 0$. Substituting the form of $\kappa_t$ into the $i$-th equation of Eq. (70):
$$\begin{aligned}
gx^{(t+1)}_i &= ((I - U^{(t)}A)\,gx^{(t)})_i - b^{(t)}\mathrm{sign}(x^{(t+1)}_i) + \mu_{t+1}b^{(t)}\mathrm{sign}(x^{(t+1)}_i) = ((I - U^{(t)}A)\,gx^{(t)})_i - (1 - \mu_{t+1})b^{(t)}\mathrm{sign}(x^{(t+1)}_i), \\
|gx^{(t+1)}_i| &\le |((I - U^{(t)}A)\,gx^{(t)})_i| + (1 - \mu_{t+1})b^{(t)},
\end{aligned} \quad (71)$$
where we have assumed $\mu_t \le 1$.

For $i \notin \hat S^{(t+1)}$ but $i \in S$, $x^{(t+1)}_i = 0$ and $(x_s)_i \ne 0$. The $i$-th equation of Eq. (70) is
$$gx^{(t+1)}_i = ((I - U^{(t)}A)\,gx^{(t)})_i - b^{(t)} h(x^{(t+1)}_i), \qquad |gx^{(t+1)}_i| \le |((I - U^{(t)}A)\,gx^{(t)})_i| + b^{(t)}. \quad (72)$$

For $i \in \bar S^{(t+1)}$, $(x_s)_i = 0$, and
$$gx^{(t+1)}_i = x^{(t+1)}_i g_t(x^{(t+1)}_i) = ((I - U^{(t)}A)\,gx^{(t)})_i - (1 - \mu_{t+1})b^{(t)}\mathrm{sign}(x^{(t+1)}_i). \quad (73)$$
As $g_t(x^{(t+1)}_i) \ge 1$, the sign of $x^{(t+1)}_i g_t(x^{(t+1)}_i)$ is the same as that of $x^{(t+1)}_i$. Multiplying $\mathrm{sign}(x^{(t+1)}_i)$ on Eq. (73), there is
$$|x^{(t+1)}_i g_t(x^{(t+1)}_i)| = ((I - U^{(t)}A)\,gx^{(t)})_i\,\mathrm{sign}(x^{(t+1)}_i) - (1 - \mu_{t+1})b^{(t)}, \qquad |x^{(t+1)}_i g_t(x^{(t+1)}_i)| + (1 - \mu_{t+1})b^{(t)} = ((I - U^{(t)}A)\,gx^{(t)})_i\,\mathrm{sign}(x^{(t+1)}_i), \quad (74)$$
which means that $((I - U^{(t)}A)\,gx^{(t)})_i$ has the same sign as $\mathrm{sign}(x^{(t+1)}_i)$, i.e.
$$gx^{(t+1)}_i = \mathrm{sign}(x^{(t+1)}_i)\big(|((I - U^{(t)}A)\,gx^{(t)})_i| - |(1 - \mu_{t+1})b^{(t)}|\big), \qquad |gx^{(t+1)}_i| \le |((I - U^{(t)}A)\,gx^{(t)})_i| - (1 - \mu_{t+1})b^{(t)}. \quad (75)$$

Accumulating all the $|gx^{(t+1)}_i|$ over $i \in \mathrm{supp}(x^{(t+1)}) \cup \mathrm{supp}(x_s)$, there is
$$\begin{aligned}
\|gx^{(t+1)}\|_1 &= \sum_{i \in \bar S^{(t+1)} \cup \hat S^{(t+1)} \cup (S \setminus \hat S^{(t+1)})} |gx^{(t+1)}_i| \\
&\le \sum_{i \in \mathrm{supp}(x^{(t+1)}) \cup S} |((I - U^{(t)}A)\,gx^{(t)})_i| + \big((|\hat S^{(t+1)}| - |\bar S^{(t+1)}|)(1 - \mu_{t+1}) + (|S| - |\hat S^{(t+1)}|)\big)b^{(t)} \\
&\le (|S| + |\bar S^{(t+1)}|)\mu(A)\|gx^{(t)}\|_1 + \big((|\hat S^{(t+1)}| - |\bar S^{(t+1)}|)(1 - \mu_{t+1}) + (|S| - |\hat S^{(t+1)}|)\big)b^{(t)}.
\end{aligned} \quad (76)$$

Substituting $b^{(t)} = \omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1$ into Eq. (76) and taking the supremum of the right part of Eq. (76):
$$\begin{aligned}
\|gx^{(t+1)}\|_1 &\le (|S| + |\bar S^{(t+1)}|)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1 + \big((|\hat S^{(t+1)}| - |\bar S^{(t+1)}|)(1 - \mu_{t+1}) + (|S| - |\hat S^{(t+1)}|)\big)\omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1 \\
&\le \Big(|S| + |\bar S^{(t+1)}| + \big((|\hat S^{(t+1)}| - |\bar S^{(t+1)}|)(1 - \mu_{t+1}) + (|S| - |\hat S^{(t+1)}|)\big)\omega_{t+1}(k_{t+1}|\Theta)\Big)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1.
\end{aligned} \quad (77)$$

Let
$$c''_{t+1} = \sup_{x_s}\log\Big(\Big(|S| + |\bar S^{(t+1)}| + \big((|\hat S^{(t+1)}| - |\bar S^{(t+1)}|)(1 - \mu_{t+1}) + (|S| - |\hat S^{(t+1)}|)\big)\omega_{t+1}(k_{t+1}|\Theta)\Big)\mu(A)\Big). \quad (78)$$

Taking the supremum of the left part of Eq. (77), $\sup_{x_s}\|gx^{(t+1)}\|_1$ satisfies
$$\sup_{x_s}\|gx^{(t+1)}\|_1 \le \exp(c''_{t+1})\sup_{x_s}\|gx^{(t)}\|_1 \le \exp\Big(\sum_{i=1}^{t+1} c''_i\Big)\sup_{x_s}\|gx^{(0)}\|_1 \le \exp\Big(\sum_{i=1}^{t+1} c''_i\Big)\, sB. \quad (79)$$

After the upper bound of $\|gx^{(t)}\|_1$ is deduced, we consider the relationship between $x^{(t+1)} - x_s$ and $gx^{(t)}$.

If $i \in \bar S^{(t+1)}$, $(x_s)_i = 0$ and $x^{(t+1)}_i \ne 0$. The $i$-th equation in Eq. (69) is
$$x^{(t+1)}_i - (x_s)_i = ((I - U^{(t)}A)\,gx^{(t)})_i - b^{(t)}\mathrm{sign}(x^{(t+1)}_i). \quad (80)$$
According to analyses similar to the previous ones, the sign of $((I - U^{(t)}A)\,gx^{(t)})_i$ is the same as $\mathrm{sign}(x^{(t+1)}_i)$, and $x^{(t+1)}_i - (x_s)_i$ satisfies
$$x^{(t+1)}_i - (x_s)_i = x^{(t+1)}_i = \big(|((I - U^{(t)}A)\,gx^{(t)})_i| - b^{(t)}\big)\mathrm{sign}(x^{(t+1)}_i), \qquad |x^{(t+1)}_i - (x_s)_i| = |x^{(t+1)}_i| = |((I - U^{(t)}A)\,gx^{(t)})_i| - b^{(t)}. \quad (81)$$

If $i \in S$, $x^{(t+1)}_i - (x_s)_i$ satisfies
$$|x^{(t+1)}_i - (x_s)_i| \le |((I - U^{(t)}A)\,gx^{(t)})_i| + b^{(t)}. \quad (82)$$

Accumulating all the $|x^{(t+1)}_i - (x_s)_i|$ over $i \in \mathrm{supp}(x^{(t+1)}) \cup \mathrm{supp}(x_s)$, there is
$$\|x^{(t+1)} - x_s\|_1 \le \sum_{i \in \mathrm{supp}(x^{(t+1)}) \cup S} |((I - U^{(t)}A)\,gx^{(t)})_i| + (|S| - |\bar S^{(t+1)}|)|b^{(t)}| \le (|S| + |\bar S^{(t+1)}|)\mu(A)\|gx^{(t)}\|_1 + (|S| - |\bar S^{(t+1)}|)b^{(t)}. \quad (83)$$

Substituting $b^{(t)} = \omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1$ into Eq. (83) and taking the supremum of $\|gx^{(t)}\|_1$:
$$\begin{aligned}
\|x^{(t+1)} - x_s\|_1 &\le (|S| + |\bar S^{(t+1)}|)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1 + (|S| - |\bar S^{(t+1)}|)\omega_{t+1}(k_{t+1}|\Theta)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1 \\
&\le \big(|S| + |\bar S^{(t+1)}| + (|S| - |\bar S^{(t+1)}|)\omega_{t+1}(k_{t+1}|\Theta)\big)\mu(A)\sup_{x_s}\|gx^{(t)}\|_1.
\end{aligned} \quad (84)$$

Let $\bar c$ be
$$\bar c = \log\big((|S| + |\bar S^{(t)}| + (|S| - |\bar S^{(t)}|)\omega_t(k_t|\Theta))\mu(A)\big). \quad (85)$$

Substituting Eq. (79) and (85) into Eq. (84), the $\ell_2$ error bound of LISTA with gain gates should be
$$\|x^{(t)} - x_s\|_2 \le \|x^{(t)} - x_s\|_1 \le \exp(\bar c)\sup_{x_s}\|gx^{(t-1)}\|_1 \le \exp\Big(\sum_{i=1}^{t-1} c''_i + \bar c\Big)\, sB. \quad (86)$$

Then we shall discuss the values of $c''_i$ and $\bar c$. Let $t_0 = \big\lceil \log(sB/\sigma) / \log\big(1/((2s - 1)\mu(A))\big) \big\rceil$.
When $i \le t_0$, since $\mu_i = 0$, which means the gain gate is not applied, there is
$$c''_i = \sup_{x_s}\log\big((|S| + |\bar S^{(i)}| + (|S| - |\bar S^{(i)}|)\omega_{i+1}(k_{i+1}|\Theta))\mu(A)\big) = \log\big((s + k_{i+1} + (s - k_{i+1})\omega_{i+1}(k_{i+1}|\Theta))\mu(A)\big). \quad (87)$$
There exists $k^0_i$ s.t. $0 < k^0_i < s$ and $\omega_i(k^0_i|\Theta) < 1 - 1/(s - k^0_i)$. Following the main process in the proof of Theorem 3, let $k_i = k^0_i$ with $0 < k_i < s$; then $c''_i = c'_i < \log((2s - 1)\mu(A))$.

When $i > t_0$, $\sup_{x_s}\|x^{(i)} - x_s\|_1 \le sB\exp(ic) < \sigma \le \min_{i \in \mathrm{supp}(x_s)} |(x_s)_i|$. Since $\sigma$ is no larger than the minimal absolute value of the non-zero elements of $x_s$, we have $\hat S^{(i)} = S$. Select $b^{(i)}$ so that $k_i = 0$; then $\omega_i(k_i|\Theta) \ge 1$. Recalling the form in Eq. (78), since $\sup_{x_s}|\bar S^{(i)}| = k_i = 0$, $c''_i$ is
$$c''_i = \sup_{x_s}\log\Big(\Big(|S| + |\bar S^{(i)}| + \big((|\hat S^{(i)}| - |\bar S^{(i)}|)(1 - \mu_i) + (|S| - |\hat S^{(i)}|)\big)\omega_i(k_i|\Theta)\Big)\mu(A)\Big) \le \log\big((s + s(1 - \mu_i))\mu(A)\big). \quad (88)$$
As $1 - \omega_i(s|\Theta) < \mu_i \le 1$,
$$c''_i \le \log\big((s + s(1 - \mu_i))\mu(A)\big) < \log\big((s + s\,\omega_i(s|\Theta))\mu(A)\big) < \log\big((s + s\,\omega_i(k^0_i|\Theta))\mu(A)\big) < \log\big((s + k^0_i + (s - k^0_i)\omega_i(k^0_i|\Theta))\mu(A)\big) = c'_i, \quad (89)$$
i.e., $c''_i < c'_i$. As $\omega_i(\cdot|\Theta)$ is a monotonically decreasing function, the second $<$ in Eq. (89) holds since $k^0_i < s$ and the last $<$ holds since $k^0_i > 0$.

Finally, $|S| \le s$ and $|\bar S^{(t)}| \le k_t$. According to the assumption on $\omega_t(\cdot|\Theta)$, there exists $k^0_t$ s.t. $0 < k^0_t < s$ and $\omega_t(k^0_t|\Theta) < 1 - 1/(s - k^0_t)$. Select $b^{(t)}$ to let $k_t = k^0_t$. By a derivation similar to that of Theorem 3, $\bar c$ in Eq. (85) satisfies $\bar c \le \log((s + k_t + (s - k_t)\omega_t(k_t|\Theta))\mu(A)) < \log((2s - 1)\mu(A))$. All of the conclusions in Theorem 4 have been proven.

6.6 PROOF OF PROPOSITION 2

Recall that the update rule of ISTA is $x^{(0)} = 0$ and
$$x^{(t+1)} = s_{\lambda/\gamma}\big(x^{(t)} - \nabla_x f(x^{(t)})/\gamma\big). \quad (90)$$
We analyze this update rule together with
$$\eta^* := \arg\min_\eta\, f\big(\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}, y\big) + \lambda\big\|\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}\big\|_1. \quad (91)$$

Proof. According to the analysis in Section 2 of the main paper, $x^{(t+1)}$ is the solution of minimizing the upper bound $U(x)$,
$$U(x) := f(x^{(t)}, y) + (x - x^{(t)})^T\nabla_x f(x^{(t)}) + \frac{\gamma}{2}\|x - x^{(t)}\|^2 + \lambda r(x). \quad (92)$$
The sub-gradient of $U(x)$ is
$$\partial_x U(x) = \nabla_x f(x^{(t)}) + \gamma(x - x^{(t)}) + \lambda\,\partial_x r(x). \quad (93)$$
As $x^{(t+1)}$ is the optimal solution of minimizing Eq. (92), $\partial_x U(x^{(t+1)})$ satisfies
$$0 \in \partial_x U(x^{(t+1)}) = \nabla_x f(x^{(t)}) + \gamma(x^{(t+1)} - x^{(t)}) + \lambda\,\partial_x r(x^{(t+1)}), \quad (94)$$
where $r(x) = \|x\|_1$. According to the definition of the sub-gradient, $(\partial_x r(x))_i \in [-1, 1]$ when $x_i = 0$, $(\partial_x r(x))_i = -1$ when $x_i < 0$, and $(\partial_x r(x))_i = 1$ when $x_i > 0$. From Eq. (94), there must exist $r_1 \in \partial r(x^{(t+1)})$ such that
$$\nabla_x f(x^{(t)}) + \gamma(x^{(t+1)} - x^{(t)}) + \lambda r_1 = 0, \quad (95)$$
where $(r_1)_i = 1$ if $x^{(t+1)}_i > 0$, $(r_1)_i = -1$ if $x^{(t+1)}_i < 0$, and $-1 \le (r_1)_i \le 1$ if $x^{(t+1)}_i = 0$.

According to the definition of $\eta^*$ in Eq. (91), we define a new function $\theta(\eta)$ as
$$\theta(\eta) = f\big(\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}, y\big) + \lambda\big\|\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}\big\|_1. \quad (96)$$
Notice that $\theta(\eta)$ is the line-search function of $f(x, y) + \lambda\|x\|_1$. By the rules of convex optimization, as $f(x, y) + \lambda\|x\|_1$ is a convex function, $\theta(\eta)$ must also be a convex function of $\eta$. The sub-gradient of $\theta(\eta)$ is
$$\partial_\eta\theta(\eta) = (x^{(t+1)} - x^{(t)})^T\nabla_x f\big(\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}\big) + \lambda(x^{(t+1)} - x^{(t)})^T\partial_x r\big(\eta(x^{(t+1)} - x^{(t)}) + x^{(t)}\big). \quad (97)$$
$\eta^*$ is exactly the value minimizing $\theta(\eta)$ in Eq. (96), so there must be
$$0 \in \partial_\eta\theta(\eta^*). \quad (98)$$
From Eq. (97), the sub-gradient of $\theta(\eta)$ at $\eta = 1$ is
$$\begin{aligned}
\partial_\eta\theta(1) &= (x^{(t+1)} - x^{(t)})^T\big(\nabla_x f(x^{(t+1)}) + \lambda\,\partial_x r(x^{(t+1)})\big) \\
&= (x^{(t+1)} - x^{(t)})^T\big(\nabla_x f(x^{(t+1)}) - \nabla_x f(x^{(t)}) + \nabla_x f(x^{(t)}) + \lambda\,\partial_x r(x^{(t+1)})\big) \\
&= (x^{(t+1)} - x^{(t)})^T\big(\nabla^2_x f(\zeta)(x^{(t+1)} - x^{(t)}) + \nabla_x f(x^{(t)}) + \lambda\,\partial_x r(x^{(t+1)})\big),
\end{aligned} \quad (99)$$
where the last equation holds by Lagrange's mean value theorem with some $\zeta \in \mathbb{R}^n$. Substituting $\nabla_x f(x^{(t)})$ from Eq. (95) into Eq. (99), $\partial_\eta\theta(1)$ becomes
$$\begin{aligned}
\partial_\eta\theta(1) &= (x^{(t+1)} - x^{(t)})^T\big(\nabla^2_x f(\zeta)(x^{(t+1)} - x^{(t)}) - \gamma(x^{(t+1)} - x^{(t)}) - \lambda r_1 + \lambda\,\partial_x r(x^{(t+1)})\big) \\
&= (x^{(t+1)} - x^{(t)})^T\big((\nabla^2_x f(\zeta) - \gamma I)(x^{(t+1)} - x^{(t)}) + \lambda(\partial r(x^{(t+1)}) - r_1)\big) \\
&= (x^{(t+1)} - x^{(t)})^T(\nabla^2_x f(\zeta) - \gamma I)(x^{(t+1)} - x^{(t)}) + \lambda\sum_i (x^{(t+1)}_i - x^{(t)}_i)\big((\partial r(x^{(t+1)}))_i - (r_1)_i\big),
\end{aligned} \quad (100)$$
where the sub-gradient $\partial r(x^{(t+1)})$ is a set. For all $r' \in \partial r(x^{(t+1)})$,
$$(x^{(t+1)} - x^{(t)})^T(\nabla^2_x f(\zeta) - \gamma I)(x^{(t+1)} - x^{(t)}) + \lambda\sum_i (x^{(t+1)}_i - x^{(t)}_i)\big((r')_i - (r_1)_i\big) \in \partial_\eta\theta(1). \quad (101)$$

We prove $\eta^* \ge 1$ by contradiction. According to the properties of convex functions and sub-gradients, assume there exists $\eta^* < 1$ s.t. $0 \in \partial_\eta\theta(\eta^*)$; then every $r_\theta \in \partial_\eta\theta(1)$ satisfies $r_\theta \ge 0$. However, as $r_1 \in \partial r(x^{(t+1)})$, we may substitute $r' = r_1$ into Eq. (101). The corresponding element of the sub-gradient when $r' = r_1$ is $r_\theta = (x^{(t+1)} - x^{(t)})^T(\nabla^2_x f(\zeta) - \gamma I)(x^{(t+1)} - x^{(t)}) \in \partial_\eta\theta(1)$. According to the given condition $\gamma I - \nabla^2_x f(x) \succ 0$, $r_\theta < 0$, which contradicts $r_\theta \ge 0$. Therefore, the conclusion $\eta^* \ge 1$ is obtained.

Moreover, consider the last term of Eq. (100), i.e.
$$\sum_i (x^{(t+1)}_i - x^{(t)}_i)\big((\partial r(x^{(t+1)}))_i - (r_1)_i\big). \quad (102)$$
If $\mathrm{supp}(x^{(t)}) \subset \mathrm{supp}(x^{(t+1)})$, there are two situations for the index $i$: 1) $i \in \mathrm{supp}(x^{(t+1)})$, in which case $x^{(t+1)}_i \ne 0$ and $(\partial r(x^{(t+1)}))_i = (r_1)_i = \mathrm{sign}(x^{(t+1)}_i)$; 2) $i \notin \mathrm{supp}(x^{(t+1)})$ and $i \notin \mathrm{supp}(x^{(t)})$, in which case $x^{(t+1)}_i = x^{(t)}_i = 0$. Both conditions make the term $(x^{(t+1)}_i - x^{(t)}_i)((\partial r(x^{(t+1)}))_i - (r_1)_i)$ in Eq. (102) be $0$. Therefore, Eq. (102) is
$$\sum_i (x^{(t+1)}_i - x^{(t)}_i)\big((\partial r(x^{(t+1)}))_i - (r_1)_i\big) = 0. \quad (103)$$
According to the given condition $\gamma I - \nabla^2_x f(x) \succ 0$ and Eq. (103), $\partial_\eta\theta(1)$ is a single number rather than a set, and $\partial_\eta\theta(1) = (x^{(t+1)} - x^{(t)})^T(\nabla^2_x f(\zeta) - \gamma I)(x^{(t+1)} - x^{(t)}) < 0$. As $\theta(\eta)$ is a convex function, there must be $\eta^* > 1$ because of $0 \in \partial_\eta\theta(\eta^*)$. The conclusion $\eta^* > 1$ is derived.

Figure 7: Experimental results validating our Proposition 2. It can be observed that the update of ISTA lags behind.

Figure 8: Comparison of the overshoot and gain gates with similar methods over five runs.

Figure 9: Comparison of sparse coding methods under different settings over five runs: (a) SNR=20dB, (b) SNR=40dB, (c) condition number=3. Our GLISTA consistently outperforms the competitors in almost all test cases with different numbers of layers.

7 MORE SIMULATION EXPERIMENTS

Validation of Proposition 2: Some more experimental results are given here due to the length limit of the main body of our paper. One might also be interested in our Proposition 2, hence we first conduct an experiment to confirm it. We adopt ISTA with an adaptive overshoot and compare it with the standard ISTA for sparse coding. The adaptation is obtained via enlarging the step size from 1.0 through backtracking line search (see Section 8 for more details). Figure 7 demonstrates that our overshoot mechanism facilitates ISTA optimization, and such a result confirms Proposition 2.
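For readers who want a quick sanity check of Proposition 2, the following minimal sketch (not the authors' experimental code) builds a small Lasso instance with $f(x, y) = \tfrac{1}{2}\|y - Ax\|^2$, takes an ISTA step, and scans the line-search objective $\theta(\eta)$ along the update direction; the step parameter $\gamma$ and all variable names are our own assumptions.

```python
# Minimal numerical sketch of the Proposition 2 check (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
m, n, s, lam = 25, 50, 4, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

gamma = 1.1 * np.linalg.eigvalsh(A.T @ A).max()    # gamma*I - A^T A is positive definite

def soft(v, tau):                                  # soft-thresholding operator s_tau(.)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(x):                                  # f(x, y) + lambda * ||x||_1
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.abs(x).sum()

x_t = np.zeros(n)                                  # run a few plain ISTA iterations first
for _ in range(5):
    x_t = soft(x_t - A.T @ (A @ x_t - y) / gamma, lam / gamma)
x_next = soft(x_t - A.T @ (A @ x_t - y) / gamma, lam / gamma)   # one more ISTA step

etas = np.linspace(0.0, 3.0, 3001)                 # scan theta(eta) along the ISTA direction
theta = np.array([objective(x_t + e * (x_next - x_t)) for e in etas])
print("minimizing eta along the ISTA direction:", etas[theta.argmin()])
```

With such toy setups the minimizing $\eta$ lands at or above 1, which is what Proposition 2 predicts under the condition $\gamma I - \nabla^2_x f(x) \succ 0$.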
Comparison with similar methods: As mentioned in the main body of the paper, the overshoot gate is proposed to address insufficient step sizes, which is similar to the motivation of (L)FISTA. LIHT and support selection can also be considered as special cases of our gain gates (by letting $\mu_t = 1$ in the inverse proportional function). We compare these similar methods with our overshoot and gain gates in Figure 8. It can be seen that, when compared with LISTA, LFISTA converges faster in lower layers, and our overshoot gates also show such an advantage. When applied to deeper layers, LFISTA converges quite slowly while the overshoot gates still perform well, which indicates that the time-varying property is beneficial in practice. LISTA with our gain gates is obviously better than LIHT, as shown in Figure 8(b), and sufficient experimental results in the paper also show that the gain gate outperforms support selection (e.g., in LISTA-C-S and LISTA-S).

Comparison under less challenging settings: We also give sparse coding results under the described less challenging settings on the noise level and the condition number in Figure 9. Compared with LISTA-CP, LAMP, LISTA-SS, and LISTA-CP-SS, our gated LISTA (GLISTA) performs remarkably better with less ill-posed dictionary matrices and less noise. Tables 3 and 4 report the means and standard deviations of five runs using different methods. It can be seen that the improvement achieved by our GLISTA is significant.

Table 3: Comparison of the final NMSEs (in dB) under different noise levels with d = 16. The condition number of the dictionary is not specifically constrained.

SNR | LISTA | LAMP | LISTA-S | LISTA-C-S | ALISTA-S | GLISTA (ours)
40  | -38.72 ± 0.09 | -36.77 ± 0.60 | -41.99 ± 0.09 | -44.85 ± 0.02 | -41.86 ± 0.04 | -45.22 ± 0.02
20  | -18.65 ± 0.09 | -18.66 ± 0.09 | -20.64 ± 0.06 | -22.84 ± 0.02 | -20.00 ± 0.05 | -23.08 ± 0.03
10  |  -9.42 ± 0.08 |  -9.46 ± 0.66 |  -9.84 ± 0.02 | -11.06 ± 0.01 |  -9.04 ± 0.02 | -11.41 ± 0.02

Table 4: Comparison of the final NMSEs (in dB) under different condition numbers with d = 16. The noise level is chosen as SNR=40dB for all the tested condition numbers.

Con. num. | LISTA | LAMP | LISTA-S | LISTA-C-S | ALISTA-S | GLISTA (ours)
3   | -39.03 ± 0.54 | -37.26 ± 0.13 | -43.12 ± 0.06 | -44.90 ± 0.03 | -43.88 ± 0.26 | -45.33 ± 0.04
30  | -29.65 ± 0.89 | -28.44 ± 0.31 | -32.30 ± 0.17 | -38.36 ± 0.57 | -31.50 ± 0.15 | -39.61 ± 0.64
100 | -21.39 ± 0.75 | -22.23 ± 0.18 | -27.08 ± 0.57 | -27.94 ± 0.34 | -27.10 ± 0.02 | -34.07 ± 0.64

8 PROGRESSIVE TRAINING AND ADAPTIVE OVERSHOOT

Our training mostly follows that of Chen et al. (2018), and some key steps are listed here: 1) The model is trained progressively to include more layers during the training phase. At the very beginning, only the learnable parameters in the first layer are considered, and the parameters in the second layer are only included once training on the first layer converges, and so on for the third and higher layers. 2) Training after including the $t$-th layer is split into three stages, with an initial learning rate of 0.0005 to optimize its own learnable parameters first, and learning rates of 0.0001 and 0.00001 to jointly optimize all learnable parameters from the 0-th to the $t$-th layers in the second and third stages, respectively. We move to the next stage once no performance gain is observed on the validation set for 4000 iterations. 3) With the three stages done on the $t$-th layer, training moves on to include the $(t+1)$-th layer, and the same three stages of training are performed.

We perform an adaptive overshoot in the experiment to confirm Proposition 2. The algorithm is summarized in Algorithm 1. Most of the input variables are introduced in the main body of our paper, and $\tau$ is given as the step size for performing the line search. The whole procedure is very similar to the well-known backtracking line search. The step size $\eta$ for sparse coding is enlarged by a factor of $\tau$ until the objective function $f(x, y) + \lambda r(x)$, denoted $f_o(x, y)$ below, does not decrease any more. A runnable sketch of the algorithm follows the pseudocode.

Algorithm 1 ISTA with adaptive overshoot.
Input: The dictionary matrix $A$, an observation $y$, an initial step size $\eta_0 = 1.0$ for sparse coding, a step size $\tau = 1.05$ for line search, and a maximal number of iterations $K$.
Output: the estimated sparse code $x^{(K-1)}$.
1: $x^{(0)} = 0$;
2: for $t = 1, \dots, K-1$ do
3:   $\tilde x^{(t)} = s_{\lambda/\gamma}\big((I - A^TA/\gamma)x^{(t-1)} + A^Ty/\gamma\big)$;
4:   $x_p = \tilde x^{(t)}$, $\eta = \eta_0\tau$;
5:   $x_c = \tau(\tilde x^{(t)} - x^{(t-1)}) + x^{(t-1)}$;
6:   while $f_o(x_p, y) > f_o(x_c, y)$ do
7:     $x_p = x_c$, $\eta = \tau\eta$;
8:     $x_c = \eta(\tilde x^{(t)} - x^{(t-1)}) + x^{(t-1)}$;
9:   $x^{(t)} = x_p$;
10: return $x^{(K-1)}$
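Below is a minimal Python sketch of Algorithm 1 under our reading of the pseudocode; the Lasso objective $f(x, y) = \tfrac{1}{2}\|y - Ax\|^2$, the helper names `soft` and `f_o`, and the strict decrease test are our assumptions rather than the authors' exact implementation.

```python
# Sketch of Algorithm 1 (ISTA with adaptive overshoot); illustrative, not the authors' code.
import numpy as np

def ista_adaptive_overshoot(A, y, lam, gamma, K, eta0=1.0, tau=1.05):
    """At every iteration, take one ISTA step and then enlarge the step along the ISTA
    direction by forward-tracking line search while the objective keeps decreasing."""
    def soft(v, thr):                                 # soft-thresholding s_thr(.)
        return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

    def f_o(x):                                       # objective f(x, y) + lam * ||x||_1
        return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.abs(x).sum()

    x = np.zeros(A.shape[1])                          # x^(0) = 0
    for _ in range(1, K):
        # Plain ISTA update, identical to s_{lam/gamma}((I - A^T A / gamma) x + A^T y / gamma)
        x_tilde = soft(x - A.T @ (A @ x - y) / gamma, lam / gamma)
        x_p, eta = x_tilde, eta0 * tau
        x_c = tau * (x_tilde - x) + x                 # first overshoot candidate (eta = tau)
        while f_o(x_p) > f_o(x_c):                    # keep enlarging while it still helps
            x_p, eta = x_c, tau * eta
            x_c = eta * (x_tilde - x) + x
        x = x_p                                       # accept the last improving candidate
    return x
```

In line with the usual ISTA step-size condition, `gamma` can for instance be set to `1.1 * np.linalg.eigvalsh(A.T @ A).max()` so that $\gamma I \succeq A^TA$.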
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Mark Borgerding, Philip Schniter, and Sundeep Rangan. AMP-inspired deep networks for sparse linear inverse problems. IEEE Transactions on Signal Processing, 65(16):4293–4308, 2017.
Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pp. 9061–9071, 2018.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
Raja Giryes, Yonina C. Eldar, Alex M. Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 66(7):1676–1690, 2018.
Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pp. 399–406. Omnipress, 2010.
Hao He, Bo Xin, Satoshi Ikehata, and David Wipf. From Bayesian sparsity to gated recurrent nets. In Advances in Neural Information Processing Systems, pp. 5554–5564, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.
Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. ALISTA: Analytic weights are as good as learned weights in LISTA. In Proceedings of the International Conference on Learning Representations, 2019.
Thomas Moreau and Joan Bruna. Understanding trainable sparse coding via factorization. In Proceedings of the International Conference on Learning Representations, 2017.
Jeremias Sulam, Aviad Aberdam, Amir Beck, and Michael Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
Zhangyang Wang, Qing Ling, and Thomas S. Huang. Learning deep l0 encoders. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? In Advances in Neural Information Processing Systems, pp. 4340–4348, 2016.
Xu Xu, Xiaohan Wei, and Zhongfu Ye. DOA estimation based on sparse signal recovery utilizing weighted l1-norm penalty. IEEE Signal Processing Letters, 19(3):155–158, 2012.
Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
Jian Zhang and Bernard Ghanem. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1828–1837, 2018.