# stable_and_interpretable_unrolled_dictionary_learning__603d18cd.pdf

Published in Transactions on Machine Learning Research (08/2022)

Stable and Interpretable Unrolled Dictionary Learning

Bahareh Tolooshams btolooshams@seas.harvard.edu Demba Ba demba@seas.harvard.edu School of Engineering and Applied Sciences Harvard University

Reviewed on Open Review: https: // openreview. net/ forum? id= e3S0Bl2RO8

The dictionary learning problem, representing data as a combination of a few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The success of dictionary learning relies on access to a good initial estimate of the dictionary and the ability of the sparse coding step to provide an unbiased estimate of the code. The growing popularity of unrolled sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. We offer the theoretical analysis of these empirical results through PUDLE, a Provable Unrolled Dictionary LEarning method. We provide conditions on the network initialization and data distribution sufficient to recover and preserve the support of the latent code. Additionally, we address two challenges; first, the vanilla unrolled sparse coding computes a biased code estimate, and second, gradients during backpropagated learning can become unstable. We show approaches to reduce the bias of the code estimate in the forward pass, and that of the dictionary estimate in the backward pass. We propose strategies to resolve the learning instability by tuning network parameters and modifying the loss function. Overall, we highlight the impact of loss, unrolling, and backpropagation on convergence. We complement our findings through synthetic and image denoising experiments. Finally, we demonstrate PUDLE s interpretability, a driving factor in designing deep networks based on iterative optimizations, by building a mathematical relation between network weights, its output, and the training set.

1 Introduction

This paper1 considers the dictionary learning problem, namely representing data x X Rm as linear combinations of a few atoms from a dictionary D D Rm p. Given x and D, the problem of recovering the sparse (few non-zero elements) coefficients z Rp is referred to as sparse coding, and can be solved through the lasso (Tibshirani, 1996) (also known as basis pursuit (Chen et al., 2001)):

ℓx(D) := minz Rp Lx(z, D) + h(z) (1)

where Lx(z, D) = 1

2 x Dz 2 2, and h(z) = λ z 1. Specifically, the problem aims to recover a dictionary D that generates the data, i.e., x = D z (2)

where z is sparse. Olshausen and Field (Olshausen & Field, 1997) introduced (2) in computational neuroscience as a model for how early layers of the visual cortex process natural images. Sparse coding has been widely studied and utilized in the statistics (Hastie et al., 2015) and signal processing communities (Elad, 2010). A few practical examples are denoising (Elad & Aharon, 2006), super-resolution (Yang et al., 2010), text processing (Jenatton et al., 2011), and classification (Mairal et al., 2009b), where it enables the extraction

1Source code is available at https://github.com/btolooshams/stable-interpretable-unrolled-dl

Published in Transactions on Machine Learning Research (08/2022)

Figure 1: Provable unrolled dictionary learning (PUDLE): Unrolled network architecture with dictionary D.

of sparse high-dimensional features representing data. Moreover, sparse modelling is ubiquitous in many other fields such as seismic signal processing (Nose-Filho et al., 2018), radar sensing for target detections (Bajwa et al., 2011), and astrophysics for image reconstruction from interferometric data (Akiyama et al., 2017). Furthermore, Cleary et al. (2017; 2021) use this model to learn a dictionary consisting of gene modules for efficient imaging transcriptomics.

Sparse coding has been utilized to construct neural architectures through approaches such as sparse energybased models (Ranzato et al., 2007; 2008) or recurrent sparsifying encoders (Gregor & Le Cun, 2010). The latter has initiated a growing literature on constructing interpretable deep networks based on an approach referred to as algorithm unrolling (Hershey et al., 2014; Monga et al., 2019). Deep unrolled neural networks have gained popularity as inference maps in recent years due to their computational efficiency and their performance in various domains such as image denoising (Simon & Elad, 2019; Tolooshams et al., 2021a; 2020), super-resolution (Wang et al., 2015), medical imaging (Solomon et al., 2020), deblurring (Schuler et al., 2016; Li et al., 2020), radar sensing (Tolooshams et al., 2021b), and speech processing (Hershey et al., 2014).

Prior to the advent of unrolled networks, gradient-based dictionary learning relied on analytic gradients computed from the lasso given the sparse code. With unrolled networks, automatic differentiation (Baydin et al., 2018), referred to as backpropagation (Le Cun et al., 2012) in the reverse-mode, gained attention for parameter estimation (Tolooshams et al., 2018). The automatic gradient is obtained by backpropagation through the algorithm used to estimate the code. Automatic differentiation in reverse and forward-mode (Franceschi et al., 2017) is used in other areas, e.g., hyperparameter selection (Feurer & Hutter, 2019), and in a more relevant context, in the seminal work of LISTA (Gregor & Le Cun, 2010). Other works demonstrated empirically the convergence of ℓ1-based dictionary learning by backpropagation through unrolled networks (Tolooshams et al., 2021a). Given finite computational power, Tolooshams et al. (2021a) convert sparse coding into an encoder by unrolling T iterations of ISTA (Daubechies et al., 2004; Blumensath & Davies, 2008), and attach to it a linear decoder for reconstructing. Unrolled networks obtained in this manner suffer from two important limitations.

First, the sparse coding step in the forward pass computes a biased estimate of the code. This results, in turn, in a biased estimate of the backward gradient and, hence, a degradation of dictionary recovery performance. Second, as studied recently (Malézieux et al., 2022), inaccuracies in the early iterations of the unrolled network make backpropagation unstable. We address both of these shortcomings in this paper. Moreover, while Malézieux et al. (2022) analyze the gradient computed by backpropagation through unrolled sparse-coding networks, there is no known theoretical analysis of how weight updates using this gradient impact the recovery of a ground-truth code z , nor of their convergence to a ground-truth dictionary D .

This paper proposes a Provable Unrolled Dictionary LEarning (PUDLE) (Figure 1). We aim to recover D

by training the network using backpropagation with a learning rate of η. Three different choices affect the gradient: the number of unrolled iterations, the loss, and whether one backpropagates through the decoder only or through both the encoder and decoder. We highlight the impact of such choices on the convergence of the training algorithm. Backpropagation through the decoder results in the analytic gradient gdec t using the code estimate zt. The gradients gae-lasso t and gae-ls t are computed by backpropagation through the autoencoder using the lasso and least-squares objectives, respectively (Algorithm 2). We compare the gradients with the classical gradient-based alternating-minimization algorithm for dictionary learning (Chatterji & Bartlett, 2017) (i.e., cycling between sparse coding and dictionary update steps using the analytic gradient ˆg (Algorithm 1)),

Published in Transactions on Machine Learning Research (08/2022)

and provide a theoretical analysis of gradient-based recovery of the dictionary D . We provide sufficient conditions under which the gradient computation, hence the learning, is stable. Additionally, we show how using the reconstruction loss with backpropagation not only does not suffer from backpropagated instability but also ameliorates the propagation of the forward pass bias into the gradient estimate from the backward pass. Finally, we demonstrate the interpretability of the unrolled network. Our contributions are:

Unrolled sparse coding Unlike prior work (Malézieux et al., 2022) that studies the ability of sparse coding to recover the solution of the lasso (1) given the current estimate of the dictionary (we call this local estimation), we study unrolled sparse coding for recovery of the true generating code in (2) (we call this global estimation). We provide sufficient conditions on the network and data distributions such that the forward pass recovers (Theorem 4.1) and preserves (Theorem 4.2) the correct code support. Assuming support identification, we show the linear convergence of the code estimated through the unrolled iterations to the solution of the lasso (Theorem 4.3). We provide an explicit code expression at unrolled layer t and its error with respect the ground-truth code z ; we highlight the biased estimate of the code when the forward pass strictly solves lasso (Theorems 4.4 and 4.5). Moreover, in a more general scenario, we show that the error in the code estimate is upper bounded by two terms, i.e., one associated with the dictionary error and the other to the bias of the estimate of code amplitude, due to ℓ1-based optimization (Theorem 4.6). The latter highlights that vanilla lasso (ℓ1-based) sparse coding computes a biased estimate of codes, and below we discuss strategies to either alleviate this bias in the forward pass or mitigate its propagation into the backward pass for dictionary learning. Mitigation of coding bias propagation into dictionary learning We study gradient estimation for dictionary learning in PUDLE. We decompose the upper bound on the gradient errors compared to the gradient direction to recover D into terms involving the current dictionary error, the bias of the code estimate, and the lasso loss used to compute the gradient. We show that using only the reconstruction loss while backpropagating (i.e., gae-ls t ) results in the vanishing of the upper bound due to the usage of lasso loss. This means that given fixed λ, gae-ls t ameliorates the propagation of the forward pass bias into the backward pass. Specifically, we show that gae-ls t is a better estimator of the direction to recover D than gdec t and gae-lasso t . Hence, weight updates using gae-ls t converges to a closer neighbourhood of D (Theorem 4.10). In a supervised image denoising task, we show that the advantage of gae-ls t goes beyond dictionary learning; gae-ls t results in better image denoising compared to gdec t . Furthermore, our network outperforms the sparse coding scheme in NOODL, a state-of-the-art online dictionary learning algorithm (Rambhatla et al., 2018) (Table 1). Moreover, we show that the bias in the estimate of D vanishes as λt = λνt (with 0 < ν < 1) decays within the forward unrolled layers (Figure 16). This strategy, supported by Theorem 4.4, results in an unbiased estimate of the code z and recovery of D (Theorem 4.11). Stability of unrolled learning Our approach to resolve the instability issue of backpropagation in unrolled networks is two-fold. First, we show that under proper dictionary initialization, the instability of the gradient gae-lasso t computation, studied by Malézieux et al. (2022), as T increases is resolved. We give a condition under which the code support is identified and recovered after one iteration and, hence, gradient computation stays stable. Second, in the absence of support identification in early iterations, we propose to use the gradient gae-ls t which resolves the stability issue introduced by lasso loss in the backward pass. We highlight this stability through image denoising training without gradient explosion (Figure 6). Interpretable sparse codes and dictionary Prior work has discussed algorithm unrolling for designing interpretable deep architectures based on optimization models (Monga et al., 2019), or interpretability of sparse representations in dictionary learning models (Kim et al., 2010). However, there is no known work to mathematically characterize the interpretability of unrolled network architectures. In this regard, first, we construct a mathematical relation between learned weights (dictionary) at gradient convergence and the training data (Theorem 5.1). Second, we relate the inferred representation/reconstruction of test examples to the training data. We highlight several interpretable features of the unrolled dictionary learning network. Specifically, we perform analysis that provide insights into questions such as why am I learning a particular feature in the dictionary? or from what part of the training set or an image I am learning that feature? (Figure 8). Moreover, we provide an explanation of the relation between the new test image denoised/reconstructed through the network and the training dataset. The model provides insights on how training images are used to reconstruct a new test image (Figure 9) or how the test image picks up training images that have a similar representation to itself to reconstruct (Figure 10).

Published in Transactions on Machine Learning Research (08/2022)

2 Related Works

There is vast literature on the theoretical convergence of dictionary learning. Spielman et al. (2012) proposed a factorization method to recover the dictionary in the undercomplete setting (i.e., p m). Barak et al. (2015) proposed to solve dictionary learning via sum-of-squares semidefinite program. K-SVD (Aharon et al., 2006) and MOD (Engan et al., 1999) are popular greedy approaches. Alternating-minimization-based methods have been used extensively in theory and practice (Jain et al., 2013; Agarwal et al., 2014; Arora et al., 2014).

Recent work has incorporated gradient-based updates into alternating minimization (Chatterji & Bartlett, 2017; Arora et al., 2015; Rambhatla et al., 2018). Chatterji & Bartlett (2017) provided a finite sample analysis and convergence guarantees when updating the dictionary using the analytic gradient. Arora et al. (2015) proposed neurally plausible sparse coding approaches with analytic gradients. Another work focused on online dictionary learning (Mairal et al., 2009a) with an unbiased gradient updates (Rambhatla et al., 2018). Arora et al. (2015) discussed methods to reduce the bias of dictionary estimate, and Rambhatla et al. (2018) showed how to reduce bias in code and dictionary estimates. A common feature in the above-mentioned work is the use of analytic gradients, i.e., explicitly designing gradient updates independent of the sparse coding step and not utilizing automatic gradients with deep learning optimizers. A theoretical analysis of backpropagation for dictionary learning exists only for shallow autoencoders (Rangamani et al., 2018; Nguyen et al., 2019).

The theoretical analysis of unrolled neural networks has mainly analyzed the convergence speed of variants of LISTA (Gregor & Le Cun, 2010), where the focus is on sparse coding (i.e., the encoder) not dictionary learning (Sprechmann et al., 2012; Xin et al., 2016; Moreau & Bruna, 2017; Giryes et al., 2018; Chen et al., 2018; Liu & Chen, 2019; Ablin et al., 2019). Moreau & Bruna (2017) showed that upon successful factorization of the Gram matrix of the dictionary within layers, the network achieves accelerated convergence. Giryes et al. (2018) examined the tradeoffs between reconstruction accuracy and convergence speed of LISTA. Moreover,

Chen et al. (2018) studied the learning dynamics of the weights and biases of unrolled-ISTA and proved that it achieves linear convergence. Follow-up works investigated the dynamics of step size in a recursive sparse coding encoder (Liu & Chen, 2019; Ablin et al., 2019). Ablin et al. (2019) minimized the lasso through backpropagation but still assumed the knowledge of the dictionary at the decoder.

Ablin et al. (2020) compared analytic and automatic gradient estimators of min-min optimizations with smooth and differentiable functions. Moreover, Malézieux et al. (2022) studied the stability of gradient approximation in the early regime of unrolling for dictionary learning. Unlike our work, where we evaluate the gradients for model recovery, Ablin et al. (2020) and Malézieux et al. (2022) studied the asymptotic gradient errors locally in each step of an alternating minimization and did not provide errors concerning z or D .

3 Preliminaries

Given n independent samples, dictionary learning aims to minimize the empirical risk, i.e.,

min D D Rn(D) with Rn(D) 1

n Pn i=1 ℓxi(D) (3)

where limn Rn(D) = Ex X [ℓx(D)] a.s. To prevent scaling ambiguity between the code z and dictionary D, it is common to constrain the norm of the dictionary columns. Hence, we define the set of feasible solutions for the dictionary as D {D Rm p s.t. j {1, 2, . . . , p}, Dj 2 2 1}. We can project estimates of D onto the feasible set by performing Dj 1/max( Dj 2,1)Dj, either at every update or at the end of training. We assume certain properties on the data, specifically its domain (Assumption 3.1), energy (Assumption 3.2), code distribution (Assumption 3.3), and generating dictionary (Assumption 3.4).

Assumption 3.1 (Domain signals). X and D are both compact convex sets.

Assumption 3.2 (Bounded signals). M > 0 s.t. x 2 < M x X.

Assumption 3.3 (Code distribution). The code z is at most s-sparse with the support S = supp(z ). Each element in S is chosen from the set [1, p], uniformly at random without replacement. pi = P(i S ) = Θ(s/p), and pij = P(i, j S ) = Θ(s2/p2). Given the support, z S is i.i.d, has symmetric probability distribution density function, E[z (i) | i S ] = 0 and E[z (S)z T (S ) | S ] = I. Moreover, the non-zero entries of the code are sub-Gaussian and lower bounded, i.e., for i S , |z (i)| Cmin where 0 < Cmin 1.

Published in Transactions on Machine Learning Research (08/2022)

Assumption 3.4 (Generating dictionary). D is µ-incoherent (see Definition A.1) where µ = O(log (m)). D is unit-norm columns matrix ( D i 2 = 1), D 2 = O( p

p/m), and p = O(m).

To achieve model recovery using gradient descent, we assume an appropriate dictionary initialization, i.e.,

Assumption 3.5 (Dictionary closeness). The initial dictionary D(0) is (δ0, 2)-close to D (see Definition A.2). The dictionary closeness at every update is denoted by D(l) j D j 2 δl j. Furthermore, δl = O (1/ log p).

Arora et al. (2015) proposed a dictionary initialization method offering (δ, 2)-close to D for δ = O (1/ log m). The method is based on pairwise reweighting of samples {xi}n i=1 from the generative model (2), and does not require access to D . In addition, Rambhatla et al. (2018) utilize dictionary closeness assumptions and such dictionary initialization for their theoretical analysis. Moreover, Agarwal et al. (2017) proposed a clustering approach to find a close initial estimate of the dictionary.

Given the µ-incoherence of D (Assumption 3.4) and δl-closeness of the dictionary, D(l) is µl-incoherent, i.e.,

Lemma 3.1 (µl-incoherent). D(l) is µl-incoherent where µl = µ + 2 mδl.

The recurrent encoder and decoder, which perform the computations shown in Algorithm 2, use the loss L and proximal operator Pb(v) sign(v) max(|v| b, 0) for the ℓ1 norm h: Rp R. The encoder implements ISTA (Daubechies et al., 2004; Blumensath & Davies, 2008) with step size α, assumed to be less than 1/σ2 max(D). With infinite encoder unrolling, the encoder s output is the solution to the lasso (1), following the optimality condition (Lemma A.3) where we denote fx(z, D) Lx(z, D) + h(z). One immediate observation is that λ DTx {0} arg min fx(z, D). We assume λ < DTx . We specify in Theorem 4.1 and Theorem 4.2 the conditions on λ at every encoder iteration to ensure support recovery and its preservation through the encoder. In case of a constant λ across encoder iterations while using D as the dictionary (i.e., sparse coding using ℓ1 norm), the network recovers a biased code ˆz . We denote this amplitude error in the code by ˆδ ˆz z 2 which is small and goes to zero with λ decaying through the encoder.

In addition, we assume the solution to (1) is unique; sufficient conditions for uniqueness in the overcomplete case (i.e., p > m) are extensively studied in the literature (Wainwright, 2009; Candès & Plan, 2009; Tibshirani, 2013). Tibshirani (2013) discussed that the solution is unique with probability one if entries of D are drawn from a continuous probability distribution (Tibshirani, 2013) (Assumption 3.6). This assumption implies that DT S DS is full-rank. We argue that as long as the data x X are sampled from a continuous distribution, this assumption holds for the entire learning process. The preservation of this property is guaranteed at all iterations of the alternating minimization proposed in (Agarwal et al., 2014). Moreover, this assumption has been previously considered in analyses of unrolled sparse coding networks (Ablin et al., 2019; Malézieux et al., 2022) and can be extended to ℓ1-based optimization problems (Tibshirani, 2013; Rosset et al., 2004).

Assumption 3.6 (Lasso uniqueness). The entries of the dictionary D are continuously distributed. Hence, the minimizer of (1) is unique, i.e., ˆz = arg min fx(z, D) with probability one.

Lemma 3.2 states the fixed-point property of the encoder recursion (Parikh & Boyd, 2014). Given the definitions for Lipschitz and Lipschitz differentiable functions, (Definitions A.3 and A.4), the loss L and function h satisfy following Lipschitz properties.

Lemma 3.2 (Fixed-point property of lasso). Given Assumption 3.6, we have 0 1Lx(ˆz, D) + h(ˆz). The minimizer is a fixed-point of the mapping, i.e., ˆz = Pαλ(ˆz α 1Lx(ˆz, D)) = Φ(ˆz) (Parikh & Boyd, 2014).

Lemma 3.3 (Lipschitz differentiable least squares). Given Lx(z, D) = 1

2 x Dz 2 2, D, and Assumption 3.2, the loss is Lipschitz differentiable. Let L1 and L2 denote the Lipschitz constants of the first derivatives 1Lx(z, D) and 2Lx(z, D), L11 and L21 the Lipschitz constants of the second derivatives 2 11Lx(z, D) and 2 21Lx(z, D), all w.r.t z. Let 1Lx(z, D) be L1D-Lipschitz w.r.t D, and we denote the Lipschitz constant of 11Lx(z, D) and 21Lx(z, D) w.r.t to D by L11D and L21D, respectively.

Lemma 3.4 (Lipschitz proximal). Given h(z) = λ z 1, its proximal operator has bounded sub-derivative, i.e., Pλ(z) 2 cprox.

Published in Transactions on Machine Learning Research (08/2022)

Algorithm 1: Classical alternating-minimization-based dictionary learning using lasso (1).

Initialize: Samples {xi}n i=1 X, initial dictionary D(0)

Repeat: l = 0, 1, . . . , number of epochs Sparse coding step: zi(l) = arg minz Lxi(z, D(l)) + h(z), (for i [1, n]) Dictionary update: D(l+1) = D(l) ηˆg(l) where ˆg(l) 1 n Pn i=1 2Lxi(zi(l), D(l))

Algorithm 2: PUDLE: Provable unrolled dictionary learning framework.

Initialize: Samples {xi}n i=1 X, initial dictionary D(0), and z0 = 0. Repeat: l = 0, 1, . . . , number of epochs Forward pass: (for i [1, n])

Encoder: zi(l) t+1 = Φ(zi(l) t , D(l)) = Pαλ(zi(l) t α 1Lxi(zi(l) t , D(l))) (repeat for T) Decoder: ˆxi(l) = D(l)zi(l) T (6)

Backward pass: D(l+1) = D(l) ηg(l) T where g(l) T is either of

g(l) dec T 1 n Pn i=1 2Lxi(zi(l) T , D(l))

g(l) ae-lasso T 1 n Pn i=1 2Lxi(zi(l) T , D(l)) + Ji(l)+ T 1Lxi(zi(l) T , D(l)) + h(zi(l) T )

g(l) ae-ls T 1 n Pn i=1 2Lxi(zi(l) T , D(l)) + Ji(l)+ T 1Lxi(zi(l) T , D(l))

See Definition 4.1 for J+ T .

4 Unrolled Dictionary Learning

The gradients defined in PUDLE (Algorithm 2) can be compared against the local direction at each update of classical alternating-minimization (Algorithm 1). Assuming there are infinite samples, i.e.,

Best local direction : ˆg limn 1

n Pn i=1 2Lxi(ˆzi, D) = Ex X [ 2Lx(ˆz, D)] (4)

where ˆz = arg minz Rp Lx(z, D) + h(z). Additionally, to assess the estimators for model recovery, hence dictionary learning, we compare them against gradient pointing towards D , namely

Desired global gradient for D : g limn 1

n Pn i=1 2Lxi(zi , D) = Ex X [ 2Lx(z , D)]. (5)

To see why the above is the desired direction, (z , D ) is a critical point of the loss L which reaches zero for data following the model (2). Hence, to reach D arg min D D Ex X [Lx(z , D)], we move towards the direction minimizing the loss in expectation. Specifically, using the gradient 2Lx(z , D) = (x Dz )z T = (D D )z z T as a descent direction, we move from D toward D modulo the code presence matrix z z T. Given these directions, we analyze the error of the gradients gdec t , gae-lasso t , and gae-ls t assuming infinite samples. In local analysis, we compare the code and gradient estimates to the lasso optimization in each update of the alternating minimization. In global analysis, we evaluate the performance in recovery of the ground-truth code z and the dictionary D . In this regard, we first study the forward pass.

4.1 Forward pass

We show convergence results in the forward pass for z and the Jacobian, i.e.,

Definition 4.1 (Code Jacobian). Given D, the Jacobian of zt is defined as Jt zt

D with adjoint J+ t .

The forward pass analyses give upper bounds on the error between zt and ˆz and the error between Jt and ˆ J as a function of unrolled iterations t. We define ˆ J as following: considering the function z Lx(z, D)+h(z), ˆz(D) is its minimizer and ˆ J = ˆz(D)

D . We will require these errors in Section 4.2, where we analyze the gradient estimation errors. Similar to (Chatterji & Bartlett, 2017), the error associated with gdec t depends on

Published in Transactions on Machine Learning Research (08/2022)

the code convergence. Unlike gdec t , the convergence of backpropagation with gradient estimates gae-lasso t and gae-ls t relies on the convergence properties of the code and the Jacobian (Ablin et al., 2020). Forward-pass theories are based on studies by Gilbert (1992) on the convergence of variables and their derivatives in an iterative process governed by a smooth operator (Gilbert, 1992). Moreover, Hale et al. (2007) studied the convergence analysis of fixed point iterations for ℓ1 regularized optimization problems (Hale et al., 2007).

Support recovery and preservation We re-state a result from (Hale et al., 2007) on support selection.

Proposition 4.1 (Finite-iteration support selection). Given Assumption 3.6, let ˆz = arg min fx(z, D) with S supp(ˆz). There exists a B > 0 such that supp(zt) = S, t > B.

This means the unrolled encoder identifies the support in finite iterations. Support recovery in finite iterations has been studied in the literature for LISTA (Chen et al., 2018), Step-LISTA (Ablin et al., 2019), and shallow autoencoders (Arora et al., 2015; Rangamani et al., 2018; Nguyen et al., 2019; Tolooshams et al., 2020). We show that under proper initialization of the dictionary, the encoder achieves linear convergence. Arora et al. (2015) discussed some appropriate initialization which is used by Rambhatla et al. (2018). Given initial closeness δ0, the encoder selects and recovers the correct signed support of the code with high probability in one iteration B = 1 (Theorem 4.1), and the iterations preserve the correct support (Theorem 4.2). In spite of slow convergence of ISTA Liang et al. (2014), support recovery after one iteration in unrolled networks is studied in the literature (Arora et al., 2015; Rambhatla et al., 2018; Chen et al., 2018; Nguyen et al., 2019).

Theorem 4.1 (Forward pass support recovery). Given Assumptions 3.3 and 3.4, suppose D(l) is δl = O (1/ log p) close to D . If s = O ( m/µ log m), and µ = O(log m), then with probability of at least 1 ϵ(l) supp-rec, the choice of λ0 = Cmin/4 recovers the support of the code z in one encoder iteration, i.e., sign(Sαλ0(αD(l)Tx) = sign(z ), where ϵ(l) supp-rec = 2p exp ( C2 min O (δ2 l )).

Theorem 4.2 (Forward pass support preservation). Given Assumptions 3.3 and 3.4, suppose D(l) is δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer and step size are

chosen such that λ(l) t = µl m z zt 1 + aγ = Ω( s log m m ) and α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 , then with probability

of at least 1 ϵ(l) supp-pres, the support, recovered at the first iteration, is preserved through the encoder iterations. We have aγ = O( sδl) and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

The support preservation conditions on λt and α introduce two insights. First, with an increase of t, the code error decrease, hence the lower bound on λt. Second, the decay of λt as the encoder unrolls increases the upper bound on α. Hence, we suggest a decaying strategy in values of λt as t increases.

The utilization of knowledge of the code error, as we do in Theorem 4.2, to set the proper thresholding/bias/regularization parameters (λ) constitutes a fairly standard practice. Below we discuss similar results in the literature. For the preservation of correct signed-support in a sparse coding network, Rambhatla et al. (2018) provided a proper thresholding value at every iteration as a function of the ℓ1-norm of the code error with respect to a ground-truth code; they additionally demonstrated an upper bound on the estimate of the code coefficients as a function of dictionary closeness. Moreover, Nguyen et al. (2019) used information on the range of ground-truth code to choose proper biases in their neural network to guarantee support recovery. Chen et al. (2018) similarly provided an upper bound on the bias of their unrolled sparse coding network at every layer as a function of ℓ2-norm error between the code estimate at the layer and the ground-truth code. Overall, the error between a code estimate and the ground-truth code appearing in the lower bound on λ(l) t can further simplified into terms related to terms such as the dictionary closeness δl, code sparsity. For example, Chatterji & Bartlett (2017), for their particular sparse coding algorithm, provided ℓ -norm upper bound as a function of terms such as code sparsity, data dimensionality, code range, and dictionary error.

Code convergence and error Given the support recovery and its preservation, the encoder convergence studied in (Malézieux et al., 2022) can achieve linear convergences after its first iteration. We re-state this result on the rate of convergence of the encoder in Theorem 4.3. We drop the superscript (l) to simplify the notation.

Published in Transactions on Machine Learning Research (08/2022)

0 50 100 Number of unfolding [t]

Figure 2: Code convergence (Theorem 4.3). As the network unrolls, zt converges to ˆz, the solution of lasso.

Theorem 4.3 (Local forward pass code convergence). Given the encoder zt+1 = Φ(zt, D), Assumption 3.6, Lemmas 3.2, A.1 and A.2, then ρ < 1, B > 0 s.t. zt ˆz 2 O(ρt) t > B, where ˆz is the unique minimizer of lasso (1). Furthermore, given Theorem 4.1 and Theorem 4.2, B = 1.

Theorem 4.3 shows that in PUDLE, zt converges to ˆz at a linear rate eventually after a certain number of unrolling (Figure 2). The local linear convergence of ISTA and FISTA (Beck & Teboulle, 2009) (with global rates of O(1/t) and O(1/t2)) in the neighbourhood of a fixed-point is studied in (Tao et al., 2016). The speed of convergence depends on when support selection happens (Proposition 4.1) (Bredies & Lorenz, 2008; Zhang et al., 2017b; Liang et al., 2014). We showed in Theorem 4.1 and Theorem 4.2 that under mild assumptions, the support is selected and recovered after one encoder iteration. In addition to local convergence, we focus on recovery of z and show error on the unrolled code coefficients z(l) t,(j) with respect to ground-truth z (j) as t increases. In Theorem 4.4, we consider the case where λt at layer t is set to according to Theorem 4.2; the bias decreases as the code error decreases among the layers and dictionary updates. We provide an upper bound on the coefficients errors as a function of code sparsity, dictionary error, and an unrolling error e(l)unroll t,j . The unrolling error goes to zero for appropriately large t. Moreover, Theorem 4.5 studies the case where the bias is fixed across the layers. In this scenario, we observe an additional term of λfixed in the upper bounds on the code coefficients error; this term shows that the code error when we strictly perform ℓ1-norm based sparse coding does not go to zero. We refer to this error as an amplitude bias estimate error. Theorem 4.4 (Global forward pass code error with variable λt). Given Assumptions 3.3 and 3.4, suppose D(l)

is µl-incoherent and δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer

and step size are chosen such that λ(l) t = µl m z zt 1 + aγ = Ω( s log m m ) and α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 ,

then with probability of at least 1 ϵ(l) supp-pres, for j supp(z ), the code coefficient error is

|z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + e(l)unroll t,j ) (8)

and z T,(j) = z (j)(1 β(l) j ) + ζ(l) T,j (9)

where e(l)unroll t,j := 2(s 1)tα µl m maxi |z(l) 0,(i) z (i)|δα,t 1 + |z(l) 0,(j) z (j)|δα,t, δα,t := (1 α + 2α µl m)t,

|ζ(l) T,j| = O(aγ) with aγ = O( sδl), β(l) j = D j D(l) j , D j δ2 l 2 and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ =

2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)). With appropriately large t, |z(l) t,(j) z (j)| = O( q

s D(l) j D j 2).

Theorem 4.5 (Global forward pass code error with fixed λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µl-incoherent and δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer and step size are chosen such that λ(l) t = λfixed = µl m z z0 1 + aγ = Ω( s log m m ) and

α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 , then with probability of at least 1 ϵ(l) supp-pres, for j supp(z ), the code coefficient

error is |z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + e(l)unroll,fixed t,j + λfixed) (10)

and z T,(j) = z (j)(1 β(l) j ) + ζ(l) T,j (11)

where e(l)unroll, fixed t,j := (s 1)tα µl m maxi |z(l) 0,(i) z (i)|δfixed α,t 1 + |z(l) 0,(j) z (j)|δfixed α,t , δfixed α,t := (1 α + α µl m)t,

|ζ(l) T,j| = O(aγ + λfixed) with aγ = O( sδl), β(l) j = D j D(l) j , D j δ2 l 2 , and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ =

2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)). With appropriately large t, |z(l) t,(j) z (j)| = O( q

s D(l) j D j 2 + λfixed).

Published in Transactions on Machine Learning Research (08/2022)

Aside from code estimation where the network parameters (e.g., regularization and step size) are finely tuned according to support recovery and preservation conditions (Theorems 4.1 and 4.2), we provide a general upper bound on the error between the converged code and z ; the bound can be decomposed into two terms of the dictionary error and the biased amplitude estimate of the code.

Theorem 4.6 (Global forward pass code error). Let ˆz be the fixed-point of the encoder with iterations zt+1 = Φ(zt, D). Given Assumption 3.6, Lemmas 3.2, A.1 and A.2, we have ˆz z 2 O( D D 2 + ˆδ ), where ˆδ = ˆz z 2, ˆz is the unique minimizer of lasso (1) given the dictionary D, ˆz is the unique minimizer of lasso (1) given the dictionary D , and z is the ground-truth code.

This general decomposition is to emphasize that aside from the current estimate of the dictionary, the code error is a function of the forward pass algorithm used to solve the sparse coding problem. Specifically, the upper bound states that at the best scenario where there is access to the generating dictionary D , the forward pass solving lasso with fixed λ still gives a biased amplitude estimate of z . Overall, the assumptions to get this bound are mild; the bound is valid independent of successful support recovery or data distribution. With incorporation of data distribution and conditions stated in Theorem 4.1 and Theorem 4.2, the upper bound ˆδ can be replaced with terms involving λ, and reaches at zero as λ decays across forward iterations.

Jacobian convergence and error Following properties similar to those used in Theorem 4.3, and assuming Jt is bounded (Assumption 4.1), we show in Theorem 4.7 that, as the PUDLE unrolls, the code Jacobian Jt converges to ˆ J, the Jacobian of the solution of the lasso. The convergence of the Jacobian of proximal gradient descent is also studied in (Bertrand et al., 2021) for hyperparameter selection through implicit differentiation (Bengio, 2000), where the Jacobian is taken w.r.t to the hyperparameter λ as opposed to D.

Assumption 4.1 (Bounded Jacobian). The Jacobian is bounded, i.e., MJ > 0, s.t. Jt 2 MJ t.

Theorem 4.7 (Local forward pass Jacobian convergence). Given the recursion zt+1 = Φ(zt, D), and ˆz the unique minimizer of lasso with Jacobian ˆ J, then ρ < 1, B > 0 s.t. Jt ˆ J 2 O(tρt) t > B. Furthermore, given Theorem 4.1 and Theorem 4.2, B = 1.

The forward pass code and Jacobian convergences after support selection is similar to the results from (Malézieux et al., 2022). The highlights of our finding are that the order of upper bound convergences can be achieved from the first iteration of the encoder. In other words, we specify, in Theorem 4.1 and Theorem 4.2, the dictionary and data conditions such that the support can be recovered with B = 1. This resolves the instability issue discussed by Malézieux et al. (2022) in computation of the gradient gae-lasso t outside of the support. Finally, we show that the global Jacobian error is in the order of dictionary error.

Theorem 4.8 (Global forward pass Jacobian error). Let ˆz be the fixed-point of the encoder with iterations zt+1 = Φ(zt, D). Given Assumption 3.6, Lemmas 3.2, A.1 and A.2, we have ˆ J J 2 O( D D 2+ˆδ J), where ˆδ J := ˆ J J 2, and ˆ J, ˆ J and J are Jacobians corresponding to ˆz, ˆz and z .

4.2 Backward pass

0 50 100 Number of unfolding [t]

gdec t gae lasso t gae ls t

Figure 3: Convergence rate of gradients (Theorem 4.9).

We show two results for local gradient ˆg and global gradient g

convergence. The goal is not to provide a finite sample analysis but to emphasize the relative differences between the gradients in Algorithm 2. The impact of gradient error for parameter estimation in the convex setting has been studied by Devolder et al. (2013) indicating that the convergence to the parameter s neighbourhood is dictated by the gradient error (Devolder et al., 2013; 2014). As dictionary learning is a bi-convex problem, findings of Devolder et al. (2013) hold as well for better estimation of the local dictionary at every step of alternating minimization. Moreover, Arora et al. (2015), provided a detailed analysis of sparse coding and various gradient estimations for dictionary learning, showing that by computing a more accurate gradient at every step of the alternating minimization scheme, the dictionary estimates

Published in Transactions on Machine Learning Research (08/2022)

converge to a closer neighbourhood of D . Overall, the intuition is that the size of the gradient error dictates the size of the neighbourhood of the dictionary within which one can guarantee convergence. We argue that the method with lower gradient error recovers the dictionary better.

Local gradient estimations We highlight the effect of finite unrolling on the gradient for parameter estimation (Ablin et al., 2020). Theorem 4.9 shows the convergence rate of gradients to ˆg, determining the similarity of PUDLE and Algorithm 1.

Theorem 4.9 (Local convergence of gradients). Given the forward pass convergence results (Theorems 4.3 and 4.7), ρ < 1, B > 0 such that t > B, the errors of gradients defined in Algorithm 2 w.r.t ˆg (4) satisfy

gdec t ˆg 2 O(ρt)

gae-lasso t ˆg 2 O(tρ2t)

gae-ls t ˆg 2 O(tρ2t + MJλ s).

Moreover, the order of upper bounds is tight (see Lemma A.4).

First, upper bounds on the errors related to gdec t and gae-lasso t go to zero as t increases. Hence, both gradients converge to ˆg. This means that asymptotically as t increases, training PUDLE with gdec t and gae-lasso t is equivalent to classical alternating-minimization (Algorithm 1). Second, as t increases, gae-lasso t has faster convergence than gdec t . Lastly, gae-ls t is a biased estimator of ˆg (Figure 3). The convergence results on the error gae-lasso t ˆg 2 is previously studied by Malézieux et al. (2022).

Given the above convergence results, one may conclude that gae-lasso t should be used for dictionary recovery. However, we show next that for dictionary recovery, the gradient gae-lasso t , used by Malézieux et al. (2022), is indeed a biased estimator of the global gradient g for recovery of D . We decrease this bias by replacing gae-lasso t with gae-ls t and show that gae-ls t results in a better recovery of D than gae-lasso t .

Global gradient estimations Theorem 4.10 shows the global gradient errors w.r.t g from (5). We omit the gradient gdec t , as it is asymptotically equivalent to gae-lasso t . We study the errors in the limit to unrolling, i.e., as t . This determines which PUDLE gradients recover D better (Devolder et al., 2013; 2014).

Theorem 4.10 (Global error of gradients). Given the convergence results from the forward pass, (Theorems 4.6 and 4.8), the errors of gradients defined in Algorithm 2 w.r.t global direction g (defined in (5)) satisfy

gae-lasso g 2 O( D D 2 2 + D D 2 + ˆδ + ˆδ J + MJλ s)

gae-ls g 2 O( D D 2 2 + D D 2 + ˆδ + ˆδ J). (13)

Several factors affect the order of upper bounds: the current estimate of the dictionary, code amplitude-bias error due to ℓ1 norm, and the usage of ℓ1 norm in the loss used for backpropagation. To study the bias in the gradient computation, let consider the scenario where D = D . We denote those gradients by superscript D . If the gradients are not biased, then the upper bounds should goes to zero. The gradient errors are

gae-lasso,D g 2 O(ˆδ + ˆδ J + MJλ s) and gae-ls,D g 2 O(ˆδ + ˆδ J). (14)

For gae-ls,D , the radius of the error ball is only a function of the amplitude error of the code estimated through lasso compare to the ground-truth code z . However, the error ball for the gradient gae-lasso,D includes an additional term concerning the usage of lasso loss containing the regularization term λ. This implies that the D neighbourhood at which the gradient gae-ls,D is guaranteed to converge to is smaller than of the gae-lasso,D (Figure 4a). Implications of such gradient estimation are seen in dictionary learning where gae-ls recovers D better (Figures 4b and 4c). In Figure 4b, the encoder unrolls for T = 25, hence the phenomenon of implicit acceleration is seen in faster and better dictionary learning performance of gae-lasso than gdec . In Figure 4c where T = 100, similar performance of gdec and gae-lasso illustrates their asymptotic equivalence as t (See Appendix for additional noisy dictionary learning experiments where the measurements x are corrupted with zero-mean Gaussian noise such that the Signal-to-Noise-Ratio is approximately 12 SNR; in this setting, the aforementioned comparative analysis still holds.)

Published in Transactions on Machine Learning Research (08/2022)

0 50 100 Number of unfolding [t]

(a) Convergence for g .

0 250 500 Number of epochs

||D D ||2 ||D ||2

(b) Learning (T = 25).

0 250 500 Number of epochs

||D D ||2 ||D ||2

gdec t gae lasso t gae ls t

(c) Learning (T = 100).

Figure 4: Results for PUDLE s global convergence (Theorem 4.10) and dictionary learning.

Towards unbiased estimation As long as λ is fixed within PUDLE, all defined gradients remain biased estimators of g , due to the biased estimate of the code z through ℓ1 norm. This bias exists while dictionary learning is performed strictly using lasso through Algorithm 1. Given the conditions on the regularizer in Theorem 4.2 which we discussed in Section 4.1 and the derived upper bounds in Theorem 4.10, we suggest the decaying of λ across the encoder to reduce the gradient biases and improve dictionary learning. Next, we prove in Theorem 4.11 that PUDLE converges to D if λ decays across the layers t according to Theorem 4.4. Moreover, Theorem 4.12 proves that if λ stays fixed according to Theorem 4.5, then PUDLE only guarantees to converge to a close neighbourhood of the dictionary. In these analyses, we focus on gdec T . Furthermore, we show in Section 4.3 that by decaying λ at each unrolled layer, the gradient bias vanishes, and we recover D .

Dictionary learning Given the network parameters set by Theorem 4.4, Theorem 4.11 proves that using gdec T , PUDLE recovers the dictionary; the dictionary error contracts at every update. Moreover, Theorem 4.12 proves that as long as λ stays fixed across the unrolled layers, PUDLE guarantees to converge to only D

neighbourhood characterized by the regularization parameter λ. These analyses requires for D(l) to maintain a closeness to D which we provide a proof for in Lemma A.7. Hence, the dictionary closeness assumption (Assumption 3.5) stays valid.

Theorem 4.11 (Dictionary learning with variable λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µl-incoherent and (δl, 2)-close to D with δl = O (1/ log p). If s = O( m), µ = O(log m), learning rate is η = O( p s(1 δ2 l /2)), and the regularizer and step size are set according to Theorem 4.4, then for any dictionary

update l using gdec T , with probability of at least 1 ϵ(l) supp-pres,

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 (15)

where ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

Theorem 4.12 (Dictionary learning with fixed λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µlincoherent and (δl, 2)-close to D with δl = O (1/ log p). If s = O( m), µ = O(log m), learning rate is η = O( p s(1 δ2 l /2)), and the regularizer λfixed and step size are set according to Theorem 4.5, then for any

dictionary update l using gdec T , with probability of at least 1 ϵ(l) supp-pres,

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 + ϵ(l) λ (16)

where ϵ(l) λ := η 2p s(1 β(l) j )λfixed2, ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

4.3 Experiments

Dictionary learning We focus on the performance of the best-performing gradient estimator gae-ls t , and compare it with NOODL (Rambhatla et al., 2018), a state-of-the-art online dictionary learning algorithm, and SPORCO (Wohlberg, 2017), an alternating-minimization dictionary learning algorithm that uses lasso. NOODL, which uses iterative hard-thresholding (HT) for sparse coding and a gradient update

Published in Transactions on Machine Learning Research (08/2022)

employing the code s sign, has linear convergence upon proper initialization (Rambhatla et al., 2018). We note that the results from gae-lasso t are not shown, as the gradient computation was unstable (Malézieux et al., 2022). We emphasize that our proposed gradient gae-ls t does not suffer such instability. We train:

0 250 500 750 Number of iterations

||D D ||2 ||D ||2

gae ls t gae ls,decay t gae ls,HT t NOODL SPORCO

Figure 5: Dictionary convergences.

gae-ls t : λ is fixed across iterations. gae-ls, decay t : λ decays (i.e., λt = λνt, with 0 < ν < 1) where ν decreases as training progresses. gae-ls, HT t : Pαλ(v) is replaced with HTb(v) v1|v| b.

With HT, the sparse coding step reduces to that from NOODL. In this case, we highlight the difference between the gradient update of our method (backpropagation) with NOODL. We focus on convergence, as η across methods is not comparable.

Figure 5 shows the convergence of D R1000 1500 to D when the code is 20-sparse (for other sparsity levels and details see Appendix C). A biased estimate of the code amplitudes results in convergence only to a neighbourhood of the dictionary (Rambhatla et al., 2018). This is observed in the convergence of gae-ls t and SPORCO (final error is shown). The convergence of gae-ls t to a closer neighbourhood than SPORCO supports Theorem 4.10. Moreover, with decaying λ, the code bias vanishes, hence gae-ls, decay t and gae-ls, HT t converges to D similar to NOODL.

0 100 200 300 400 Number of epochs

gdec t gae lasso t gae ls t gae ls, HT t

(a) gae-lasso t training is not stable.

0 100 200 300 400 Number of epochs

(b) gae-ls t improvement is smoother than gdec t .

Figure 6: Networks behaviour (test PSNR) during training as a function of epochs.

Image denoising To further highlight the advantage of gae-ls t over the other gradients, we compare them in a supervised task of image denoising. In addition to gae-ls t , gae-lasso t , and gdec t , we consider gae-ls, HT t where the proximal operator is replaced with HT. This is to compare with sparse coding scheme of NOODL. We do not compare against NOODL s dictionary update, as this computation for two-dimensional convolutions is not straightforward. Prior works have shown that variants of PUDLE either rival or outperform state-of-the-art architectures (Simon & Elad, 2019; Tolooshams et al., 2020). Thus, we focus on a comparative analysis of the gradients. We trained on 432 and tested on 68 images from BSD (Martin et al., 2001). BSD dataset is a popular training dataset for denoising (Zhang et al., 2017a; Simon & Elad, 2019; Mohan et al., 2019). We used a convolutional dictionary and corrupted images with zero-mean Gaussian noise of standard deviation of 25 (see Appendix C for details). We initialized the dictionary filters by standard Normal distribution; this is to follow the norm in the deep learning literature and to demonstrate the practicality and usefulness of PUDLE in the absence of an initialization method. We evaluate the denoising performance of soft-thresholding using λ and HT with b in peak signal-to-noise-ratio (PSNR).

Published in Transactions on Machine Learning Research (08/2022)

First, we highlight the stability of gae-ls t against gae-lasso t ; Figure 6a shows the network dynamics in terms of test PSNR as a function of epochs when λ = 0.16 for gae-ls t , gae-lasso t , gdec t and b = 0.05 for gae-ls, HT t . We observed that gae-ls t uses full backpropagation and stays stable. However, the training with gae-lasso t is not stable and unstable to perform denoising where the noisy PSNR is approximately 20 d B (Malézieux et al., 2022). Second, Figure 6b shows that compared to gdec t , the backpropagated gradients result in a smoother improvement during training. Moreover, Table 1 shows that the advantage of gae-ls t over gdec t is not limited to dictionary learning and is seen in denoising. We have excluded the results for gae-lasso t from Table 1 as the network failed to denoise (see Figure 6a). Additionally, the superior performance of gae-ls t compared to gae-ls, HT t highlights the benefits of PUDLE (i.e., ℓ1-based unrolling) against HT used in NOODL.

Table 1: Denoising of BSD68. Reported numbers are mean (std) PSNR given three independent trials.

METHOD PSNR [d B]

λ 0.08 0.12 0.16 0.2

gdec t 24.21 (0.12) 24.93 (0.14) 25.25 (0.06) 24.88 (0.00) gae-ls t 24.79 (0.03) 25.43 (0.03) 25.63 (0.04) 25.46 (0.05)

b 0.02 0.05 0.08 0.1

gae-ls, HT t 22.92 (0.07) 25.26 (0.1) 24.76 (0.06) 23.94 (0.13)

5 Interpretable Sparse Codes and Dictionary

One motivation behind using algorithm unrolling to design deep architectures is interpretability (Monga et al., 2019); they argue that the designed networks are interpretable as they capture domain knowledge via an optimization model. For example, Tolooshams et al. (2021a) takes advantage of the interpretability of learned weights in an unrolled dictionary learning network to solve spike sorting, an unsupervised source separation problem in computational neuroscience. Moreover, Kim et al. (2010) uses sparse coding to learn interpretable representations of human motions. However, none of the existing methods in the literature provide interpretability results that open the black-box network through building a mathematical relation between the learned dictionary, training data, and test representation/reconstruction. This section analyzes the interpretability of the unrolled sparse coding method in this context. We note that such mathematical relation and interpretability results also hold for dictionary learning. However, it is missing in the literature, irrespective of whether one uses an unrolling network. We provide the following theorem.

Figure 7: Fraction of dictionary atoms learned from {0, 1, 2, 3, 4} MNIST.

Theorem 5.1 (Interpretable unrolled network). Consider the dictionary learning optimization of the form min Z,D 1 2 X DZ 2 F + λ Z 1 + ω/2 D 2 F , where X = [x1, x2, . . . , xn] Rm n and Z = [z1, z2, . . . , zn] Rp n. Let Z be the given converged sparse codes, then stationary points of the problem w.r.t the network weights (dictionary) follows D = XG 1 ZT, where we denote G := ( ZT Z + ωI)la.

The dictionary interpolates the training data Given Theorem 5.1, each learned atom interpolates the training data, i.e.,

Dj = X(G 1wj) =

k=1 (G 1wj)kxk (17)

where wj = [ z1 j , z2 j , . . . , zn j ]T Rn is a vector containing the training code activity for dictionary atom j. Specifically, the importance of training image xk in learning dictionary atom j is captured by the term (G 1wj)k. This proves the dictionary lives in the spans of the

Published in Transactions on Machine Learning Research (08/2022)

training set. Given the small number of atoms compared to the training size, (17) shows that the dictionary summarizes the training examples. We trained the network on digits of {0, 1, 2, 3, 4} MNIST (Figure 7 shows a fraction of the most used learned atoms). Figure 8 visualizes dictionary atoms along with training images with the highest contribution (green) and the lowest contribution (red). In addition, we used (17) on the partial training data to reconstruct learned atoms (shown as Estimate). Next, we interpret the relation between a new data to the training data using representer point selection, similar to (Yeh et al., 2018).

(a) 0 looking like atom.

(b) 4 looking like atom.

(c) 3 looking like atom.

(d) 1 looking like atom.

Figure 8: Training image contributions to learning the dictionary.

(a) 0 test image.

(b) 1 test image.

(c) 3 test image.

(d) 4 test image.

Figure 9: Interpolation of training data to reconstruct a new image. Contribution of training images are shown in green (high contribution) and red (low contribution). βj is normalized over the used examples.

Relation between new test image and training data For representation of a new data, we observe that the reconstruction of a new example xj is a linear combination of all the training examples, i.e.,

ˆxj = Dˆzj = Xβj =

k=1 βj kxk (18)

where ˆxj denotes reconstruction, ˆzj is the code estimate, βj = G 1 ZT ˆzj Rn, and βj k = Pn a=1 G 1 ka za, ˆzj . We observe that the contribution of image k into the reconstruction of the test image is a function of βj k, and the energy of βj k itself depends on the whole training set, and G 1. (18) shows how each image is reconstructed as interpolation of the training images. Figure 9 shows this results, where images with high (green) βj k contribution are similar to the test image and those with low (red) βj k contribution are different. In addition, we can evaluate the overall quality of the reconstruction by looking into βj k in (18). For example, we observed that for test MNIST, unnormalized βj k corresponding to high contributing training images is above 1. However, for resized-CIFAR, unnormalized βj k of high contributing training images are often half or an order of magnitude lower than the MNIST case. This informs us of a bad representation/reconstruction of CIFAR image by the trained network. From another perspective, we can write the new image as

xj = Dˆzj =

k=1 (XG 1)k zk, ˆzj (19)

i.e., the contribution of each training image for reconstruction is a function of their code similarity to the new image and properties of the Gram matrix of training set code similarities. Specifically, the relation rules the contribution of transformed image k (i.e., (XG 1)k) into reconstruction of the test image as a function of its code similarity zk, ˆzj . In other words, (19) shows that training images with the highest code similarity to the representation of the new image have the highest contribution to its reconstruction. This interpretation is demonstrated in Figure 10. The training images with the highest code similarity (green) and the lowest

Published in Transactions on Machine Learning Research (08/2022)

0 1 Similarity

0 1 Similarity

0 1 Similarity

0 1 Similarity

0 1 Similarity

(a) Digit 1 test image.

0 1 Similarity

0 1 Similarity

0 1 Similarity

0 1 Similarity

0 1 Similarity

(b) Digit 2 test image.

Figure 10: Contribution of images with code similarity into reconstruction of a new test image along with the histograms of the similarity of the test code to training codes from each class.

similarity (red) are shown. In addition, the figure demonstrates the histogram of the code similarity between the test image and the training set, grouped by their class digit. For example, for digit 1 test image, its code similarity to train images from class 1 are bimodal. This corresponds to 1 digits that are tilted to the left (low similarity) and right (high similarity). Moreover, for digit 2 test image, we observe that the histogram of images corresponding to digit 2 are shifted the most to the right (highest similarity) than the other classes.

6 Conclusions

This paper studied dictionary learning and analyzed the dynamics of unrolled sparse coding networks through a provable unrolled dictionary learning (PUDLE) framework. First, we provided a theoretical analysis of the forward pass for code recovery and dictionary learning. We discussed the bias introduced by ℓ1-based sparse coding in the forward pass, and how this affects the dictionary estimate in the backward pass. Second, we showed strategies to mitigate the propagation of this code bias into the backward pass; this is achieved by modification of the training loss function. We demonstrated that this bias could be further reduced and eliminated by decaying the regularization parameter within the unrolled layers. Additionally, we provided sufficient conditions on the data distribution and network to guarantee stability of backpropagated gradient computations. In the absence of such conditions, we proposed a modification to the loss function that resolves the gradient explosion and allows stable learning. In an image denoising task, we showed PUDLE outperforms the NOODL sparse coding scheme (Rambhatla et al., 2018). Motivated by interpretability as a popular feature for unrolled networks, we derived a mathematical relation between the network weights (dictionary) and the training set. We proved that the network weights live in the span of the training set, and constructed a relation between predictions of new input examples and the training set. The latter allows the user to extract images from the training set that are similar/dissimilar to the input image in representation/reconstruction.

Acknowledgments

Demba Ba and Bahareh Tolooshams acknowledge the partial support of the ARO under grant number W911NF-16-1-0368. This is part of a collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. The authors would also like to thank the reviewers for their feedback, led to the improvement of the manuscript.

Pierre Ablin, Thomas Moreau, Mathurin Massias, and Alexandre Gramfort. Learning step sizes for unfolded sparse coding. In Proceedings of Advances in Neural Information Processing Systems, volume 32, pp. 1 11, 2019.

Pierre Ablin, Gabriel Peyré, and Thomas Moreau. Super-efficiency of automatic differentiation for functions defined as a minimum. In Proceedings of International Conference on Machine Learning, pp. 32 41. PMLR, 2020.

Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári (eds.), Proc the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pp. 123 137, Barcelona, Spain, 13 15 Jun 2014. PMLR.

Published in Transactions on Machine Learning Research (08/2022)

Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. A clustering approach to learning sparsely used overcomplete dictionaries. IEEE Transactions on Information Theory, 63(1):575 592, 2017. doi: 10.1109/TIT.2016.2614684.

M. Aharon, M. Elad, and A. Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311 4322, 2006.

Kazunori Akiyama, Kazuki Kuramochi, Shiro Ikeda, Vincent L Fish, Fumie Tazaki, Mareki Honma, Sheperd S Doeleman, Avery E Broderick, Jason Dexter, Monika Mościbrodzka, et al. Imaging the schwarzschildradius-scale structure of m87 with the event horizon telescope using sparse modeling. The Astrophysical Journal, 838(1):1, 2017.

Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári (eds.), Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pp. 779 806, Barcelona, Spain, 13 15 Jun 2014. PMLR.

Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Peter Grünwald, Elad Hazan, and Satyen Kale (eds.), Proceedings of Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 113 149, Paris, France, 03 06 Jul 2015. PMLR.

Hedy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1):5 16, 2009.

Waheed U. Bajwa, Kfir Gedalyahu, and Yonina C. Eldar. Identification of parametric underspread linear systems and super-resolution radar. IEEE Transactions on Signal Processing, 59(6):2548 2561, 2011.

Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of Annual ACM Symposium on Theory of Computing, STOC 15, pp. 143 151, 2015.

Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of machine learning research, 18, 2018.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009.

Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889 1900, 2000.

Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, and Joseph Salmon. Implicit differentiation for fast hyperparameter selection in non-smooth convex learning. ar Xiv:2105.01637, 2021.

Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier analysis and Applications, 14(5-6):629 654, 2008.

Kristian Bredies and Dirk A Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5-6):813 837, 2008.

Emmanuel J. Candès and Yaniv Plan. Near-ideal model selection by ℓ1 minimization. The Annals of Statistics, 37(5A):2145 2177, 2009.

Niladri S Chatterji and Peter L Bartlett. Alternating minimization for dictionary learning: Local convergence guarantees. ar Xiv:1711.03634, pp. 1 26, 2017.

Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM review, 43(1):129 159, 2001.

Published in Transactions on Machine Learning Research (08/2022)

Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. In Proceedings of Advances in Neural Information Processing Systems, volume 31, pp. 1 11, 2018.

Brian Cleary, Le Cong, Anthea Cheung, Eric S. Lander, and Aviv Regev. Efficient generation of transcriptomic profiles by random composite measurements. Cell, 171(6):1424 1436.e18, 2017. ISSN 0092-8674.

Brian Cleary, Brooke Simonton, Jon Bezney, Evan Murray, Shahul Alam, Anubhav Sinha, Ehsan Habibi, Jamie Marshall, Eric S Lander, Fei Chen, et al. Compressed sensing for highly efficient imaging transcriptomics. Nature Biotechnology, pp. 1 7, 2021.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413 1457, 2004.

Olivier Devolder, François Glineur, Yurii Nesterov, et al. First-order methods with inexact oracle: the strongly convex case. Technical report, UniversitÃ catholique de Louvain, Center for Operations Research and . . . , 2013.

Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1):37 75, 2014.

Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing. Springer Science & Business Media, 2010.

Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736 3745, 2006.

K. Engan, S.O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pp. 2443 2446 vol.5, 1999.

Matthias Feurer and Frank Hutter. Hyperparameter optimization. In Automated Machine Learning, pp. 3 33. Springer, Cham, 2019.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradientbased hyperparameter optimization. In Proceedings of International Conference on Machine Learning, pp. 1165 1173, 2017.

Jean Charles Gilbert. Automatic differentiation and iterative processes. Optimization methods and software, 1(1):13 21, 1992.

Raja Giryes, Yonina C. Eldar, Alex M. Bronstein, and Guillermo Sapiro. Tradeoffs between convergence speed and reconstruction accuracy in inverse problems. IEEE Transactions on Signal Processing, 66(7): 1676 1690, 2018.

Karol Gregor and Yann Le Cun. Learning fast approximations of sparse coding. In Proceedings of international conference on international conference on machine learning, pp. 399 406, 2010.

Elaine T Hale, Wotao Yin, and Yin Zhang. A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University, 43:44, 2007.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.

John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. ar Xiv:1409.2574, pp. 1 27, 2014.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of Annual ACM Symposium on Theory of Computing, pp. 665 674, 2013. ISBN 9781450320290.

Published in Transactions on Machine Learning Research (08/2022)

Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, and Francis Bach. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research, 12:2297 2334, 2011.

Taehwan Kim, Gregory Shakhnarovich, and Raquel Urtasun. Sparse coding for learning interpretable spatio-temporal primitives. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta (eds.), Proceedings of Advances in Neural Information Processing Systems, volume 23, 2010.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv:1412.6980, 2014.

Yann A Le Cun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9 48. Springer, 2012.

Yuelong Li, Mohammad Tofighi, Junyi Geng, Vishal Monga, and Yonina C. Eldar. Efficient and interpretable deep blind image deblurring via algorithm unrolling. IEEE Transactions on Computational Imaging, 6: 666 681, 2020.

Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Local linear convergence of forward backward under partial smoothness. Proceedings of Advances in neural information processing systems, 27, 2014.

Jialin Liu and Xiaohan Chen. Alista: Analytic weights are as good as learned weights in lista. In Proceedings of International Conference on Learning Representations, 2019.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of Annual International Conference on Machine Learning, pp. 689 696, 2009a.

Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis Bach. Supervised dictionary learning. In Proceedings of Advances in Neural Information Processing Systems, volume 21, pp. 1 8, 2009b.

Benoît Malézieux, Thomas Moreau, and Matthieu Kowalski. Understanding approximate and unrolled dictionary learning for pattern recovery. In Proceedings of International Conference on Learning Representations, 2022.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of IEEE International Conference on Computer Vision, volume 2, pp. 416 423, 2001.

Sreyas Mohan, Zahra Kadkhodaie, Eero P Simoncelli, and Carlos Fernandez-Granda. Robust and interpretable blind image denoising via bias-free convolutional neural networks. In Proceedings of International Conference on Learning Representations, 2019.

Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. ar Xiv:1912.10557, pp. 1 27, 2019.

Thomas Moreau and Joan Bruna. Understanding trainable sparse coding via matrix factorization. In Proceedings of 5th International Conference on Learning Representations, pp. 1 13, 2017.

Thanh V Nguyen, Raymond KW Wong, and Chinmay Hegde. On the dynamics of gradient descent for autoencoders. In Proceedings of International Conference on Artificial Intelligence and Statistics, pp. 2858 2867. PMLR, 2019.

Kenji Nose-Filho, Andre Kazuo Takahata, Renato Lopes, and Joao Marcos Travassos Romano. Improving sparse multichannel blind deconvolution with correlated seismic data: Foundations and further results. IEEE Signal Processing Magazine, 35(2):41 50, 2018.

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311 3325, 1997.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127 239, 2014.

Published in Transactions on Machine Learning Research (08/2022)

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

Sirisha Rambhatla, Xingguo Li, and Jarvis Haupt. Noodl: Provable online dictionary learning and sparse coding. In Proceedings of International Conference on Learning Representations, pp. 1 11, 2018.

Akshay Rangamani, Anirbit Mukherjee, Amitabh Basu, Ashish Arora, Tejaswini Ganapathi, Sang Chin, and Trac D. Tran. Sparse coding and autoencoders. In Proceedings of IEEE International Symposium on Information Theory (ISIT), pp. 36 40, 2018.

Marc aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann Cun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.

Marc aurelio Ranzato, Y-lan Boureau, and Yann Cun. Sparse feature learning for deep belief networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis (eds.), Proceedings of Advances in Neural Information Processing Systems, volume 20, 2008.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. The Journal of Machine Learning Research, 5:941 973, 2004.

Christian J. Schuler, Michael Hirsch, Stefan Harmeling, and Bernhard Schölkopf. Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1439 1451, 2016.

Dror Simon and Michael Elad. Rethinking the csc model for natural images. In Proceedings of Advances in Neural Information Processing Systems, volume 32, pp. 1 11, 2019.

Oren Solomon, Regev Cohen, Yi Zhang, Yi Yang, Qiong He, Jianwen Luo, Ruud J. G. van Sloun, and Yonina C. Eldar. Deep unfolded robust pca with application to clutter suppression in ultrasound. IEEE Transactions on Medical Imaging, 39(4):1051 1063, 2020.

Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Proceedings of Annual Conference on Learning Theory, volume 23 of PMRL, pp. 37.1 37.18, 2012.

Pablo Sprechmann, Alex Bronstein, and Guillermo Sapiro. Learning efficient structured sparse models. In Proceedings of International Coference on International Conference on Machine Learning, pp. 219 226, 2012.

Shaozhe Tao, Daniel Boley, and Shuzhong Zhang. Local linear convergence of ista and fista on the lasso problem. SIAM Journal on Optimization, 26(1):313 336, 2016.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267 288, 1996. ISSN 00359246.

Ryan J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7(none):1456 1490, 2013.

Bahareh Tolooshams, Sourav Dey, and Demba Ba. Scalable convolutional dictionary learning with constrained recurrent sparse auto-encoders. In 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1 6, 2018.

Bahareh Tolooshams, Andrew Song, Simona Temereanca, and Demba Ba. Convolutional dictionary learning based auto-encoders for natural exponential-family distributions. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 9493 9503. PMLR, 13 18 Jul 2020.

Bahareh Tolooshams, Sourav Dey, and Demba Ba. Deep residual autoencoders for expectation maximizationinspired dictionary learning. IEEE Transactions on Neural Networks and Learning Systems, 32(6):2415 2429, 2021a.

Published in Transactions on Machine Learning Research (08/2022)

Bahareh Tolooshams, Satish Mulleti, Demba Ba, and Yonina C Eldar. Unfolding neural networks for compressive multichannel blind deconvolution. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2890 2894. IEEE, 2021b.

Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183 2202, 2009.

Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image superresolution with sparse prior. In Proceedings of IEEE International Conference on Computer Vision, pp. 370 378, 2015.

Brendt Wohlberg. Sporco: A python package for standard and convolutional sparse representations. In Proceedings of the 15th Python in Science Conference, Austin, TX, USA, pp. 1 8, 2017.

Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? In Proceedings of Advances in Neural Information Processing Systems, volume 29, pp. 1 9, 2016.

Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861 2873, 2010.

Chih-Kuan Yeh, Joon Sik Kim, Ian EH Yen, and Pradeep Ravikumar. Representer point selection for explaining deep neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 9311 9321, 2018.

Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142 3155, 2017a.

Lufang Zhang, Yaohua Hu, Chong Li, and Jen-Chih Yao. A new linear convergence result for the iterative soft thresholding algorithm. Optimization, 66(7):1177 1189, 2017b.

A Appendix - proofs

A.1 Notation

Bold-lower-case and upper-case letters refer to vectors d and matrices D. We use d(j) to denote the jth

element of the vector d, and Dj is the jth column of the matrix D. We denote the code estimate at unrolled layer t by zt. λ > 0 is the regularization (sparsity-enforcing) parameter. σmax(D) is the maximum singular value of D. When taking the derivatives or norms w.r.t the matrix D, we assume that D is vectorized. 1Lx(z, D) and 2Lx(z, D) are the first derivatives of the loss w.r.t z and D, respectively. 2 11Lx(z, D) is the second derivative of the loss w.r.t z. 2 21Lx(z, D) is the derivative of 1Lx(z, D) w.r.t D. The support of z is supp(z) {j : z(j) = 0}.

A.2 Basic definitions and Lemmas

We list four definitions used throughout the paper below.

Definition A.1 (µ-incoherence). D is µ-incoherent, i.e., for every pair (i, j) of columns, | Di, Dj | µ/ m.

Definition A.2 ((δ, κ)-closeness). Dictionary D is δ-close to D , i.e., there is a permutation π and sign flip operator u such that i u(i)Dπ(i) D i 2 δ. Additionally, D D 2 κ D 2.

Definition A.3 (Lipschitz function). A function f : Rm Rp is L-Lipschitz w.r.t a norm if L > 0 s.t. f(a) f(b) L a b a, b Rm.

Definition A.4 (Lipschitz differentiable function). A twice differentiable function f : Rm Rp is L-Lipschitz differentiable w.r.t a norm iff L > 0 s.t. 2f(a) L a Rm.

Definition A.5 (Strong convexity). A twice differentiable function f : Rm Rp is strongly convex if µ > 0 s.t. 2f(a) µI.

Published in Transactions on Machine Learning Research (08/2022)

Definition A.6 (Norm of subgradient). For norms involving subgradents, we define h(z) := maxv h(z) v .

In the proof of the theorems, we use the strong convexity of the reconstruction loss after support selection and the bounded property of the Lipschitz mapping stated below.

Lemma A.1 (Strong convexity of reconstruction loss). Given the support selection (Proposition 4.1), DT S DS is full-rank. Thus, t > B, Lx(zt, D) = Lx(zt,S, DS) is strongly convex (Definition A.5) in z.

Lemma A.2 (Lipschitz mapping). Given the recursion zt+1 = Φ(zt) = Pαλ(zt α 1Lx(zt, D)), from Lemma A.1, there exist B > 0 such that loss Lx(zt, D) is µ-strongly convex t > B. Hence, using Lemma 3.4, 1Φ(zt, D) 2 = (I α 2 11Lx(zt, D)) Pαλ(zt) 2 ρ (20)

where ρ cprox(1 αµ) < 1.

One key term, used in the proofs, is that 0 1Lx(ˆz, D) + h(ˆz) which is followed by the lasso optimality, i.e.,

Lemma A.3 (Lasso optimality). Lasso Karush-Kuhn-Tucker (KKT) optimality conditions are

ˆz arg min z Rp fx(z, D) DT(x Dˆz) λ ˆz 1, and |ˆz(j)| =

( {sign(ˆz(j)} if ˆz(j) = 0 [ 1, 1] if ˆz(j) = 0 , j {1, 2, . . . , p}.

A.3 Forward pass proof details

Given the µ-incoherence of D , and current dictionary closeness of δl, we re-state Lemma 3.1 and proof it below. It shows that the current dictionary is µl-close to D .

Lemma 3.1 (µl-incoherent). D(l) is µl-incoherent where µl = µ + 2 mδl.

D(l) i , D(l) j = D i , D j D i D(l) i , D j D(l) i , D j D(l) j

| D(l) i , D(l) j | µ/ m + D i + D(l) i 2 D j 2 + D(l) i 2 D j D(l) j 2 µ/ m + 2δl (22)

We re-state and proof the forward pass support recovery (Theorem 4.1). This shows that given proper initialization and under mild conditions, the support of the true code z is recovered with high probability in one iteration of the encoder.

Theorem 4.1 (Forward pass support recovery). Given Assumptions 3.3 and 3.4, suppose D(l) is δl = O (1/ log p) close to D . If s = O ( m/µ log m), and µ = O(log m), then with probability of at least 1 ϵ(l) supp-rec, the choice of λ0 = Cmin/4 recovers the support of the code z in one encoder iteration, i.e., sign(Sαλ0(αD(l)Tx) = sign(z ), where ϵ(l) supp-rec = 2p exp ( C2 min O (δ2 l )).

Proof. The code estimate after one iteration is z1 = Pαλ(αD(l)Tx) = sign(D(l)Tx)Re LU(α(|D(l)TD z | λ0)). We focus on the positive entries. The analysis for negative entries is similar. Writting the relation for i-th entry, z1,(i) = sign(D(l)Tx)Re LU(α( X

j S D(l) i , D j z (j) λ0))

= Re LU(α( D(l) i , D i z (i) + X

j S \{i} D(l) i , D j z (j) λ0)) (23)

We focus on the term inside Re LU and discard α, shared by all terms. We shows that under proper choice of λ0, D(l) i , D i z (i) is greater than λ0 and vi = P

j S \{i} D(l) i , D j z (j) is small with respect to λ0, hence

Published in Transactions on Machine Learning Research (08/2022)

getting cancelled by Re LU. The small value of vi, compared to D(l) i , D i z (i), results in sign(D(l)Tx) be equal to the sign(D(l)TD z ) which is equal to the sign of z .

Given the current dictionary distance D(l) i D i 2 δl, we can find a lower bound on D(l) i , D i z (i) as follows D(l) i , D i = 1

2( D i 2 2 + D(l) i 2 2 D(l) i D i 2 2) = 1 1

2 D(l) i D i 2 2

| D(l) i , D i | 1 δ2 l /2 (24)

Hence, for i S

| D(l) i , D i z (i)| (1 δ2 l /2)Cmin (25)

otherwise, it is 0. Given, var(z (i)) = 1 for i S , we find an upper bound on the variance vi of as follows

var(vi) = X

j S \{i} D(l) i , D j 2 = X

j S \{i} ( D i , D j + D(l) i D i , D j )2

j S \{i} 2( D i , D j 2 + D(l) i D i , D j 2) X

j S \{i} (2µ2/m) + 2 (D(l) i D i )TD S\{i} 2 2

(2sµ2/m) + 2 (D(l) i D i ) 2 2 D S \{i} 2 2 2(sµ2/m + 4δ2 l ) = O (δ2 l )

where we used the Gershgorin Circle Theorem for the bound D S\{i} 2 2. With the sub-Gaussian assumption on the coefficients z , we get the following using Chernoff bound concerning vi.

P(|vi| Cmin

4 ) 2 exp ( C2 min 4sµ2/m + 16δ2 l ) = 2 exp ( C2 min O (δ2 l )) (27)

Taking a union bound over all indices i [1, p] will result in

P(max i |vi| Cmin

4 ) 2p exp ( C2 min O (δ2 l )) := ϵ(l) supp-rec (28)

Hence, we can set λ0 = Cmin/2.

We re-state and prove the forward pass support preservation (Theorem 4.2).

Theorem 4.2 (Forward pass support preservation). Given Assumptions 3.3 and 3.4, suppose D(l) is δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer and step size are

chosen such that λ(l) t = µl m z zt 1 + aγ = Ω( s log m m ) and α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 , then with probability

of at least 1 ϵ(l) supp-pres, the support, recovered at the first iteration, is preserved through the encoder iterations. We have aγ = O( sδl) and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

Proof. Given current dictionary D(l), in each iteration of the forward pass, we have zt+1 = Pαλ(zt + αDT(D z Dzt). We focus on the entires that are non-negative. Then procedure for negative code entries is similar. We follow similar steps as in (Rambhatla et al., 2018). We get

zt+1,(j) = Re LU((I αD(l)TD(l))(j,:)zt + α(D(l)TD )(j,:)z αλ(l) t,j)

= Re LU((I αD(l)TD(l))(j,:)zt + α((D(l) D )TD )(j,:)z + α(D TD )(j,:)z αλ(l) t,j)

= Re LU((1 α)zt,(j) α X

i =j D(l) j , D(l) i zt,(i) + α (D(l) j D j ), D j z (j)

i =j D(l) j D j , D i z (i) + αz (j) + α X

i =j D j , D i z (i) αλ(l) t,j)

= Re LU((1 α)zt,(j) + α(1 β(l) j )z (j) + αη(l) t,j αλ(l) t,j)

Published in Transactions on Machine Learning Research (08/2022)

where β(l) j = D j D(l) j , D j , and η(l) t,j = P

i =j D(l) j , D(l) i zt,(i) + ( D(l) j D j , D i + D j , D i )z (i). With

D(l) j D j 2 δl, β(l) j can be bounded as follows

β(l) j = D j D(l) j , D j δ2 l /2 (30)

where we used the relation D(l) j D j 2 2 = 2(1 D(l) j , D j ). We re-write η(l) t,j below

η(l) t,j = X

i =j D(l) j , D(l) i zt,(i) + ( D(l) j D j , D i + D j , D i )z (i)

i =j D(l) j , D(l) i zt,(i) + X

i =j ( D(l) j D j , D i + D j , D i )z (i) + X

i =j D(l) j , D(l) i z (i) X

i =j D(l) j , D(l) i z (i)

i =j D(l) j , D(l) i (z (i) zt,(i)) + X

i =j ( D(l) j D j , D i + D j , D i D(l) j , D(l) i )z (i)

i =j D(l) j , D(l) i (z (i) zt,(i)) + X

i =j D(l) j , D i D(l) i z (i)

i =j D(l) j , D(l) i (z (i) zt,(i)) + γ(l) j

(31) where γ(l) j = P

i =j D(l) j , D i D(l) i z (i). Given the sub-Gaussian entries of the code z , we provide a bound

on the variance of γ(l) j below:

var(γ(l) j ) = X

i =j D(l) j , D i D(l) i 2 sδ2 l (32)

Now, using Chernoff bound on the sub-Gaussian code entries, we get

P(|γ(l) j | > a) 2 exp ( a2

2sδ2 l ) (33)

To bound all the terms in the support, for j S , we have

P(max |γ(l) j | > aγ) ϵ(l) γ (34)

where ϵ(l) γ = 2s exp ( a2 γ 2sδ2 l ). Let aγ = O( sδl), then ϵ(l) γ = 2s exp ( 1 O(δl)). The above analysis states that with

probability of at least 1 ϵ(l) γ , |γ(l) j | aγ = O( sδl). Next, we write the recursion for when the support is identified (see Theorem 4.1). For the code at iteration T, we have

z T,(j) = (1 α)T z0,(j) + z (j)

t=1 α(1 β(l) j )(1 α)T t +

t=1 α(η(l) t 1,j λ(l) t 1,j)(1 α)T t

= (1 α)T z0,(j) + z (j)(1 β(l) j )(1 (1 α)T ) +

t=1 α(η(l) t 1,j λ(l) t 1,j)(1 α)T t

= z (j)(1 β(l) j ) + (1 α)T (z0,(j) z (j)(1 β(l) j )) +

t=1 α(η(l) t 1,j λ(l) t 1,j)(1 α)T t

= z (j)(1 β(l) j ) + ζ(l) T,j

where ζ(l) T,j = (1 α)T (z0,(j) z (j)(1 β(l) j )) + PT t=1 α(η(l) t 1,j λ(l) t 1,j)(1 α)T t. With the support correctly identified at iteration t 1, we show that the support is preserved at iteration t. With z zt 1 = O(s), for each j S , we have

η(l) t,j = X

i =j D(l) j , D(l) i (z (i) zt,(i)) + γ(l) j µl m z zt 1 + aγ = O(s log m m ) (36)

Published in Transactions on Machine Learning Research (08/2022)

We make sure the regularizer is chosen such that

λt µl m z zt 1 + aγ (37)

We see that the larger the code error and coherence between the columns of the current dictionary, the larger λt should be. This is to suppress the noise component in the code recursion and make sure no false support is introduced. Furthermore, we want λt to be lower than half of the signal component, i.e.,

2 zt,(j) + α

2 (1 β(l) j )z (j), j S

2 zmin t + α

2 (1 δ2 l 2 )Cmin (38)

where zmin t = minj zt,(j). We further shrink the upper bound, given the code from previous iteration t 1 (i.e., αλt 1 zmin t ). Hence, we want the regularizer to follow

2 αλt 1 + α

2 (1 δ2 l 2 )Cmin

2(1 δ2 l 2 )Cmin

This condition is to make sure the identified supports are not killed in the recursion. We use the condition to set the step size α. We get

α 1 2λt (1 δ2 l 2 )Cmin λt 1 (40)

Hence, λt = Ω( s log m m ) and α should be chosen sufficiently small such that the condition above is met. We

denote ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)). Hence, with probability of at least

1 ϵ(l) supp-pres, the support, recovered at the first iteration, is preserved through the encoder iterations.

Theorem 4.1 and Theorem 4.2 allow to achieve linear convergence in the forward pass right after the first encoder iteration, i.e., B = 1. With support recovery at first iteration and its preservation, we now re-state the forward pass code convergence (Theorem 4.3).

Theorem 4.3 (Local forward pass code convergence). Given the encoder zt+1 = Φ(zt, D), Assumption 3.6, Lemmas 3.2, A.1 and A.2, then ρ < 1, B > 0 s.t. zt ˆz 2 O(ρt) t > B, where ˆz is the unique minimizer of lasso (1). Furthermore, given Theorem 4.1 and Theorem 4.2, B = 1.

Proof. Given the support selection at iteration B, from Lemma A.1, we have 2 11Lx(zt, D) µI restricted to the support for t > B. Then, from Lemma A.2, we get

1Φ(zt, D) 2 = (I α 2 11Lx(zt, D)) Pαλ(zt) 2 ρ

where ρ cprox(1 αµ) < 1. Hence, using fixed-point property (Lemma 3.2)

B > 0, s.t. zt+1 ˆz 2 = Φ(zt) Φ(ˆz) 2 ρ zt ˆz 2 t > B

where ˆz = arg min fx(z, D). Unrolling the recursion,

zt ˆz 2 ρt B z B ˆz 2.

Theorem 4.4 (Global forward pass code error with variable λt). Given Assumptions 3.3 and 3.4, suppose D(l)

is µl-incoherent and δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer

Published in Transactions on Machine Learning Research (08/2022)

and step size are chosen such that λ(l) t = µl m z zt 1 + aγ = Ω( s log m m ) and α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 ,

then with probability of at least 1 ϵ(l) supp-pres, for j supp(z ), the code coefficient error is

|z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + e(l)unroll t,j ) (8)

and z T,(j) = z (j)(1 β(l) j ) + ζ(l) T,j (9)

where e(l)unroll t,j := 2(s 1)tα µl m maxi |z(l) 0,(i) z (i)|δα,t 1 + |z(l) 0,(j) z (j)|δα,t, δα,t := (1 α + 2α µl m)t,

|ζ(l) T,j| = O(aγ) with aγ = O( sδl), β(l) j = D j D(l) j , D j δ2 l 2 and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ =

2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)). With appropriately large t, |z(l) t,(j) z (j)| = O( q

s D(l) j D j 2).

Proof. We define η(l) t,j := P

i =j | D(l) j , D(l) i ||z (i) zt,(i)| + γ(l) j and upper bound it as

η(l) t,j µl m

i =j Et,i + γ(l) j (41)

where Et,i := |z (i) zt,(i)|. Given (21), we re-write the code recursion

zt+1,(j) = Pαλ(l) t,j((1 α)zt,(j) + α(1 β(l) j )z (j) + αη(l) t,j)

(1 α)zt,(j) + α(1 β(l) j )z (j) + α(η(l) t,j λ(l) t,j |zt+1,(j)|)

Et+1,j = |zt+1,(j) z (j)| (1 α)Et,j + αβ(l) j |z (j)| + α( η(l) t,j + λ(l) t,j)

Opening up the recursion, we get

Et+1,j E0,j

q=0 (1 α) + β(l) j |z (j)|

q=a (1 α) +

a=1 α( η(l) a 1,j + λ(l) a 1,j)

q=a (1 α) (43)

where we define Qa q=a(1 α) = 1. Using the upper bound from (41) and λ(l) t,j from Theorem 4.2, we get

Et+1,j vt+1,j + 2 µl mα

i =j Ea 1,i

q=a (1 α) (44)

where vt+1,j := E0,j Qt q=0(1 α) + (β(l) j |z (j)| + 2γ(l) j ) Pt+1 a=1 α Qt+1 q=a(1 α). We now derive the general upper bound on Et+1,j as follows E1,i1 v1,i1 + 2α µl m

i2 =i1 E0,i2 (45)

For E2,i1, we have E2,i1 v2,i1 + 2α µl m( X

i2 =i1 E1,i2 + X

i2 =i1 E0,i2(1 α)) (46)

Substituting E1,i1,

E2,i1 v2,i1 + 2α µl m( X

i2 =i1 (v1,i2 + 2α µl m

i3 =i2 E0,i3) + X

i2 =i1 E0,i2(1 α)) (47)

For E3,i1, we have

E3,i1 v3,i1 + 2α µl m

i2 =i1 Ea 1,i2(1 α)3 a (48)

Published in Transactions on Machine Learning Research (08/2022)

Unrolling the recursion,

E3,i1 v3,i1 + 2α µl m

i2 =i1 v2,i2 + 2α µl m

i2 =i1 v1,i2 + 2α µl m

i3 =i2 v1,i3

i2 =i1 E0,i2 + 2(1 α)(2α µl m) X

i3 =i2,i1 E0,i3 + (2α µl m)2 X

i4 =i3,i2,i1 E0,i4

Given above, we can write up the relation as

E3,i1 v3,i1 + 2α µl m

(s 1)v2,i + v1,i((1 α)(s 1) + 2α µl m(s 2))

+ 2α µl m E0,i

(s 1)(1 α)2 + 2(1 α)(2α µl m)(s 2) + (2α µl m)2(s 3)

v3,i1 + (s 1)2α µl m

v2,i + v1,i((1 α) + 2α µl m)

+ 2(s 1)α µl m E0,i

(1 α)2 + 2(1 α)(2α µl m) + (2α µl m)2

Following similar steps for E4,i1, we get,

E4,i1 v4,i1 + (s 1)2α µl m

v3,i + v2,i(1 α + 2α µl m) + v1,i((1 α)2 + 2(1 α)(2α µl m) + (2α µl m)2)

+ 2(s 1)α µl m E0,i

(1 α)3 + 3(1 α)(2α µl m)2 + 3(1 α)2(2α µl m) + (2α µl m)3

(51) This leads to the term

E4,i1 v4,i1 + (s 1)2α µl m

v3,i + v2,i(1 α + 2α µl m)1 + v1,i(1 α + 2α µl m)2

+ 2(s 1)α µl m E0,i(1 α + 2α µl m)3 (52)

Hence, the general term for code error at t layer is

Et+1,j vt+1,j + 2(s 1)α µl m

a=1 va,max(1 α + 2α µl m)t a + 2(s 1)α µl m E0,max(1 α + 2α µl m)t

(53) where for j in the support, we define the upper bounds va,j va,max and E0,j E0,max. Next, we define (1 α)t δα,t := (1 α + 2α µl m)t, and use it to find an upper bound on the expression Pt a=1 va,max(1 α + 2α µl m)t a. We have

vt,j = E0,j(1 α)t + (β(l) j |z (j)| + 2γ(l) j )

k=1 α(1 α)t k+1 (54)

We bound the expression

a=1 va,j(1 α + 2α µl m)t a

a=1 (E0,j(1 α)a + (β(l) j |z (j)| + 2γ(l) j )

k=1 α(1 α)a k+1)(1 α + 2α µl m)t a

E0,jtδα,t + (β(l) j |z (j)| + 2γ(l) j )

a=1 (1 α + 2α µl m)t a a X

k=1 α(1 α)a k+1

Published in Transactions on Machine Learning Research (08/2022)

Using sum of geometric series, we write Pt a=1(1 α + 2α µl m)t a = 1 (1 α+2α µl m )t

α 2α µl m 1 α(1 2 µl m ). Hence,

using (30) and (34), with probability of at least 1 ϵ(l) γ , we have

a=1 va,max(1 α + 2α µl m)t a E0,maxtδα,t + 1 α(1 2 µl m)(δ2 l 2 |z max| + 2aγ) (56)

Hence, we bound the code error on the coefficients as following

Et+1,j vt+1,j + 2(s 1) µl m( 1 (1 2 µl m)(δ2 l 2 |z max| + 2aγ)) + 2(t + 1)(s 1)α µl m E0,maxδα,t (57)

Next, we further simplify the first term

vt+1,j = E0,j(1 α)t+1 + (β(l) j |z (j)| + 2γ(l) j )

k=1 α(1 α)t k+1 E0,jδα,t+1 + δ2 l 2 |z max| + 2aγ (58)

Substituting the above upper bound into the upper bound for Et+1,j, we get

Et+1,j E0,jδα,t+1 + (1 + 2(s 1)κl)(δ2 l 2 |z max| + 2aγ) + 2(t + 1)(s 1)α µl m E0,maxδα,t (59)

where κl := µl m( 1 (1 2 µl m ). Given s = O ( m/µ log m), we have (s 1)κl < 1. Hence, with probability of at

least 1 ϵ(l) γ , we have

Et,j O(aγ) + 2(s 1)tα µl m E0,maxδα,t 1 + E0,jδα,t

|z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + e(l)unroll t,j ) (60)

where e(l)unroll t,j := 2(s 1)tα µl m maxi |z(l) 0,(i) z (i)|δα,t 1 +|z(l) 0,(j) z (j)|δα,t. With appropriately large unrolled

layer t, e(l)unroll t,j 0. Hence, for the code error on non-zero coefficients, we get

|z(l) t,(j) z (j)| = O( q

s D(l) j D j 2) (61)

for large enough t. Now, we try to prove the relation for z T,(j). For shrinkage, we re-write (35)

z T,(j) z (j)(1 β(l) j ) + ζ(l) T,j (62)

where ζ(l) T,j = κ(l) T,j +PT t=1 α(η(l) t 1,j λ(l) t 1,j |zt,(j)|)(1 α)T t and κ(l) T,j := (1 α)T (z0,(j) z (j)(1 β(l) j )). κ(l) T,j decays very fast as T increases. Hence, we bound the second term. We substitute η(l) t,j = P

i =j D(l) j , D(l) i (z (i)

zt,(i)) + γ(l) j in ζ(l) T,j.

ζ(l) T,j κ(l) T,j +

t=1 α(η(l) t 1,j λ(l) t 1,j |zt,(j)|)(1 α)T t

i =j D(l) j , D(l) i (z (i) zt 1,(i)) + 2γ(l) j λ(l) t 1,j |zt,(j)|)(1 α)T t

κ(l) T,j + 2γ(l) j

t=1 α(1 α)T t + 2α µl m

i =j Et 1,j(1 α)T t

κ(l) T,j + 2γ(l) j + 2(s 1)α µl m

t=1 Et 1,j(1 α)T t

Published in Transactions on Machine Learning Research (08/2022)

Given above, we find an upper bound on Et 1,j(1 α)T t below. From analysis in Theorem 4.4, we have

Et 1,j vt 1,j + 2(s 1)α µl m

a=1 va,max(1 α + 2α µl m)t a 2

+ 2(s 1)α µl m E0,max(1 α + 2α µl m)t 2

Et 1,j(1 α)T t vt 1,j(1 α)T t + 2(s 1)α µl m

a=1 va,max(1 α + 2α µl m)t a 2(1 α)T t

+ 2(s 1)α µl m E0,max(1 α + 2α µl m)t 2(1 α)T t

Re-write the first term,

vt 1,j(1 α)T t = (E0,j(1 α)t 1 + (β(l) j |z (j)| + 2γ(l) j )

k=1 α(1 α)t k 1)(1 α)T t

= E0,j(1 α)T 1 + (β(l) j |z (j)| + 2γ(l) j )

k=1 α(1 α)T k 1 (65)

va,max = E0,max(1 α)a + (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)a k (66)

a=1 va,max(1 α + 2α µl m)t a 2 =

a=1 E0,max(1 α)a(1 α + 2α µl m)t a 2

a=1 (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)a k(1 α + 2α µl m)t a 2

a=1 E0,max(1 α + 2α µl m)t 2

+ (β(l) max|z max| + 2γ(l) max)

a=1 (1 α + 2α µl m)t a 2

(1 α)T t t 2 X

a=1 va,max(1 α + 2α µl m)t a 2

a=1 E0,max(1 α + 2α µl m)t 2(1 α)T t

+ (β(l) max|z max| + 2γ(l) max)

a=1 (1 α + 2α µl m)t a 2(1 α)T t

(t 2)E0,max(1 α + 2α µl m)T 2

+ (β(l) max|z max| + 2γ(l) max) (1 α)T t

α(1 2 µl m) (68)

Published in Transactions on Machine Learning Research (08/2022)

Combining all terms, we get

Et 1,j(1 α)T t vt 1,j(1 α)T t + 2(s 1)α µl m

a=1 va,max(1 α + 2α µl m)t a 2(1 α)T t

+ 2(s 1)α µl m E0,max(1 α + 2α µl m)t 2(1 α)T t

E0,j(1 α)T 1 + 2(s 1)α µl m(t 1)E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 + 2(s 1)κl(1 α)T t !

(1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 + 2(s 1)κl(1 α)T t !

µl m (1 2 µl m ). Moreover, we bound

k=1 α(1 α)T k 1 =

t=1 α(1 α)T t 1 (1 α)t 1

t=1 (1 α)T t 1

Finally, we are ready to write the bound for ζ(l) T,j

|ζ(l) T,j| κ(l) T,j + 2|γ(l) j | + 2(s 1) µl m(β(l) max|z max| + 2γ(l) max)(1 + 2(s 1)κl)

t=1 2(s 1)α µl m((1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2) (71)

Given |γ(l) j | = aγ with probability of 1 ϵ(l) γ where aγ = sδl and s = O ( m/µ log m), we will have

|ζ(l) T,j| O( q

s D(l) j D j 2) (72)

Theorem 4.5 (Global forward pass code error with fixed λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µl-incoherent and δl = O (1/ log p) close to D . If s = O ( m/µ log m), µ = O(log m), and the regularizer and step size are chosen such that λ(l) t = λfixed = µl m z z0 1 + aγ = Ω( s log m m ) and

α(l) 1 2λ(l) t (1 δ2 l 2 )Cmin λ(l) t 1 , then with probability of at least 1 ϵ(l) supp-pres, for j supp(z ), the code coefficient

|z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + e(l)unroll,fixed t,j + λfixed) (10)

and z T,(j) = z (j)(1 β(l) j ) + ζ(l) T,j (11)

where e(l)unroll, fixed t,j := (s 1)tα µl m maxi |z(l) 0,(i) z (i)|δfixed α,t 1 + |z(l) 0,(j) z (j)|δfixed α,t , δfixed α,t := (1 α + α µl m)t,

|ζ(l) T,j| = O(aγ + λfixed) with aγ = O( sδl), β(l) j = D j D(l) j , D j δ2 l 2 , and ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ =

2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)). With appropriately large t, |z(l) t,(j) z (j)| = O( q

s D(l) j D j 2 + λfixed).

Published in Transactions on Machine Learning Research (08/2022)

Proof. We denote the regularization used in all layers λj 1 = λj 1 = = λp 1 = λfixed. We assume that there exists such λfixed that meets the lower bounds of regularization and also allow to pick an α > 0 according to Theorem 4.2. We define η(l) t,j := P

i =j | D(l) j , D(l) i ||z (i) zt,(i)| + γ(l) j and upper bound it as

η(l) t,j µl m

i =j Et,i + γ(l) j (73)

where Et,i := |z (i) zt,(i)|. Given (21), we re-write the code recursion

zt+1,(j) = Pαλ(l) t,j((1 α)zt,(j) + α(1 β(l) j )z (j) + αη(l) t,j)

(1 α)zt,(j) + α(1 β(l) j )z (j) + α(η(l) t,j λ(l) t,j |zt+1,(j)|)

Et+1,j = |zt+1,(j) z (j)| (1 α)Et,j + αβ(l) j |z (j)| + α( η(l) t,j + λ(l) t,j)

Opening up the recursion, we get

Et+1,j E0,j

q=0 (1 α) + β(l) j |z (j)|

q=a (1 α) +

a=1 α( η(l) a 1,j + λ(l) a 1,j)

q=a (1 α) (75)

where we define Qa q=a(1 α) = 1. Using the upper bounds from (73) and λ(l) t,j from Theorem 4.2, we get

Et+1,j vt+1,j + rt+1,j +

i =j Ea 1,i

q=a (1 α) (76)

where vt+1,j := E0,j Qt q=0(1 α)+(β(l) j |z (j)|+γ(l) j ) Pt+1 a=1 α Qt+1 q=a(1 α) and rt+1,j := Pt+1 a=1 αλ(l) a 1,j Qt+1 q=a(1 α). Following similar steps in Theorem 4.4, we now derive the general upper bound on Et+1,j as follows

E1,i1 v1,i1 + r1,i1 + α µl m

i2 =i1 E0,i2 (77)

For E2,i1, we have

E2,i1 v2,i1 + r2,i1 + α µl m( X

i2 =i1 E1,i2 + X

i2 =i1 E0,i2(1 α)) (78)

Substituting E1,i1,

E2,i1 v2,i1 + r2,i1 + α µl m( X

i2 =i1 (v1,i2 + r1,i2 + α µl m

i3 =i2 E0,i3) + X

i2 =i1 E0,i2(1 α)) (79)

For E3,i1, we have

E3,i1 v3,i1 + r3,i1 + α µl m

i2 =i1 Ea 1,i2(1 α)3 a (80)

Unrolling the recursion,

E3,i1 v3,i1 + r3,i1 + α µl m

i2 =i1 (v2,i2 + r2,i2) + α µl m

i2 =i1 (v1,i2 + r1,i2) + α µl m

i3 =i2 (v1,i3 + r1,i3)

i2 =i1 E0,i2 + 2(1 α)(α µl m) X

i3 =i2,i1 E0,i3 + (α µl m)2 X

i4 =i3,i2,i1 E0,i4

Published in Transactions on Machine Learning Research (08/2022)

Given above, we can write up the relation as

E3,i1 v3,i1 + r3,i1 + α µl m

(s 1)(v2,i + r2,i) + (v1,i + r1,i)((1 α)(s 1) + α µl m(s 2))

+ α µl m E0,i

(s 1)(1 α)2 + 2(1 α)(α µl m)(s 2) + (α µl m)2(s 3)

v3,i1 + r3,i1 + (s 1)α µl m

v2,i + r2,i + (v1,i + r1,i)((1 α) + α µl m)

+ (s 1)α µl m E0,i

(1 α)2 + 2(1 α)(α µl m) + (α µl m)2

We denote ut+1,i := vt+1,i + rt+1,i and following similar steps for E4,i1, we get,

E4,i1 u4,i1 + (s 1)α µl m

u3,i + u2,i(1 α + α µl m) + u1,i((1 α)2 + 2(1 α)(α µl m) + (α µl m)2)

+ (s 1)α µl m E0,i

(1 α)3 + 3(1 α)(α µl m)2 + 3(1 α)2(α µl m) + (α µl m)3

(83) This leads to the term

E4,i1 u4,i1 + (s 1)α µl m

u3,i + u2,i(1 α + α µl m)1 + u1,i(1 α + α µl m)2

+ (s 1)α µl m E0,i(1 α + α µl m)3 (84)

Hence, the general term for code error at t layer is

Et+1,j ut+1,j + (s 1)α µl m

a=1 (va,max + ra,max)(1 α + α µl m)t a + (s 1)α µl m E0,max(1 α + α µl m)t

(85) where for j in the support, we define the upper bounds va,j va,max, ra,j ra,max, and E0,j E0,max. Next, we define (1 α)t δfixed α,t := (1 α + α µl m)t, and use it to find an upper bound on the two expressions Pt a=1 va,max(1 α + α µl m)t a and Pt a=1 ra,max(1 α + α µl m)t a. The following bound can be achieved similar to the steps in Theorem 4.4

a=1 va,max(1 α + α µl m)t a E0,maxtδfixed α,t + 1 α(1 µl m)(δ2 l 2 |z max| + aγ) (86)

Hence, we focus on Pt a=1 ra,max(1 α + α µl m)t a next. First, we rewrite

a=1 αλ(l) a 1,j

q=a (1 α) (87)

We replace all λ(l) t,j with a fixed one λfixed, and write

a=1 ra,j(1 α + α µl m)t a

a=1 λfixed a X

k=1 α(1 α)a k+1)(1 α + α µl m)t a

a=1 (1 α + α µl m)t a a X

k=1 α(1 α)a k+1 (88)

Using sum of geometric series, we write Pt a=1(1 α + α µl m)t a = 1 (1 α+α µl m )t

α α µl m 1 α(1 µl m ). Hence, we get

a=1 ra,max(1 α + α µl m)t a λfixed 1 α(1 µl m) (89)

Published in Transactions on Machine Learning Research (08/2022)

Hence, we bound the code error on the coefficients as following

Et+1,j ut+1,j + (s 1) µl m( 1 (1 µl m)(δ2 l 2 |z max| + aγ + λfixed)) + (t + 1)(s 1)α µl m E0,maxδfixed α,t (90)

Next, we further simplify the first term. From before, we have

vt+1,j = E0,j(1 α)t+1 + (β(l) j |z (j)| + γ(l) j )

k=1 α(1 α)t k+1 E0,jδfixed α,t+1 + δ2 l 2 |z max| + aγ (91)

rt+1,j = αλfixed t+1 X

k=1 (1 α)t k+1 λfixed (92)

Substituting the above upper bound into the upper bound for Et+1,j, we get

Et+1,j E0,jδfixed α,t+1 + (1 + (s 1)κfixed l )(δ2 l 2 |z max| + aγ + λfixed) + (t + 1)(s 1)α µl m E0,maxδfixed α,t (93)

where κfixed l := µl m( 1 (1 µl m ). Given s = O ( m/µ log m), we have (s 1)κfixed l < 1. Hence, with probability

of at least 1 ϵ(l) γ , we have

Et,j O(aγ + λfixed) + (s 1)tα µl m E0,maxδfixed α,t 1 + E0,jδfixed α,t

|z(l) t,(j) z (j)| O( q

s D(l) j D j 2 + λfixed + e(l)unroll, fixed t,j ) (94)

where e(l)unroll, fixed t,j := (s 1)tα µl m maxi |z(l) 0,(i) z (i)|δfixed α,t 1 + |z(l) 0,(j) z (j)|δfixed α,t . With appropriately large

unrolled layer t, e(l)unroll, fixed t,j 0. Hence, for the code error on non-zero coefficients, we get

|z(l) t,(j) z (j)| = O( q

s D(l) j D j 2 + λfixed) (95)

for large enough t. Now, we provide the relation for z T,(j). We re-write (35)

z T,(j) z (j)(1 β(l) j ) + ζ(l) T,j (96)

where ζ(l) T,j = κ(l) T,j +PT t=1 α(η(l) t 1,j λ(l) t 1,j |zt,(j)|)(1 α)T t and κ(l) T,j := (1 α)T (z0,(j) z (j)(1 β(l) j )). κ(l) T,j decays very fast as T increases. Hence, we bound the second term. We substitute η(l) t,j = P

i =j D(l) j , D(l) i (z (i)

zt,(i)) + γ(l) j in ζ(l) T,j.

ζ(l) T,j κ(l) T,j +

t=1 α(η(l) t 1,j λ(l) t 1,j |zt,(j)|)(1 α)T t

i =j D(l) j , D(l) i (z (i) zt 1,(i)) + γ(l) j λ(l) t 1,j |zt,(j)|)(1 α)T t

κ(l) T,j + (γ(l) j + λfixed)

t=1 α(1 α)T t + α µl m

i =j Et 1,j(1 α)T t

κ(l) T,j + γ(l) j + λfixed + (s 1)α µl m

t=1 Et 1,j(1 α)T t

Published in Transactions on Machine Learning Research (08/2022)

Given above, we find an upper bound on Et 1,j(1 α)T t below. From analysis in Theorem 4.4, we have

Et 1,j vt 1,j + (s 1)α µl m

a=1 va,max(1 α + α µl m)t a 2

+ (s 1)α µl m E0,max(1 α + α µl m)t 2

Et 1,j(1 α)T t vt 1,j(1 α)T t + 2(s 1)α µl m

a=1 va,max(1 α + α µl m)t a 2(1 α)T t

+ (s 1)α µl m E0,max(1 α + α µl m)t 2(1 α)T t

Re-write the first term,

vt 1,j(1 α)T t = (E0,j(1 α)t 1 + (β(l) j |z (j)| + γ(l) j + λfixed)

k=1 α(1 α)t k 1)(1 α)T t

= E0,j(1 α)T 1 + (β(l) j |z (j)| + γ(l) j + λfixed)

k=1 α(1 α)T k 1 (99)

va,max = E0,max(1 α)a + (β(l) max|z max| + γ(l) max + λfixed)

k=1 α(1 α)a k (100)

a=1 va,max(1 α + α µl m)t a 2 =

a=1 E0,max(1 α)a(1 α + α µl m)t a 2

a=1 (β(l) max|z max| + γ(l) max + λfixed)

k=1 α(1 α)a k(1 α + α µl m)t a 2

a=1 E0,max(1 α + α µl m)t 2

+ (β(l) max|z max| + γ(l) max + λfixed)

a=1 (1 α + α µl m)t a 2

(101) Hence,

(1 α)T t t 2 X

a=1 va,max(1 α + α µl m)t a 2

a=1 E0,max(1 α + α µl m)t 2(1 α)T t

+ (β(l) max|z max| + γ(l) max + λfixed)

a=1 (1 α + α µl m)t a 2(1 α)T t

(t 2)E0,max(1 α + α µl m)T 2

+ (β(l) max|z max| + γ(l) max + λfixed)(1 α)T t

α(1 µl m) (102)

Published in Transactions on Machine Learning Research (08/2022)

Combining all terms, we get

Et 1,j(1 α)T t vt 1,j(1 α)T t + (s 1)α µl m

a=1 va,max(1 α + α µl m)t a 2(1 α)T t

+ (s 1)α µl m E0,max(1 α + α µl m)t 2(1 α)T t

E0,j(1 α)T 1 + (s 1)α µl m(t 1)E0,maxδα,T 2

+ (β(l) max|z max| + γ(l) max + λfixed)

k=1 α(1 α)T k 1 + (s 1)κl(1 α)T t !

(1 + (s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + γ(l) max + λfixed)

k=1 α(1 α)T k 1 + (s 1)κl(1 α)T t !

where κfixed l =

µl m (1 µl m ). Moreover, we bound

k=1 α(1 α)T k 1 =

t=1 α(1 α)T t 1 (1 α)t 1

t=1 (1 α)T t 1

Finally, we are ready to write the bound for ζ(l) T,j

|ζ(l) T,j| κ(l) T,j + |γ(l) j | + λfixed + (s 1) µl m(β(l) max|z max| + γ(l) max + λfixed)(1 + (s 1)κfixed l )

t=1 (s 1)α µl m((1 + (s 1)α µl m(t 1))E0,maxδα,T 2) (105)

Given |γ(l) j | = aγ with probability of 1 ϵ(l) γ where aγ = sδl and s = O ( m/µ log m), we will have

|ζ(l) T,j| O( q

s D(l) j D j 2 + λfixed) (106)

We now re-state the forward pass Jacobian (Theorem 4.7) convergence.

Theorem 4.7 (Local forward pass Jacobian convergence). Given the recursion zt+1 = Φ(zt, D), and ˆz the unique minimizer of lasso with Jacobian ˆ J, then ρ < 1, B > 0 s.t. Jt ˆ J 2 O(tρt) t > B. Furthermore, given Theorem 4.1 and Theorem 4.2, B = 1.

Proof. Differentiating the recursion,

Jt+1 = 1Φ(zt, D)TJt + 2Φ(zt, D)T.

Similarly, ˆ J = 1Φ(ˆz, D)T ˆ J + 2Φ(ˆz, D)T

where ˆz is a minimizer of lasso and fixed-point of the mapping (see Lemma 3.2). Subtract the terms

Jt+1 ˆ J = 1Φ(zt, D)T(Jt ˆ J) + ( 1Φ(zt, D) 1Φ(ˆz, D))T ˆ J + ( 2Φ(zt, D) 2Φ(ˆz, D))T

Published in Transactions on Machine Learning Research (08/2022)

Given the Lipschitz properties of L and h, we can further get the upper bounds on 1Φ(a, D) 1Φ(b, D) 2 LΦ1 b a 2 and 2Φ(a, D) 2Φ(b, D) 2 LΦ2 b a 2. Hence, with upper bound on the norm of Jacobian (Assumption 4.1), there exists B > 0 such that t > B

Jt+1 ˆ J 2 1Φ(zt, D) 2 Jt ˆ J 2 + 1Φ(zt, D) 1Φ(ˆz, D) 2 ˆ J 2 + 2Φ(zt, D) 2Φ(ˆz, D) 2

ρ Jt ˆ J 2 + c zt ˆz 2

where c MJLΦ1 + LΦ2. Hence,

Jt+1 ˆ J 2 ρ Jt ˆ J 2 + O(ρt).

Unrolling the recursion, Jt+1 ˆ J 2 O((t + 1)ρt).

Theorem 4.6 (Global forward pass code error). Let ˆz be the fixed-point of the encoder with iterations zt+1 = Φ(zt, D). Given Assumption 3.6, Lemmas 3.2, A.1 and A.2, we have ˆz z 2 O( D D 2 + ˆδ ), where ˆδ = ˆz z 2, ˆz is the unique minimizer of lasso (1) given the dictionary D, ˆz is the unique minimizer of lasso (1) given the dictionary D , and z is the ground-truth code.

Proof. We first find the error between ˆz and ˆz which is the unique minimizer of lasso (1) given the true dictionary D . Using fixed-point property (Lemma 3.2), we get

ˆz ˆz 2 = Φ(ˆz, D) Φ(ˆz , D ) 2 Φ(ˆz, D) Φ(ˆz , D) 2 + Φ(ˆz , D) Φ(ˆz , D ) 2 (107)

Using the µ-strongly convexity of Lx(zt, D) on the support, and L21 Lipschitz constants of 2 21Lx(z, D), we upper bound the term as follows:

ˆz ˆz 2 ρ ˆz ˆz 2 + αL21cprox D D 2 (108)

Where ρ cprox(1 αµ) < 1. Denote q αcprox L21

1 ρ which can be made to be small with proper choice of step size α. ˆz ˆz 2 q D D 2 (109)

Hence, we get the following code error

ˆz z 2 ˆz ˆz 2 + ˆz z 2 q D D 2 + ˆδ O( D D 2 + ˆδ ) (110)

Theorem 4.8 (Global forward pass Jacobian error). Let ˆz be the fixed-point of the encoder with iterations zt+1 = Φ(zt, D). Given Assumption 3.6, Lemmas 3.2, A.1 and A.2, we have ˆ J J 2 O( D D 2+ˆδ J), where ˆδ J := ˆ J J 2, and ˆ J, ˆ J and J are Jacobians corresponding to ˆz, ˆz and z .

Proof. First, we define J . For z , we define the mapping function z Lx(z, D), where z (D) is its minimizer evaluated at D , i.e., 1Lx(z , D ) = D T(D z x) = 0 given the generative model (x = D z ). Hence, we define the Jacobian J = z (D)

D |D=D . From implicit function theorem, we get

J + 2 11Lx(z , D ) + 2 21Lx(z , D ) = 0

which is later used in the global backward pass analysis. Alternatively, if 2 11Lx(z , D ) is invertible, then we can compute J + as follows:

J + = 2 21Lx(z , D ) 2 11Lx(z , D ) 1

Published in Transactions on Machine Learning Research (08/2022)

The Jacobian w.r.t row i of the dictionary is

J (i,:) = (D T S D S ) 1(D i,:z T + (D T i,: z xi)Ip)S

on the support S of z . Outside of the support, it is zero. Now, given the recursion zt+1 = Φ(zt, D), we differentiate the recursion, Jt+1 = 1Φ(zt, D)TJt + 2Φ(zt, D)T.

Hence, we have ˆ J = 1Φ(ˆz, D)T ˆ J + 2Φ(ˆz, D)T

ˆ J = 1Φ(ˆz , D )T ˆ J + 2Φ(ˆz , D )T

where ˆ J is the Jacobian of ˆz . Then, following similar step to Theorem 4.7, we can write

ˆ J ˆ J = 1Φ(ˆz, D)T( ˆ J ˆ J ) + ( 1Φ(ˆz, D) 1Φ(ˆz , D ))T ˆ J + ( 2Φ(ˆz, D) 2Φ(ˆz , D ))T

With respect to D, we denote the Lipschitz constants of 1Φ(ˆz, D) and 2Φ(ˆz, D) with LΦ1D and LΦ2D, respectively. Then, ˆ J ˆ J 2 ρ ˆ J ˆ J 2 + c ˆz ˆz 2 + c D ˆD D 2

where c MJLΦ1 + LΦ2 and c D MJLΦ1D + LΦ2D. Given the global forward pass code error, we get

ˆ J ˆ J 2 qz ˆz ˆz 2 + q D D D 2 (q D + qzq) D D 2 (111)

where qz c 1 ρ, q D c D 1 ρ. Hence, we get

ˆ J J 2 ˆ J ˆ J 2 + ˆ J J 2 O( D D 2 + ˆδ J) (112)

where we denote ˆδ J := ˆ J J 2

A.4 Local backward pass proof details

In each update of the dictionary, we bound the gradient approximations as function of unrolling t (Theorem 4.9). This shows that gae-lasso t converges faster than gdec t and gae-ls t , and the latter is a biased estimator of ˆg. This is followed by Theorem 4.9 showing the order magnitude of the bounds is indeed tight.

Theorem 4.9 (Local convergence of gradients). Given the forward pass convergence results (Theorems 4.3 and 4.7), ρ < 1, B > 0 such that t > B, the errors of gradients defined in Algorithm 2 w.r.t ˆg (4) satisfy

gdec t ˆg 2 O(ρt)

gae-lasso t ˆg 2 O(tρ2t)

gae-ls t ˆg 2 O(tρ2t + MJλ s).

Moreover, the order of upper bounds is tight (see Lemma A.4).

Proof. For gdec t , with the infinite fresh samples, we have limn 1

n Pn i=1 2Lxi(zi t, D) = Ex X [ 2Lx(zt, D)] a.s. Based on Lemma 3.3, we get

gdec t ˆg 2 = Ex X [ 2Lx(zt, D)] Ex X [ 2Lx(ˆz, D)] 2 Ex X [ 2Lx(zt, D) 2Lx(ˆz, D) 2] Ex X [L2 zt ˆz 2] O(ρt). (113)

Similarly, for gae-lasso t and gae-ls t , we replace the sample mean for gradient computations with expectation in their limit. We re-write the gradient estimation error as following

gae-lasso t ˆg = Ex X [Q(ˆz, Jt)(zt ˆz)] + Ex X [Q21 t (ˆz)] + Ex X [Jt Qlasso-11 t (ˆz)]

gae-ls t ˆg = Ex X [Q(ˆz, Jt)(zt ˆz)] + Ex X [Q21 t (ˆz)] + Ex X [Jt Qls-11 t (ˆz)] (114)

Published in Transactions on Machine Learning Research (08/2022)

where Q21 t (z) 2Lx(zt, D) 2Lx(z, D) 2 21Lx(z, D)(zt z)

Qlasso-11 t (z) 1Lx(zt, D) + h(zt) 2 11Lx(z, D)(zt z)

Qls-11 t (z) 1Lx(zt, D) 2 11Lx(z, D)(zt z)

Q(z, J) J+ 2 11Lx(z, D) + 2 21Lx(z, D).

We provide bounds on the above in Lemma A.5. Hence, it suffices to bound the terms on the r.h.s as follows:

gae-lasso t ˆg 2 Ex X [L1 Jt ˆ J 2 zt ˆz 2 + (L21/2) zt ˆz 2 2 + MJ(L11/2) zt ˆz 2 2]. (116)

Using the convergence errors from the forward pass (Theorems 4.3 and 4.7),

gae-lasso t ˆg 2 L1O(tρ2t) + (L21/2 + MJ(L11/2)) O(ρ2t) = O(tρ2t). (117)

gae-ls t ˆg 2 Ex X [L1 Jt ˆ J 2 zt ˆz 2 + (L21/2) zt ˆz 2 2 + MJ((L11/2) zt ˆz 2 2 + h(ˆz) 2)]. (118)

Using the convergence errors from the forward pass (Theorems 4.3 and 4.7),

gae-ls t ˆg 2 L1O(tρ2t) + (L21/2 + MJL11/2) O(ρ2t) + MJ h(ˆz) 2 = O(tρ2t + MJλ s). (119)

Lemma A.4 (Tight local bound). The order magnitude of the upper bounds in Theorem 4.9 is tight.

Proof. It is sufficient to show that there exist an example such that its forward pass code and Jacobian convergences are O(ρt) and O(tρt), respectively. The following example confirms this. Without loss of generality, let z be 1-sparse and non-negative, D = D and Dj = 0 for j = i. The loss function is 1 2 D i z (i) Diz(i) 2 2 + λ|z(i)|. Given the support recovery after first iteration, the encoder forward pass implements zt+1,(i) = zt,(i) α(DT i (Dizt,(i) D i z (i)) + λ) = (1 α)zt,(i) + α(z (i) λ). Hence, the forward pass convergences are

zt,(i) = (1 α)tz0 +

k=1 α(1 α)t k(z (i) λ) = (1 α)tz0 + (1 (1 α)t)(z (i) λ)

zt,(i) ˆz(i) = ρt(z0 z (i) + λ) = O(ρt)

and Jt,(i) = Jt 1,(i) α(Jt 1,(i) + 2Dizt,(i) D i z (i)) = ρJt 1,(i) + O(ρt) + ˆ J(i)

Jt,(i) ˆ J(i) = ρt J0,(i) +

k=1 O(ρt) = O(tρt) (121)

where ρ = 1 α, ˆz(i) = z (i) λ, and ˆ J(i) = α(2Di ˆz(i) D i z (i))

Lemma A.5 (Local bounds). From local gradient errors in Theorem 4.9, the following are satisfied

Q21 t (ˆz) 2 (L21/2) zt ˆz 2 2,

Q(ˆz, Jt) 2 L1 Jt ˆ J 2,

Qlasso-11 t (ˆz) 2 (L11/2) zt ˆz 2 2 Qls-11 t (ˆz) 2 (L11/2) zt ˆz 2 2 + h(ˆz) 2. (122)

Proof. For Q21 t (ˆz), given convexity of 1Lx(z, D) and its domain (Assumption 3.1) and Lemma 3.3, we achieve the quadratic upper bound. For Qlasso-11 t (ˆz), we add and subtract 1Lx(ˆz, D), and then use quadratic

Published in Transactions on Machine Learning Research (08/2022)

upper bound. At line four, given Lemma A.3, we use 0 1Lx(ˆz, D) + h(ˆz) and assume that zt recovers the sign entries of ˆz.

Qlasso-11 t 2 = 1Lx(zt, D) + h(zt) 2 11Lx(ˆz, D)(zt ˆz)

= 1Lx(zt, D) 1Lx(ˆz, D) + 1Lx(ˆz, D) + h(zt) 2 11Lx(ˆz, D)(zt ˆz)

(L11/2) zt ˆz 2 2 + h(zt) + 1Lx(ˆz, D) 2 (L11/2) zt ˆz 2 2 + h(zt) h(ˆz) 2 (L11/2) zt ˆz 2 2.

Qls-11 t 2 = 1Lx(zt, D) 2 11Lx(ˆz, D)(zt ˆz)

= 1Lx(zt, D) 1Lx(ˆz, D) + 1Lx(ˆz, D) 2 11Lx(ˆz, D)(zt ˆz)

(L11/2) zt ˆz 2 2 + 1Lx(ˆz, D) 2 (L11/2) zt ˆz 2 2 + h(ˆz) 2.

For Q(ˆz, Jt), from implicit function theorem, Q(ˆz, ˆ J) = 0 under the support S of ˆz that is identified by zt. To prove this, consider the minimizer ˆz(D). We have 0 1f(ˆz, D), hence, we get 0 ˆ J(D) 2 11f(ˆz, D) + 2 21f(ˆz, D). Given the support recovery, the relation ˆ J(D)( 2 11Lx(ˆz, D) 1S ) + 2 21Lx(ˆz, D) 1S = 0 also holds which is equivalent to Q(ˆz, ˆ J) under the support. To show this, given the recursion zt+1 = Φ(zt, D), we differentiate it and get Jt+1 = 1Φ(zt, D)Jt + 2Φ(zt, D). Given the support recovery and fixed-point property, we can write

ˆ J = 1S ( ˆ J α 2 11Lx(ˆz, D)T ˆ J) + 1S ( α 2 21Lx(ˆz, D)T) ˆ J 1S ˆ J = ˆ Jα 2 11Lx(ˆz, D)T 1S α 2 21Lx(ˆz, D)T 1S

0 = ˆ J+( 2 11L(ˆz, D) 1S) + 2 21Lx(ˆz, D) 1S

If the term ( 2 11L(zt, D) 1S) is invertible, then we can write

ˆ J+ = 2 21Lx(ˆz, D) 1S( 2 11L(ˆz, D) 1S) 1 (126)

For the Jacobian corresponding to row i of the dictionary, we get

ˆ J(i,:) = (DT S DS) 1(Di,: ˆz T + (DT i,: ˆz xi)Ip)S (127)

on the support. Outside of the support S, the Jacobian is zero. This proof is similarly provided by Malézieux et al. (2022). Hence, we can use 2 21Lx(ˆz, D) = ˆ J+ 2 11Lx(ˆz, D) under the support S in the following

Q(ˆz, Jt) 2 = J+ t 2 11Lx(ˆz, D) + 2 21Lx(ˆz, D) 2 = J+ t 2 11Lx(ˆz, D) ˆ J+ 2 11Lx(ˆz, D) 2

(Jt ˆ J)+ 2 11Lx(ˆz, D) 2 L1 Jt ˆ J 2. (128)

A.5 Global backward pass proof details

We re-state and proof Theorem 4.10 as follows: Theorem 4.10 (Global error of gradients). Given the convergence results from the forward pass, (Theorems 4.6 and 4.8), the errors of gradients defined in Algorithm 2 w.r.t global direction g (defined in (5)) satisfy

gae-lasso g 2 O( D D 2 2 + D D 2 + ˆδ + ˆδ J + MJλ s)

gae-ls g 2 O( D D 2 2 + D D 2 + ˆδ + ˆδ J). (13)

Proof. For gdec t , we compute the gradient in their limit assuming infinite fresh samples limn 1

n Pn i=1 2Lxi(zi t, D) = Ex X [ 2Lx(zt, D)] a.s.. Similar to Theorem 4.9, we re-write the errors of gradients gae-lasso t and gae-ls t as following

gae-lasso t g = Ex X [Q(z , Jt)(zt z )] + Ex X [Q21 t (z )] + Ex X [Jt Qlasso-11 t (z )]

gae-ls t g = Ex X [Q(z , Jt)(zt z )] + Ex X [Q21 t (z )] + Ex X [Jt Qls-11 t (z )]. (129)

Published in Transactions on Machine Learning Research (08/2022)

where Q21 t (z), Qlasso-11 t (z), Qls-11 t (z), and Q(z, J) are defined as in Theorem 4.9. Given Assumption 4.1 and Lemma A.6, we find an upper bound on the r.h.s of the gradient errors as follows:

gae-lasso t g 2 Ex X [(L1 Jt J 2 + MJL11DL21D D D 2) zt z 2 + (L21/2) zt z 2 2]

+ Ex X [MJ(L11/2) zt z 2 2 + MJ h(zt) 2 + L1D D D 2]

Ex X [L1( Jt ˆ J 2 + ˆ J J 2 + MJL11DL21D D D 2)( zt ˆz 2 + ˆz z 2)]

+ Ex X [(L21/2)( zt ˆz 2 2 + ˆz z 2 2) + L1D D D 2]

+ Ex X [MJ(L11/2)( zt ˆz 2 2 + ˆz z 2 2) + MJ h(zt) 2] (130) Similarly,

gae-ls t g 2 Ex X [(L1 Jt J 2 + MJL11DL21D D D 2) zt z 2 + (L21/2) zt z 2 2]

+ Ex X [MJ(L11/2) zt z 2 2 + L1D D D 2]

Ex X [L1( Jt ˆ J 2 + ˆ J J 2 + MJL11DL21D D D 2)( zt ˆz 2 + ˆz z 2)]

+ Ex X [(L21/2)( zt ˆz 2 2 + ˆz z 2 2)]

+ Ex X [MJ(L11/2)( zt ˆz 2 2 + ˆz z 2 2) + L1D D D 2]. (131) Using the convergence errors from the forward pass (Theorems 4.3 and 4.7),

gae-lasso t g 2 L1O(tρ2t + ( D D 2 + ˆδ )tρt + ρt( D D 2 + ˆδ J)

+ L1O( D D 2 + ˆδ )( D D 2 + ˆδ J))

+ MJL11DL21D D D 2(ρt + ( D D 2 + ˆδ ))

+ (L21/2 + MJL11/2) O(ρt + D D 2 + ˆδ ) + O( D D 2) + MJ h(zt) 2)

gae-lasso g 2 O(( D D 2 + ˆδ J)( D D 2 + ˆδ + 1) + MJλ s)

= O( D D 2 2 + D D 2 + D D 2(ˆδ + ˆδ J) + ˆδ + ˆδ J + MJλ s) (133)

gae-ls g 2 O( D D 2 2 + D D 2 + D D 2(ˆδ + ˆδ J) + ˆδ + ˆδ J) (134)

Lemma A.6 (Global bounds). From global gradient errors in Theorem 4.10, the following are satisfied

Q21 t (z ) 2 (L21/2) zt z 2 2 Qlasso-11 t (z ) 2 (L11/2) zt z 2 2 + L1D D D 2 + h(zt) 2

Qls-11 t (z ) 2 (L11/2) zt z 2 2 + L1D D D 2 Q(z , Jt) 2 L1 Jt J 2 + MJL11DL21D D D 2.

Proof. For Q21 t (z ), we achieve the quadratic bound using convexity of 1Lx(z, D) and its domain (Assumption 3.1) and Lemma 3.3. For Qlasso-11 t (z ), we add and subtract 1Lx(z , D), and use quadratic upper bound similar to Lemma A.5. At line four, we use 0 1Lx(z , D ) (Lemma A.3) and assume that zt

Published in Transactions on Machine Learning Research (08/2022)

recovers the sign entries of z (see Theorem 4.1 and Theorem 4.2).

Qlasso-11 t (z ) 2 = 1Lx(zt, D) + h(zt) 2 11Lx(z , D)(zt z ) 2 = 1Lx(zt, D) 1Lx(z , D) + 1Lx(z , D) + h(zt) 2 11Lx(z , D)(zt z ) 2 (L11/2) zt z 2 2 + h(zt) + 1Lx(z , D) 2 (L11/2) zt z 2 2 + h(zt) + 1Lx(z , D) 1Lx(z , D ) 2 (L11/2) zt z 2 2 + L1D D D 2 + h(zt) 2.

Qls-11 t (z ) 2 = 1Lx(zt, D) 2 11Lx(z , D)(zt z ) 2 = 1Lx(zt, D) 1Lx(z , D) + 1Lx(z , D) 2 11Lx(z , D)(zt z ) 2 (L11/2) zt z 2 2 + 1Lx(z , D) 2 (L11/2) zt z 2 2 + 1Lx(z , D) 1Lx(z , D ) 2 (L11/2) zt z 2 2 + L1D D D 2. (137) For Q(z , Jt), from implicit function theorem, Q(z , J ) = 0 for D evaluated at D . Hence, we can use 2 21Lx(z , D ) = J + 2 11Lx(z , D ) in the following

Q(z , Jt) 2 = J+ t 2 11Lx(z , D) + 2 21Lx(z , D) 2 21Lx(z , D ) + 2 21Lx(z , D ) 2 = J+ t 2 11Lx(z , D) J + 2 11Lx(z , D ) 2 + 2 21Lx(z , D) 2 21Lx(z , D ) 2 = J+ t 2 11Lx(z , D) J+ t 2 11Lx(z , D ) 2 + J+ t 2 11Lx(z , D ) J + 2 11Lx(z , D ) 2 + L21D D D 2 = MJL11D D D 2 + J+ t 2 11Lx(z , D ) J + 2 11Lx(z , D ) 2 + L21D D D 2 (J+ t J +) 2 11Lx(z , D) 2 + MJL11DL21D D D 2 L1 Jt J 2 + MJL11DL21D D D 2. (138)

Theorem 4.11 (Dictionary learning with variable λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µl-incoherent and (δl, 2)-close to D with δl = O (1/ log p). If s = O( m), µ = O(log m), learning rate is η = O( p s(1 δ2 l /2)), and the regularizer and step size are set according to Theorem 4.4, then for any dictionary

update l using gdec T , with probability of at least 1 ϵ(l) supp-pres,

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 (15)

where ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

Proof. In this proof, we study gdec T,j, and for ease of notation we drop the superscript.

g(l) T,j = E[1z T,(j) =0z T,(j)(D(l)z T x)] = E[1z (j) =0z T,(j)(D(l)z T x)] + γ (139)

where γ = E[(1z T,(j) =0 1z (j) =0)z T,(j)(D(l)z T x)]. We have the event 1z T,(j) =0 1z (j) =0 = 0 happening

with probability of 1 ϵ(l) supp-pres, and ϵ(l) supp-pres decreases with decrease in δl. Hence, γ gets smaller. We write

g(l) T,j = E[1z (j) =0z T,(j)(D(l)z T x)] + γ (140)

where B(l) S is an diagonal matrix with β(l) j for j S as entries. For j / S, 1z (j) =0 = 0, which results in

g(l) T,j = 0. Hence, we only focus on j S where 1z (j) =0 = 1. We condition on the support and re-write the

Published in Transactions on Machine Learning Research (08/2022)

gradient as

g(l) T,j = E[z T,(j)(D(l)z T x)] + γ

= E[E[z T,(j)[D(l) S (I B(l) S )z (S) + D(l) S ζ(l) T,S D Sz (S)] | S] + γ

= E[D(l) S (I B(l) S )E[z T,(j)z (S) | S]] E[D SE[z T,(j)z (S) | S]] + E[D(l) S E[z T,(j)ζ(l) T,S] | S] + γ

= E[D(l) S (I B(l) S )E[(z (j)(1 β(l) j ) + ζ(l) T,j)z (S) | S]] E[D SE[(z (j)(1 β(l) j ) + ζ(l) T,j)z (S) | S]]

+ E[D(l) S E[(z (j)(1 β(l) j ) + ζ(l) T,j)ζ(l) T,S] | S] + γ

= E[D(l) j (1 β(l) j )2] E[D j (1 β(l) j )] + γ

+ E[D(l) S (I B(l) S )E[z (S)ζ(l) T,j | S]] E[D SE[z (S)ζ(l) T,j | S]] + E[D(l) S E[(z (j)(1 β(l) j ) + ζ(l) T,j)ζ(l) T,S | S]] (141) where in the last line, we use the fact that E[z (j) | j S] = 0 and E[z (S)z T (S) | S] = I. Computing the expectation, we get

g(l) T,j = pj D(l) j (1 β(l) j )2 pj D j (1 β(l) j ) + U (l) T,j + γ = pj(1 β(l) j ) (1 β(l) j )D(l) j D j + U (l) T,j + γ

(142) where U (l) T,j = E[D(l) S (I B(l) S )E[z (S)ζ(l) T,j | S]] E[D SE[z (S)ζ(l) T,j | S]]+E[D(l) S E[(z (j)(1 β(l) j )+ζ(l) T,j)ζ(l) T,S | S]].

Given this gradient, we now find a bound on U (l) T,j.

U (l) T,j = E[D(l) S (I B(l) S )E[z (S)ζ(l) T,j | S]] E[D SE[z (S)ζ(l) T,j | S]] + E[D(l) S E[(z (j)(1 β(l) j ) + ζ(l) T,j)ζ(l) T,S | S]] (143) First, we bound E[z (i)ζ(l) T,j | S]] as following.

E[z (i)ζ(l) T,j | S]]

t=1 α(1 α)T t E[2γ(l) j z (i) | S] + µl m

t=1 α(1 α)T t X

k =j E[Et 1,ksign(z (k) zt 1,(k))z (i) | S]

t=1 α(1 α)T t X

k =j E[Et 1,ksign(z (k) zt 1,(k))sign(zt,(i))z (i) | S] + κ(l) T,j

(144) where κ(l) T,j = E[z (i)κ(l) T,j]. Similar to κ(l) T,j, κ(l) T,j decay very fast as T increases. Hence, we bound the other terms. We have

E[γ(l) j z (i) | S]

( δl if j = i = 0 if j = i , (145)

E[Et 1,ksign(z (k) zt 1,(k))z (i) | S]

( Et 1,k if k = i = 0 if k = i , (146)

E[Et 1,ksign(z (k) zt 1,(k))sign(zt,(k))z (i) | S]

( Et 1,i if k = i = 0 if k = i (147)

k =j E[Et 1,ksign(z (k) zt 1,(k))z (i) | S]

( Et 1,i if j = i = 0 if j = i (148)

k =j E[Et 1,ksign(z (k) zt 1,(k))sign(zt,(k))z (i) | S]

( Et 1,i if j = i 0 if j = i (149)

Published in Transactions on Machine Learning Research (08/2022)

Hence, for j = i, we can write

E[z (i)ζ(l) T,j | S]] 2δl + 2 µl m

t=1 α(1 α)T t Et 1,i + κ(l) T,j (150)

where from (103), we have

Et 1,i(1 α)T t (1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 + 2(s 1)κl(1 α)T t ! (151)

Hence, given the sparsity level, the term below is bounded by aγ with probability of 1 ϵ(l) γ .

t=1 Et 1,i(1 α)T t

t=1 (1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 +

t=1 2(s 1)κl(1 α)T t !

(β(l) max|z max| + 2a(l) γ )(1 + sκl) = O(a(l) γ )

Finally, we get

E[z (i)ζ(l) T,j | S]]

( 2δl + µl m O(a(l) γ ) + κ(l) T,j if j = i

κ(l) T,j if j = i (153)

For appropriately large T, κ(l) T,j can be make small; Hence, in this case, we get

U (l) T,j 2 O( ppijδl D(l) 2) (154)

Now, we can re-write the gradient as

g(l) T,j = pj(1 β(l) j )(D(l) j D j ) + pj( β(l) j D(l) j + 1

pj U (l) T,j + 1

pj γ) = τ(D(l) j D j ) + θ (155)

where τ = pj(1 β(l) j ), and θ = pj( β(l) j D(l) j + 1

pj U (l) T,j + 1

pj γ). We can bound the norm of θ as follows:

θ 2 pjβ(l) j D(l) j 2 + U (l) T,j 2 + γ (156)

Given D(l) j 2 = 1, and β(l) j = D j D(l) j , D j = 1

2 D(l) j D j 2 2, we modify the upper bound

2pj D(l) j D j 2 2 + O( ppijδl D(l) 2) + γ (157)

We assume a dictionary closeness during training, i.e., D(l) D 2 2 D 2, which we prove in Lemma A.7. Given this closeness, we have

D(l) 2 D(l) D 2 + D 2 = O( r p

Moreover, with γ dropping with δl, and for s = O( m), it is reduced to

θ 2 pj D(l) j D j 2 (159)

We get g(l) T,j 2 pj(1 β(l) j ) D(l) j D j 2 + pj D(l) j D j 2

g(l) T,j 2 2 p2 j(2 β(l) j )2 D(l) j D j 2 2 (160)

Published in Transactions on Machine Learning Research (08/2022)

Using this bound, we can find a lower bound on the correlation between the gradient direction and the desired direction as follows

g(l) T,j 2 2 = (pj(1 β(l) j ))2 D(l) j D j 2 2 + θ 2 2 + 2pj(1 β(l) j ) θ, D(l) j D j

2 θ, D(l) j D j = pj(1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2 1

pj(1 β(l) j ) θ 2 2

2 g(l) T,j, D(l) j D j = pj(1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2 1

pj(1 β(l) j ) θ 2 2

(pj(1 β(l) j ) pj 1

1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2

Hence, using the descent property of Theorem 6 from (Arora et al., 2015), setting the learning rate to η = maxj 1 pj(1 β(l) j ), and ψ = η(pj(1 β(l) j ) pj 1 1 β(l) j ) 1 1 (1 β(l) j )2

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 (162)

Theorem 4.12 (Dictionary learning with fixed λt). Given Assumptions 3.3 and 3.4, suppose D(l) is µlincoherent and (δl, 2)-close to D with δl = O (1/ log p). If s = O( m), µ = O(log m), learning rate is η = O( p s(1 δ2 l /2)), and the regularizer λfixed and step size are set according to Theorem 4.5, then for any

dictionary update l using gdec T , with probability of at least 1 ϵ(l) supp-pres,

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 + ϵ(l) λ (16)

where ϵ(l) λ := η 2p s(1 β(l) j )λfixed2, ϵ(l) supp-pres := ϵ(l) supp-rec + ϵ(l) γ = 2p exp ( C2 min O (δ2 l )) + 2s exp ( 1 O(δl)).

Proof. Following steps similar to Theorem 4.11, we write the gradient as

g(l) T,j = pj D(l) j (1 β(l) j )2 pj D j (1 β(l) j ) + U (l) T,j + γ = pj(1 β(l) j ) (1 β(l) j )D(l) j D j + U (l) T,j + γ

(163) where U (l) T,j = E[D(l) S (I B(l) S )E[z (S)ζ(l) T,j | S]] E[D SE[z (S)ζ(l) T,j | S]]+E[D(l) S E[(z (j)(1 β(l) j )+ζ(l) T,j)ζ(l) T,S | S]].

Given this gradient, we now find a bound on U (l) T,j. First, we bound E[z (i)ζ(l) T,j | S]] as following.

E[z (i)ζ(l) T,j | S]]

t=1 α(1 α)T t E[γ(l) j z (i) | S] +

t=1 α(1 α)T t E[λfixedsign(zt 1,(j))z (i) | S]

t=1 α(1 α)T t X

k =j E[Et 1,ksign(z (k) zt 1,(k))z (i) | S] + κ(l) T,j

where we set all λ(l) t 1,j = λfixed and κ(l) T,j = E[z (i)κ(l) T,j]. Similar to κ(l) T,j, κ(l) T,j decay very fast as T increases. Hence, we bound the other terms. We have

E[γ(l) j z (i) | S]

( δl if j = i = 0 if j = i , (165)

E[λfixedsign(zt 1,(j))z (i) | S]

( = λfixed if j = i = 0 if j = i , (166)

E[Et 1,ksign(z (k) zt 1,(k))z (i) | S]

( Et 1,k if k = i = 0 if k = i (167)

Published in Transactions on Machine Learning Research (08/2022)

k =j E[Et 1,ksign(z (k) zt 1,(k))z (i) | S]

( Et 1,i if j = i = 0 if j = i (168)

Hence, for j = i, we can write

E[z (i)ζ(l) T,j | S]] δl + µl m

t=1 α(1 α)T t(E0,i + Et 1,i) + κ(l) T,j (169)

Et 1,i(1 α)T t (1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 + 2(s 1)κl(1 α)T t ! (170)

Hence, given the sparsity level, the term below is bounded by aγ with probability of 1 ϵ(l) γ .

t=1 Et 1,i(1 α)T t

t=1 (1 + 2(s 1)α µl m(t 1))E0,maxδα,T 2

+ (β(l) max|z max| + 2γ(l) max)

k=1 α(1 α)T k 1 +

t=1 2(s 1)κl(1 α)T t !

(β(l) max|z max| + 2a(l) γ )(1 + sκl) = O(a(l) γ )

Finally, we get

E[z (i)ζ(l) T,j | S]]

( δl + µl m O(a(l) γ ) + κ(l) T,j if j = i

λfixed + κ(l) T,j if j = i (172)

For appropriately large T, κ(l) T,j can be make small; Hence, in this case, we get

U (l) T,j 2 O( ppijδl D(l) 2 + pjλfixed) (173)

Now, we can re-write the gradient as

g(l) T,j = pj(1 β(l) j )(D(l) j D j ) + pj( β(l) j D(l) j + 1

pj U (l) T,j + 1

pj γ) = τ(D(l) j D j ) + θ (174)

where τ = pj(1 β(l) j ), and θ = pj( β(l) j D(l) j + 1

pj U (l) T,j + 1

pj γ). We can bound the norm of θ as follows:

θ 2 pjβ(l) j D(l) j 2 + U (l) T,j 2 + γ (175)

Given D(l) j 2 = 1, and β(l) j = D j D(l) j , D j = 1

2 D(l) j D j 2 2, we modify the upper bound

2pj D(l) j D j 2 2 + O( ppijδl D(l) 2 + pjλfixed) + γ (176)

We assume a dictionary closeness during training, i.e., D(l) D 2 2 D 2, which we prove in Lemma A.7. Given this closeness, we have

D(l) 2 D(l) D 2 + D 2 = O( r p

Moreover, with γ dropping with δl, and for s = O( m), it is reduced to

θ 2 pj( D(l) j D j 2 + λfixed) (178)

Published in Transactions on Machine Learning Research (08/2022)

We get g(l) T,j 2 pj(1 β(l) j ) D(l) j D j 2 + pj( D(l) j D j 2 + λfixed)

g(l) T,j 2 2 2p2 j(2 β(l) j )2 D(l) j D j 2 2 + 2p2 jλfixed2 (179)

Using this bound, we can find a lower bound on the correlation between the gradient direction and the desired direction as follows

g(l) T,j 2 2 = (pj(1 β(l) j ))2 D(l) j D j 2 2 + θ 2 2 + 2pj(1 β(l) j ) θ, D(l) j D j

2 θ, D(l) j D j = pj(1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2 1

pj(1 β(l) j ) θ 2 2

2 g(l) T,j, D(l) j D j = pj(1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2 1

pj(1 β(l) j ) θ 2 2

(pj(1 β(l) j ) 2pj 1

1 β(l) j ) D(l) j D j 2 2 + 1

pj(1 β(l) j ) g(l) T,j 2 2 2pj (1 β(l) j ) λfixed2

(180) Hence, using the descent property of Theorem 6 from (Arora et al., 2015), setting the learning rate to η = maxj 1 pj(1 β(l) j ), and ψ = η(pj(1 β(l) j ) 2pj 1 1 β(l) j )

D(l+1) j D j 2 2 (1 ψ) D(l) j D j 2 2 + ϵλ (181)

where ϵλ := η 2pj (1 β(l) j )λfixed2

Lemma A.7 (Dictionary maintains closeness). Suppose D(l) has (δl, 2)-closeness to D where δl = O (1/ log m), then with probability of 1 ϵ(l) supp-pres, we have D(l+1) D 2 2 D 2 when using gdec T and the network parameters set by Theorem 4.4.

Proof. Given the dictionary update

D(l+1) j D j = D(l) j D j ηg(l) T,j (182)

Then, with probability at least 1 ϵ(l) supp-pres, we have the gradient

g(l) T,j = pj(1 β(l) j )(D(l) j D j ) + pj( β(l) j D(l) j + 1

pj U (l) T,j + 1

pj γ) (183)

which we substitute in the dictionary update as below

D(l+1) j D j = (1 η(pj(1 β(l) j )))(D(l) j D j ) + ηpjβ(l) j D(l) j ηU (l) T,j ηγ) (184)

writing the update in matrix form

D(l+1) D = (D(l) D )diag(1 η(pj(1 β(l) j ))) + ηD(l)diag(pjβ(l) j ) ηD(l)F + ηD H ηγ) (185)

where F(ij) = pij E[(1 β(l) i )z (i)ζ(l) T,j | S] + pij E[(z (j)(1 β(l) j ) + ζ(l) T,j)ζ(l) T,i | S], and H(ij) = pij E[z (i)ζ(l) T,j | S].

Given, the bound U (l) T,j 2 O( ppijδl D(l) 2) from before, we get F F O(ppijδl) and H F O(ppijδl). Hence,

D(l)F + D H 2 D(l) 2 F F + D 2 H F = O(ppijδl D 2) = O( s2

p log m) D 2 (186)

Using the maintained closeness at update l, we bound the terms in the dictionary update one by one below

(D(l) D )diag(1 η(pj(1 β(l) j ))) (1 min j ηpj(1 β(l) j )) D(l) D 2 2(1 Ω(ηs/p)) D 2 (187)

Published in Transactions on Machine Learning Research (08/2022)

D(l)diag(pjβ(l) j ) 2 max j pj δ2 l 2 D(l) D + D 2 o(s/p) D 2 (188)

Given the bounds above, the dictionary update can be bounded as following

D(l+1) D 2 2(1 Ω(ηs/p)) D 2 + o(ηs/p) D 2 + O( ηs2

p log m) D 2 + ηγ 2 D 2 (189)

A.6 Interpretability

Theorem 5.1 (Interpretable unrolled network). Consider the dictionary learning optimization of the form min Z,D 1 2 X DZ 2 F + λ Z 1 + ω/2 D 2 F , where X = [x1, x2, . . . , xn] Rm n and Z = [z1, z2, . . . , zn] Rp n. Let Z be the given converged sparse codes, then stationary points of the problem w.r.t the network weights (dictionary) follows D = XG 1 ZT, where we denote G := ( ZT Z + ωI)la.

Proof. For all stationary points, the objective gradient is 0 with respect to the dictionary, i.e.,

0 = (X D Z) ZT + ω D (190)

where D is the learned dictionary at convergence. Re-aranging the terms, we get

D = X ZT( Z ZT + ωI) 1 (191)

Using the relation AT(AAT + ωI) 1 = (ATA + ωI) 1AT, we can re-write the solution as

D = XG 1 ZT (192)

where we denote G := ( ZT Z + ωI).

B Appendix - future works and limitations

Beyond dictionary learning Our results are founded on three main properties: Lipschitz differentiability of the loss, proximal gradient descent, and strong convexity in finite-iteration. The findings can be applied to other min-min optimization problems, e.g., ridge regression and logistic regression, following such properties. For example, our analysis generalizes to the unrolled network in (Tolooshams et al., 2020) for learning dictionaries using data from the natural exponential family. In this case, the least-squares loss is replaced with negative log-likelihood, and the dictionary models the data expectation.

Limitations Finite-iteration support selection (Proposition 4.1) (Hale et al., 2007) and strong convexity may seem stringent going beyond dictionary learning. Ablin et al. (2020) discuss generalization of local gradient convergence by relaxing strong convexity to the p-Łojasiewicz property (Ablin et al., 2020; Attouch & Bolte, 2009). We considered the noiseless setting and conjecture that the relative comparison of the gradients in the presence of noise still holds, where the upper bounds will involve an additional noise term. We focused on infinite sample convergence to highlight the relative differences between the gradients. We leave for future work the derivation of finite-sample bounds, a step similar to (Chatterji & Bartlett, 2017; Arora et al., 2015).

C Appendix - details of experiments

PUDLE is developed using Py Torch (Paszke et al., 2017). We used one Ge Force GTX 1080 Ti GPU.

C.1 Numerical experiments for theories

Published in Transactions on Machine Learning Research (08/2022)

0 50 100 Signal length

Figure 11: Example of code estimates with the initialized dictionary.

Dataset We generated n=10,000 samples following (2). We sampled D R50 100 from zero-mean Gaussian distribution, and normalized the columns. The codes are 5-sparse with their support uniformly chosen at random and their amplitudes are sampled from Uniform(1, 2).

Training We let T = 200, λ = 0.2, and α = 0.2. The dictionary is initialized to D = D + τBB with B N(0, 1

m I). For Figures 2, 3 and 4a, we set τB 0.55/log m. Figure 11 shows the sparse code estimates from one example given this initialized dictionary; this is to highlight that a) the initial dictionary is not very close to the groundtruth dictionary, and b) our algorithm is able to successfully perform dictionary learning and recover the support by the end of training in spite of a failed exact recovery of the support.

For Figures 4b and 4c, we chose much larger noise level, τB 2.8/log m. The network is trained for 600 epochs with full-batch gradient descent using Adam optimizer (Kingma & Ba, 2014) with learning rate of 10 3 and ϵ = 10 8. The learned dictionary is evaluated based on the error D D 2/ D 2. The results and conclusion were consistent across various realizations of the dataset and across various optimizers. Hence, in the main paper, the figures visualize results of one realization.

Noisy measurements We repeated the dictionary learning experiments shown in Figures 4b and 4c where the measurements x are corrupted by zero-mean Gaussian noise such that the SNR is approximately 12 d B. Accordingly, we set λ = 0.3. Figure 12 shows the results for both noisy and noiseless scenarios.

0 250 500 Number of epochs

||D D ||2 ||D ||2

(a) Noisy (T = 25).

0 250 500 Number of epochs

||D D ||2 ||D ||2

(b) Noiseless (T = 25).

0 250 500 Number of epochs

||D D ||2 ||D ||2

(c) Noisy (T = 100).

0 250 500 Number of epochs

||D D ||2 ||D ||2

gdec t gae lasso t gae ls t

(d) Noiseless (T = 100).

Figure 12: Dictionary learning in noisy and noiseless scenarios.

Stochastic Dictionary Learning In addition to the full-batch gradient descent results in the main paper, we repeated the experiments in Figure 4c using batch size of 4, 16 and 64. We observed (Figure 13) that in all scenarios PUDLE is able to learn a good estimate of the ground-truth dictionary, and gae-ls t is superior to the other two. We note that for lower batch-size, there will be more gradient updates in one epoch, hence, the algorithm converges in lower number of epochs.

5 10 Number of epochs

||D D ||2 ||D ||2

(a) Batch size 4 (T = 25).

10 20 Number of epochs

||D D ||2 ||D ||2

(b) Batch size 16 (T = 100).

0 20 40 Number of epochs

||D D ||2 ||D ||2

gdec t gae lasso t gae ls t

(c) Batch size 64 (T = 100).

Figure 13: Dictionary learning using various batch sizes.

Effect of learning rate We performed the experiments in Figure 4c for various learning rate of 10 4, 10 3, and 10 2. This shows the robustness of the gradient-based dictionary learning against learning rate. Figure 14

Published in Transactions on Machine Learning Research (08/2022)

demonstrates such results where PUDLE successfully converges to the neighbourhood of the ground-truth dictionary; Regardless of the learning rate, gae-ls t converges to a closer neighbourhood than the other two gradients. Overall, smaller the learning rate, more epochs is needed to reach convergence.

0 100 Number of epochs

||D D ||2 ||D ||2

(a) Learning rate 10 2.

0 250 500 Number of epochs

||D D ||2 ||D ||2

gdec t gae lasso t gae ls t

(b) Learning rate 10 3.

0 2000 4000 Number of epochs

||D D ||2 ||D ||2

(c) Learning rate 10 4.

Figure 14: Dictionary learning for various learning rates when T = 100.

Dictionary initialization We conducted similar experiments to Figures 4b and 4c. We let n = 10,000, m = 100, and p = 100. We generated an orthogonal D . The sparse codes z are 5-sparse and their amplitudes are drawn from sub-Gaussian N(0, 1). We set λ = 0.05, and α = 0.2. We used the pairwise method proposed by Arora et al. (2015) to initialize the dictionary. This close initialization resulted in a dictionary that provides support recovery prior training. Figure 15 shows successful dictionary learning where gae-ls t converges to a closer neighbourhood of D than the other two gradients. We used linear sum assignment optimization (i.e., scipy.optimize.linear_sum_assignment) to find the correct column permutations before computing the dictionary distance error.

0 50 100 150 200 250 Number of epochs

||D D ||2 ||D ||2

(a) T = 25.

0 50 100 150 200 250 Number of epochs

||D D ||2 ||D ||2

gdec t gae lasso t gae ls t

(b) T = 100.

Figure 15: Dictionary learning when D is initialized using the pairwise method Arora et al. (2015).

C.2 Dictionary learning

Dataset We generated n=50,000 samples following (2). We let m=1000 and p=1500, and sample D from zero-mean Gaussian distribution, and then normalized the columns. The sparse codes zi are 10, 20, 40-sparse, where their supports are chosen uniformly at random and amplitudes are sampled from Uniform(1, 2).

Training The dictionary is initialized to D = D + τBB with B N(0, 1

m I) where τB 1/log m. We let λ = 0.2, and α = 0.2, and T = 100. The network is trained for 1,000 iterative updates with batch-size of 50 using Adam (Kingma & Ba, 2014) with learning rate of 10 3 and ϵ = 10 3. For decay method, ν is decreased in value by 0.005 every 100 update iterations. Each filter is normalized after every update. The learned dictionary is evaluated based on the relative error D D 2/ D 2.

Published in Transactions on Machine Learning Research (08/2022)

0 250 500 750 Number of iterations

||D D ||2 ||D ||2

gae ls t gae ls,decay t gae ls,HT t NOODL SPORCO

(a) 10-sparse code.

0 250 500 750 Number of iterations

||D D ||2 ||D ||2

gae ls t gae ls,decay t gae ls,HT t NOODL SPORCO

(b) 20-sparse code.

0 250 500 750 Number of iterations

||D D ||2 ||D ||2

gae ls t gae ls,decay t gae ls,HT t NOODL SPORCO

(c) 40-sparse code.

Figure 16: Dictionary learning convergence using gae-ls t compared to NOODL and SPORCO.

C.3 Image denoising

Training We trained PUDLE where the dictionary is convolutional with 64 filters of size 9 9 and strides of 4. The encoder unrolls for T = 15, and the step size is set to α = 0.1. Unlike the theoretical analysis where full-batch gradient descent is studied, the network is trained stochastically with Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10 4 and ϵ = 10 3 for 250 epochs. At every training iteration, a random 129 129 patch is cropped and a zero-mean Gaussian noise with a standard deviation of 25 is added. We utilize random horizontal and vertical flip for augmentation. We report results in terms of the peak signal-to-noise ratio (PSNR). The standard deviation of the test PSNR across multiple noise realizations was lower than 0.02 d B for all the methods. Hence, we only reported the mean PSNR of the test set.

C.4 Interpretable sparse coding and dictionary learning

We focused on digits of {0, 1, 2, 3, 4} of MNIST. We set T = 15, λ = 0.7, and α = 1. The dictionary dimensions are m = 784 and p = 500. We trained the network for 200 epochs using Adam optimizer with a learning rate of 10 4 and batch size of 32. For construction of G, ω is set to 0.001. For Figure 8, we computed the image contributions using 6,000 randomly chosen training images. The Gram matrix used in Figure 9, is constructed by 6,000 training examples, and the reconstruction is from the 200 most contributed training images.