# online_convolutional_sparse_coding_with_sampledependent_dictionary__e340f80a.pdf

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Yaqing Wang 1 Quanming Yao 1 2 James T. Kwok 1 Lionel M. Ni 3

Convolutional sparse coding (CSC) has been popularly used for the learning of shift-invariant dictionaries in image and signal processing. However, existing methods have limited scalability. In this paper, instead of convolving with a dictionary shared by all samples, we propose the use of a sample-dependent dictionary in which each ﬁlter is a linear combination of a small set of base ﬁlters learned from data. This added ﬂexibility allows a large number of sample-dependent patterns to be captured, which is especially useful in the handling of large or high-dimensional data sets. Computationally, the resultant model can be efﬁciently learned by online learning. Extensive experimental results on a number of data sets show that the proposed method outperforms existing CSC algorithms with signiﬁcantly reduced time and space complexities.

1. Introduction

Convolutional sparse coding (CSC) (Zeiler et al., 2010) has been successfully used in image processing (Gu et al., 2015; Heide et al., 2015), signal processing (Cogliati et al., 2016), and biomedical applications (Pachitariu et al., 2013; Andilla & Hamprecht, 2014; Chang et al., 2017; Jas et al., 2017; Peter et al., 2017). It is closely related to sparse coding (Aharon et al., 2006), but CSC is advantageous in that its shift-invariant dictionary can capture shifted local patterns common in signals and images. Each data sample is then represented by the sum of a set of ﬁlters from the dictionary convolved with the corresponding codes.

Traditional CSC algorithms operate in the batch mode

1Department of Computer Science and Engineering, Hong Kong University of Science and Technology University, Hong Kong 24Paradigm Inc, Beijing, China. 3Department of Computer and Information Science, University of Macau, Macau. Correspondence to: jamesk@cse.ust.hk <James T. Kwok>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

(Kavukcuoglu et al., 2010; Zeiler et al., 2010; Bristow et al., 2013; Heide et al., 2015; ˇSorel & ˇSroubek, 2016; Wohlberg, 2016; Papyan et al., 2017), which take O(NK2P + NKP log P) time and O(NKP) space (where N is the number of samples, K is the number of ﬁlters, and P is the dimensionality. Recently, a number of online CSC algorithms have been proposed for better scalability (Degraux et al., 2017; Liu et al., 2017; Wang et al., 2018). As data samples arrive, relevant information is compressed into small history statistics, and the model is incrementally updated. In particular, the state-of-the-art OCSC algorithm (Wang et al., 2018) has the much smaller time and space complexities of O(K2P + KP log P) and O(K2P), respectively.

However, the complexities of OCSC still depend quadratically on K, and cannot be used with a large number of ﬁlters. The number of local patterns that can be captured is thus limited, and may lead to inferior performance especially on higher-dimensional data sets. Besides, the use of more ﬁlters also leads to a larger number of expensive convolution operations. Rigamonti et al. (2013) and Sironi et al. (2015) proposed to post-process the learned ﬁlters by approximating them with separable ﬁlters, making the convolutions less expensive. However, as learning and post-processing are then two independent procedures, the resultant ﬁlters may not be optimal. Moreover, these separate ﬁlters cannot be updated online with the arrival of new samples.

Another direction to scale up CSC is via distributed computation (Bertsekas & Tsitsiklis, 1997). By distributing the data and workload onto multiple machines, the recent consensus CSC algorithm (Choudhury et al., 2017) can handle large, higher-dimensional data sets such as videos, multispectral images and light ﬁeld images. However, the heavy computational demands of the CSC problem are only shared over the computing platform, but not fundamentally reduced.

In this paper, we propose to approximate the possibly large number of ﬁlters by a sample-dependent combination of a small set of base ﬁlters learned from the data. While the standard CSC dictionary is shared by all samples, we propose each sample to have its own personal dictionary to compensate for the reduced ﬂexibility of using these

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

base ﬁlters. In this way, the representation power can remain the same but with a reduced number of parameters. Computationally, this structure also allows efﬁcient online learning algorithms to be developed. Speciﬁcally, the base ﬁlter can be updated by the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), while the codes and combination weights can be learned by accelerated proximal algorithms (Yao et al., 2017). Extensive experimental results on a variety of data sets show that the proposed algorithm is more efﬁcient in both time and space, and outperforms existing batch, online and distributed CSC algorithms.

The rest of the paper is organized as follows. Section 2 brieﬂy reviews convolutional sparse coding. Section 3 describes the proposed algorithm. Experimental results are presented in Section 4, and the last section gives some concluding remarks.

Notations: For a vector a, its ith element is a(i), 1-norm is kak1 = P

i |a(i)| and 2-norm is kak2 =

i a2(i). The convolution of two vectors a and b is denoted a b. For a matrix A, A? is its complex conjugate, and A = (A>)? its conjugate transpose. The Hadamard product of two matrices A and B is (A B)(i, j) = A(i, j)B(i, j). The identity matrix is denoted I. F(x) is the Fourier transform that maps x from the spatial domain to the frequency domain, while F 1(x) is the inverse operator which maps F(x) back to x.

2. Review: Convolutional Sparse Coding

Given samples {x1, . . . , x N} in RP , CSC learns a shiftinvariant dictionary D 2 RM K, with the columns {D(: , k)} representing the K ﬁlters. Each sample xi is encoded as Zi 2 RP K, with the kth column being the code convolved with ﬁlter D(:, k). The dictionary and codes are learned together by solving the optimization problem:

min D2D,{Zi}

D(:, k) Zi(:, k)

(1) where the ﬁrst term measures the signal reconstruction error, D = {D : k D(:, k)k2 1, 8k = 1, . . . , K} ensures that the ﬁlters are normalized, and β > 0 is a regularization parameter controlling the sparsity of Zi s.

Convolution in (1) is performed in the spatial domain. This takes O(KPM) time, and is expensive. In contrast, recent CSC methods perform convolution in the frequency domain, which takes O(KP log P) time (Mallat, 1999) and is faster for typical choices of M and P. Let xi F(xi),

D(:, k) F(D(:, k)), and Zi(:, k) F(Zi(:, k)) be the Fourier-transformed counterparts of xi, D(:, k) and Zi(:, k). The codes and ﬁlters are updated in an alternating manner by block coordinate descent, as:

Update Codes: Given D, each Zi is independently obtained as

D(:, k) Zi(:, k)

+ βk Uik1 (2)

s.t. F(Ui(:, k)) = Zi(:, k), k = 1, . . . , K,

where Ui is an auxiliary variable.

Update Dictionary: D is updated by solving

D(:, k) Zi(:, k)

s.t. F(V (:, k)) = D(:, k), k = 1, . . . , K,

k C(V (:, k))k2

2 1, k = 1, . . . , K,

where V is an auxiliary variable, and C( ) crops the extra P M dimensions in V (:, k).

Both (2) and (3) can be solved by the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). Subsequently, { Zi} and D can be transformed back to the spatial domain as Zi(:, k) = F 1( Zi(:, k)) and D(: , k) = C(F 1( D(:, k))). Note that while Zi s (in the spatial domain) are sparse, the FFT-transformed Zi s are not.

On inference, given the learned dictionary D, the testing sample xj is reconstructed as PK

k=1 D(:, k) Zj(:, k), where Zj is the obtained code.

2.1. Post-Processing for Separable Filters

Filters obtained by CSC are non-separable and subsequent convolutions may be slow. To speed this up, they can be postprocessed and approximated by separable ﬁlters (Rigamonti et al., 2013; Sironi et al., 2015). Speciﬁcally, the learned D is approximated by SW, where S 2 RM R contains R rank-1 base ﬁlters {S(:, 1), . . . , S(:, R)}, and W 2 RR K

contains the combination weights. However, this often leads to performance deterioration.

2.2. Online CSC

An online CSC algorithm (OCSC) is recently proposed in (Wang et al., 2018). Given the Fourier-transformed sample xt and dictionary Dt 1 from the last iteration, the corresponding { Zt, Ut} are obtained as in (2). The following Proposition updates Dt and Vt by reformulating (3) for use with smaller history statistics.

Proposition 1 ((Wang et al., 2018)). Dt, Vt can be obtained

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

by solving the optimization problem:

D(p, :)Ht(:, :, p) D (:, p)

2 D(p, :)Gt(:, p) (4)

s.t. F(V (:, k)) = D(:, k), k = 1, . . . , K,

k C(V (:, k))k2

2 1, k = 1, . . . , K,

where Ht(:, :, p)= 1

i (:, p) Z?

i (p, :)2RK K, and Gt(:, p)= 1

i (:, p)2RK.

Problem (4) can be solved by ADMM. The total space complexity is only O(K2P), which is independent of N. Moreover, Ht and Gt can be updated incrementally.

Two other online CSC reformulations have also been proposed recently. Degraux et al. (2017) performs convolution in the spatial domain, and is slow. Liu et al. (2017) performs convolution in the frequency domain, but requires expensive huge sparse matrix operations.

3. Online CSC with Sample-Dependent

Though OCSC scales well with sample size N, its space complexity still depends on K quadratically. This limits the number of ﬁlters that can be used and can impact performance. Motivated by the ideas of separable ﬁlters in Section 2.1, we enable learning with more ﬁlters by approximating the K ﬁlters with R base ﬁlters, where R K. In contrast to the separable ﬁlters, which are obtained by post-processing and may not be optimal, we propose to learn the dictionary directly during signal reconstruction. Moreover, ﬁlters in the dictionary are combined from the base ﬁlters in a sample-dependent manner.

3.1. Problem Formulation

Recall that each xi in (1) is represented by PK

k=1 D(:, k) Zi(:, k). Let B 2 RM R, with columns {B(:, r)} being the base ﬁlters. We propose to represent xi as:

Wi(r, k)B(:, r)

Zi(:, k), (5)

where Wi 2 RR K is the matrix for the ﬁlter combination weights. In other words, each D(:, k) in (1) is replaced by PR

r=1 Wi(r, k)B(:, r), or equivalently,

Di = BWi, (6)

which is sample-dependent. As will be seen, this allows the Wi s to be learned independently (Section 3.3). This also leads to more sample-dependent patterns being captured and thus better performance (Section 4.4).

Sample-dependent ﬁlters have been recently studied in convolutional neural networks (CNN) (Jia et al., 2016). Empirically, this outperforms standard CNNs in one-shot learning (Bertinetto et al., 2016), video prediction (Jia et al., 2016) and image deblurring (Kang et al., 2017). Jia et al. (2016) uses a specially designed neural network to learn the ﬁlters, and does not consider the CSC model. In contrast, the sample-dependent ﬁlters here are integrated into CSC.

The dictionary can also be adapted to individual samples by ﬁne-tuning (Donahue et al., 2014). However, learning the initial shared dictionary is still expensive when K is large. Besides, as will be shown in Section 4.2, the proposed method outperforms ﬁne-tuning empirically.

3.2. Learning

Plugging (6) into the CSC formulation in (1), we obtain

min B,{Wi,Zi}

fi(B, Wi, Zi) + βk Zik1 (7)

s.t. BWi 2 D, i = 1, . . . , N, (8)

fi(B, Wi, Zi) 1

Wi(r, k)B(:, r)

and B {B : k B(:, r)k2 1, 8 r = 1, . . . , R}. As B and Wi are coupled together in (8), this makes the optimization problem difﬁcult. The following Proposition decouples B and Wi. All the proofs are in the Appendix.

Proposition 2. For B 2 B, we have BWi 2 D if (i) Wi 2 W 1 {W : k Wi(:, k)k1 1, k = 1, . . . , K}, or (ii) Wi 2 W 2 {W : k Wi(:, k)k2 1/

R, k = 1, . . . , K}.

To simplify notations, we use W to denote W 1 or W 2. By imposing either one of the above structures on {Wi}, we have the following optimization problem:

min B,{Wi,Zi}

fi(B, Wi, Zi) + βk Zik1 (10)

s.t. B 2 B, and Wi 2 W, i = 1, . . . , N.

On inference with sample xj, the corresponding (Wj, Zj) can be obtained by solving (10) with the learned B ﬁxed.

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

3.3. Online Learning Algorithm for (10)

As in Section 2.2, we propose an online algorithm for better scalability. At the tth iteration, consider

min B,{Wi,Zi}

fi(B, Wi, Zi) + βk Zik1 (11)

s.t. B 2 B, and Wi 2 W, i = 1, . . . , N.

Let B(:, r) F(B(:, r)), where B(:, r) 2 RM is zeropadded to be P-dimensional. Note that the number of convolutions can be reduced from K to R by rewriting the summation above as PR

r=1 B(:, r) (PK

k=1 Wi(r, k)Zi(: , k)). The following Proposition rewrites (11) and performs convolutions in the frequency domain. Proposition 3. Problem (11) can be rewritten as

min B,{Wi,Zi}

fi( B, Wi, Zi)+βk Zik1 (12)

s.t. k C(F 1( B(:, r)))k2 1, r = 1, . . . , R,

Wi 2 W, i = 1, . . . , N,

where fi( B, Wi, Zi) 1 2P k xi PR

r=1 B(:, r) Yi(:, r)k2

2, and Yi(:, r) F(Zi W >

The spatial-domain base ﬁlters can be recovered from B as B(:, r) = C(F 1( B(:, r))).

3.3.1. OBTAINING Bt

From (12), Bt can be obtained by solving the subproblem:

B(:, r) Yi(:, r)

2 s.t. F( V (:, r)) = B(:, r), r = 1, . . . , R,

k C( V (:, r))k2 1, r = 1, . . . , R,

where V is an auxiliary variable. This is of the same form as (3). Hence, analogous to (4), Bt can be obtained as:

B(p, :) Ht(:, :, p) B (:, p)

2 B(p, :) Gt(:, p) (13)

s.t. F( V (:, r)) = B(:, r), r = 1, . . . , R, k C( V (:, r))k2 1, r = 1, . . . , R,

where Ht(:, :, p)=1

i (:, p) Y ?

i (p, :) 2 RR R, and Gt(:, p) = 1 t

i (:, p) 2 RR. They can be incrementally updated as

Ht(:, :, p) = t 1

Ht 1(:, :, p)+ 1

t (:, p) Y ?

t (p, :), (14)

Gt(:, p) = t 1

Gt 1(:, p) + 1

t (:, p). (15)

Problem (13) can then be solved using ADMM as in (4).

3.3.2. OBTAINING Wt AND Zt

With the arrival of xt, we ﬁx the base ﬁlters to Bt 1 learned at the last iteration, and obtain (Wt, Zt) from (12) as:

W,Z F(W, Z) ft( Bt 1, W, Z)+IW(W)+βk Zk1, (16)

where IW(W) is the indicator function on W (i.e., IW(W) = 0 if W 2 W and 1 otherwise).

As in the CSC literature, it can be shown that ADMM can also be used to solve (16). While CSC s code update subproblem in (2) is convex, problem (16) is nonconvex and existing convergence results for ADMM (Wang et al., 2015) do not apply.

In this paper, we will instead use the nonconvex and inexact accelerated proximal gradient (ni APG) algorithm (Yao et al., 2017). This is a recent proximal algorithm for nonconvex problems. As the regularizers on W and Z in (16) are independent, the proximal step w.r.t. the two blocks can be performed separately as: (prox IW( )(W), proxβk k1(Z)) (Parikh & Boyd, 2014). As shown in (Parikh & Boyd, 2014), these individual proximal steps can be easily computed (for W = W 1 or W 2).

3.3.3. COMPLETE ALGORITHM

The whole procedure, which will be called Sampledependent Convolutional Sparse Coding (SCSC) , is shown in Algorithm 1. Its space complexity, which is dominated by Ht and Gt, is O(R2P). Its per-iteration time complexity is O(RKP + RP log P), where the O(RKP) term is due to gradient computation, and O(RP log P) is due to FFT/inverse FFT. Table 1 compares its complexities with those of the other online and distributed CSC algorithms. As can be seen, SCSC has much lower time and space complexities as R K.

4. Experiments

Experiments are performed on a number of data sets (Table 2). Fruit and City are two small image data sets that have been commonly used in the CSC literature (Zeiler et al., 2010; Bristow et al., 2013; Heide et al., 2015; Papyan et al., 2017). We use the default training and testing splits provided in (Bristow et al., 2013). The images are preprocessed as in (Zeiler et al., 2010; Heide et al., 2015; Wang et al., 2018), which includes conversion to grayscale, feature standardization, local contrast normalization and edge tapering. These two data sets are small. In some experiments, we will also use two larger data sets, CIFAR10 (Krizhevsky & Hinton, 2009) and Flower (Nilsback &

Zisserman, 2008). Following (Heide et al., 2015; Choudhury et al., 2017; Papyan et al., 2017; Wang et al., 2018), we set the ﬁlter size M as 11 11, and the regularization parameter

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Table 1. Comparing the proposed SCSC algorithm with other scalable CSC algorithms on per iteration cost. For CCSC, the cost is

measured per machine, and L is the number of machines in the distributed system. Usually, N K R, and P M.

space code update time ﬁlter update time OCSC (Wang et al., 2018) O(K2P) O(KP + KP log P) O(K2P + KP log P) OCDL-Degraux (Degraux et al., 2017) O(K2M 2 + KPM) O(K2P 3) O(K2PM 2 + KPM) OCDL-Liu (Liu et al., 2017) O(K2P) O(KP + KP log P) O(K2P + KP log P) CCSC (Choudhury et al., 2017) O( NKP

L + KP) O( NKP

L + NKP log( P

L )) O( NK2P

L )) SCSC O(R2P) O(RKP + RP log P) O(R2P + RP log P)

Algorithm 1 Sample-dependent CSC (SCSC).

1: Initialize W0 2 W, B0 2 B, H0 = 0, G0 = 0; 2: for t = 1, 2, . . . , T do 3: draw xt from {xi}; 4: xt = F(xt); 5: obtain Wt, Zt using ni APG; 6: for r = 1, 2, . . . , R do 7: Yt(:, r) = F(Zt W >

t (:, r)); 8: end for 9: update { Ht(:, :, 1), . . . , Ht(:, :, P)} using (14); 10: update { Gt(:, 1), . . . , Gt(:, P)} using (15); 11: update Bt by (13) using ADMM; 12: end for 13: for r = 1, 2, . . . , R do 14: BT (:, r) = C(F 1( BT (:, r))); 15: end for output BT .

β in (1) as 1.

Table 2. Summary of the image data sets used.

size #training #testing Fruit 100 100 10 4 City 100 100 10 4 CIFAR-10 32 32 50,000 10,000 Flower 500 500 2,040 6,149

To evaluate efﬁcacy of the learned dictionary, we will mainly consider the task of image reconstruction as in (Aharon et al., 2006; Heide et al., 2015; Sironi et al., 2015). The reconstructed image quality is evaluated by the testing peak signal-to-noise ratio (Papyan et al., 2017): PSNR = 1 | |

xj2 20 log10

P kˆxj xjk2

, where ˆxj is the reconstruction of xj from test set . The experiment is repeated ﬁve times with different dictionary initializations.

4.1. Choice of W : W 1 versus W 2

First, we study the choice of W in Proposition 2. We compare SCSC-L1, which uses W = W 1, with SCSCL2, which uses W = W 2. Experiments are performed on Fruit and City. As in (Heide et al., 2015; Papyan et al., 2017; Wang et al., 2018), the number of ﬁlters K is set to

100. Recall the space complexity results in Table 1, we

deﬁne the compression ratio of SCSC relative to OCSC (using the same K) as CR = (K/R)2. We vary R in {K/20, K/10, K/9, . . . , K/2, K}. The corresponding CR is {400, 100, 81, . . . , 1}.

Results are shown in Figure 1. As can be seen, SCSC-L1 is much inferior. Figure 2(a) shows the weight Wj obtained with K = 100 and R = 10 by SCSC-L1 on a test sample xj from City (results on the other data sets are similar). As can be seen, most of its entries are zero because of the sparsity induced by the 1 norm. The expressive power is severely limited as typically only one base ﬁlter is used to approximate the original ﬁlter. On the other hand, the Wj learned by SCSC-L2 is dense and has more nonzero entries (Figure 2(b)). In the sequel, we will only focus on SCSC-L2, which will be simply denoted as SCSC.

Figure 1. Testing PSNR s of SCSC-L1 and SCSC-L2 at different

R s on the Fruit and City data sets.

(a) SCSC-L1.

(b) SCSC-L2.

Figure 2. Weight matrices obtained on a test sample from City.

Each column corresponds to an original ﬁlter.

4.2. Sample-Dependent Dictionary

In this experiment, we set K = 100, and compare SCSC with the following algorithms that use sample-independent dictionaries: (i) SCSC (shared): This is a SCSC variant in which all Wi s in (5) are the same. Its optimization is based on alternating minimization. (ii) Separable ﬁlters learned by tensor decomposition (SEP-TD) (Sironi et al., 2015), which

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

is based on post-processing the (shared) dictionary learned by OCSC as reviewed in Section 2.1; (iii) OCSC (Wang et al., 2018): the state-of-the-art online CSC algorithm.

Results are shown in Figure 3. As can be seen, SCSC always outperforms SCSC(shared) and SEP-TD, and outperforms OCSC when R = 10 (corresponding to CR = 100) or above. This demonstrates the advantage of using a sampledependent dictionary.

Figure 3. Testing PSNR vs R for CSC algorithms using shared and

sample-dependent dictionaries.

Next, we compare against OCSC with ﬁne-tuned ﬁlters, which are also sample-dependent. Speciﬁcally, given test sample xj, we ﬁrst obtain its code Zj from (2) with the learned dictionary D, and then ﬁne-tune D by solving (3) using the newly computed Zj. As in (Donahue et al., 2014), this is repeated for a few iterations.1 We set OCSC s K to be equal to SCSC s R, so that the two methods take the same space (Table 1). The K used in SCSC is still 100. Results are shown in Figure 4. As can be seen, though ﬁne-tuning improves the performance of OCSC slightly, this approach of generating sample-dependent ﬁlters is still much worse than SCSC.

Figure 4. Comparison of SCSC and OCSC with ﬁne-tuning.

4.3. Learning with More Filters

Recall that SCSC allows the use of more ﬁlters (i.e., a larger K) because of its lower time and space complexities. In this Section, we demonstrate that this can lead to better performance. We compare SCSC with two most recent batch and online CSC methods, namely, slice-based CSC

1In the experiments, we stop after ﬁve iterations.

(SBCSC) (Papyan et al., 2017) and OCSC. For SCSC, we set R = 10 for Fruit and City, and R = 30 for CIFAR-10 and Flower.

Figure 5 shows the testing PSNR s at different K s. As can be seen, a larger K consistently leads to better performance for all methods. SCSC allows the use of a larger K because of its much smaller memory footprint. For example, on CIFAR-10, CR = 1024 at K = 800; on Flower, CR = 1600 at K = 400.

(c) CIFAR-10.

(d) Flower.

Figure 5. Effect of K on the testing PSNR. Note that SBCSC

cannot be run on CIFAR-10 and Flower, which are large. For OCSC, it can only run up to K = 50 on Flower.

4.4. Comparison with the State-of-the-Art

First, we perform experiments on the two smaller data sets of Fruit and City, with K = 100. We set R = 10 (i.e., CR = 100) for SCSC. This is compared with the batch CSC algorithms, including (i) deconvolution network (Deconv Net) (Zeiler et al., 2010), (ii) fast CSC (FCSC) (Bristow et al., 2013), (iii) fast and ﬂexible CSC (FFCSC) (Heide et al., 2015), (iv) convolutional basis pursuit denoising (CBPDN) (Wohlberg, 2016), (v) the CONSENSUS algorithm (ˇSorel & ˇSroubek, 2016), and (vi) slice-based CSC (SBCSC) (Papyan et al., 2017). We also compare with the online CSC algorithms, including (vii) OCSC (Wang et al., 2018), (viii) OCDL-Degraux (Degraux et al., 2017), and (ix) OCDL-Liu (Liu et al., 2017).

Figure 6 shows convergence of the testing PSNR with clock time. As also demonstrated in (Degraux et al., 2017; Liu et al., 2017; Wang et al., 2018), online CSC methods converge faster and have better PSNR than batch CSC methods. Among the online methods, SCSC has comparable PSNR as OCSC, but is faster and requires much less storage (CR = 100).

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Figure 6. Testing PSNR on the small data sets.

Next, we perform experiments on the two large data sets, CIFAR-10 and Flower. All the batch CSC algorithms and two online CSC algorithms, OCDL-Degraux and OCDLLiu, cannot handle such large data sets. Hence, we will only compare SCSC with OCSC. On CIFAR-10, we set K = 300, and the corresponding CR for SCSC is 100. On Flower, K is still 300 for SCSC. However, OCSC can only

use K = 50 because of its much larger memory footprint. Figure 7 shows convergence of the testing PSNR. In both cases, SCSC signiﬁcantly outperforms OCSC.

(a) CIFAR-10.

(b) Flower.

Figure 7. Testing PSNR on the large data sets.

4.5. Higher-Dimensional Data

In this section, we perform experiments on data sets with dimensionalities larger than two. To alleviate the large memory problem, Choudhury et al. (2017) proposed the use of distributed algorithms. Here, we show that SCSC can effectively handle these data sets using one single machine.

Experiments are performed on three data sets (Table 3) in (Choudhury et al., 2017). The Video data set contains image subsequences recorded in an airport (Li et al., 2004). The length of each video is 7, and each image frame is of size 100 100. The Multispectral data contains 60 60 patches from multispectral images (covering 31 wavelengths) of real-world objects and materials (Yasuma et al., 2010). The Light ﬁeld data contains 60 60 patches of light ﬁeld images on objects and scenes (Kalantari et al., 2016). For each pixel, the light rays are from 8 8 different directions. Following (Choudhury et al., 2017), we set the ﬁlter size M

to 11 11 11 for Video, 11 11 31 for Multispectral, and 11 11 8 8 for Light ﬁeld.

Table 3. Summary of the higher-dimensional data sets used.

size #training #testing Video 100 100 7 573 143 Multispectral 60 60 31 2,200 1,000 Light ﬁeld 60 60 8 8 7,700 385

We compare SCSC with OCSC and the concensus CSC (CCSC) (Choudhury et al., 2017) algorithms, with K = 50. For fair comparison, only one machine is used for all methods. We do not compare with the batch methods and the two online methods (OCDL-Degraux and OCDL-Liu) as they are not scalable (as already shown in Section 4.4).

Because of the small memory footprint of SCSC, we run it on a GTX 1080 Ti GPU in this experiment. OCSC is also run on GPU for Video. However, OCSC can only run on CPU for Multispectral and Light ﬁeld. CCSC, which needs to access all the samples and codes during processing, can only be on CPU.2

Results are shown in Table 4. Note that SCSC is the only method that can handle the whole of Video, Multispectral and Light ﬁeld data sets on a single machine. In comparison, CCSC can only handle a maximum of 30 Video samples, 40 Multispectral samples, and 35 Light ﬁeld samples. OCSC can handle the whole of Video and Multispectral, but cannot converge in 2 days when the whole Light ﬁeld data set is used. Again, SCSC outperforms OCSC and CCSC.

As for speed, SCSC is the fastest. However, note that this is for reference only as SCSC is run on GPU while the others (except for OCSC on Video) are run on CPU. Nevertheless, this still demonstrates an important advantage of SCSC, namely that its small memory footprint can beneﬁt from the use of GPU, while the others cannot.

4.6. Image Denoising and Inpainting

In previous experiments, superiority of the learned dictionary is demonstrated by reconstruction of clean images. In this section, we further examine the learned dictionary on two applications: image denoising and inpainting. Ten test images provided by (Choudhury et al., 2017) are used. In denoising, we add Gaussian noise with zero mean and variance 0.01 to the test images (the average input PSNR is 10d B). In inpainting, we random sub-sample 50% of the pixels as 0 (the average input PSNR is 9.12d B). Following (Heide et al., 2015; Choudhury et al., 2017; Papyan et al.,

2For Video, the memory used (in GB) by CCSC, OCSC, SCSC (with R = 5) and SCSC (with R = 10) are 28.73, 7.58, 2.66, and 2.87, respectively. On Multispectral, they are 28.26, 11.09, 0.73 and 0.76; on Light ﬁeld, they are 29.79, 15.94, 7.26 and 8.88, respectively.

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Table 4. Results on the higher-dimensional data sets. PSNR is in d B and clock time is in hours. Timing results based on GPU are marked with asterisks.

Video Multispectral Light ﬁeld PSNR time PSNR time PSNR time CCSC 20.43 0.11 11.91 0.07 17.67 0.14 27.88 0.07 13.70 0.09 8.99 0.11 OCSC 33.17 0.01 1.41 0.04* 30.12 0.02 31.19 0.02 - -

SCSC R = 5 35.30 0.02 0.73 0.02* 30.51 0.02 1.21 0.03* 29.30 0.03 11.12 0.07* R = 10 38.02 0.03 0.81 0.01* 31.71 0.01 1.40 0.01* 31.70 0.02 17.97 0.05*

Table 5. Testing PSNR (d B) on image denoising and inpainting.

denoising inpainting SBCSC OCSC SCSC SBCSC OCSC SCSC Wind Mill 14.88 0.03 16.20 0.03 17.27 0.02 29.76 0.13 29.40 0.14 29.76 0.08 Sea Rock 14.80 0.02 16.01 0.02 17.10 0.02 24.92 0.06 25.04 0.04 25.17 0.04 Parthenon 14.97 0.02 16.33 0.01 17.44 0.03 27.06 0.06 26.79 0.04 28.04 0.04 Rolls Royce 15.23 0.01 16.27 0.01 17.63 0.02 24.96 0.13 24.66 0.10 25.06 0.05 Fence 15.21 0.04 16.53 0.02 17.56 0.03 26.81 0.05 26.71 0.08 26.85 0.05 Car 16.90 0.01 18.05 0.03 20.06 0.05 29.60 0.07 29.40 0.09 30.44 0.04 Kid 14.90 0.01 16.21 0.02 17.22 0.03 25.36 0.01 25.42 0.07 25.67 0.07 Tower 14.89 0.02 16.19 0.01 18.36 0.05 26.64 0.04 26.48 0.06 26.96 0.03 Fish 16.40 0.01 17.40 0.01 18.61 0.02 27.49 0.03 26.98 0.08 27.23 0.07 Food 16.38 0.01 17.68 0.02 18.56 0.03 29.96 0.05 29.62 0.08 31.49 0.02

2017), we use a binary weight matrix to mask out positions of the missing pixels. We use the ﬁlters learned from Fruit in Section 4.4. SCSC is compared with (batch) SBCSC and (online) OCSC.

Results are shown in Table 5. As can be seen, the PSNRs obtained by SCSC are consistently higher than those by the other methods. This shows that the dictionary, which yields high PSNR on image reconstruction, also leads to better performance in other image processing applications.

4.7. Solving (16): ni APG vs ADMM

Finally, we compare the performance of ADMM and ni APG in solving subproblem (16). We use a training sample xi from City. The experiment is repeated ﬁve times with different (Wi, Zi) initializations. Figure 8 shows convergence of the objective in (16) with time. As can be seen, ni APG has fast convergence while ADMM fails to converge. Figure 9 shows k Yi F(Zi W >

F , which measures violation of the ADMM constraints, with the number of iterations. As can be seen, the violation does not go to zero, which indicates that ADMM does not converge.

5. Conclusion

In this paper, we proposed a novel CSC extension, in which each sample has its own sample-dependent dictionary constructed from a small set of shared base ﬁlters. Using online learning, the model can be efﬁciently updated with low time and space complexities. Extensive experiments on a variety of data sets including large image data sets and

Figure 8. Convergence of ni APG and ADMM on solving (16).

Figure 9. Constraint violation in ADMM.

higher-dimensional data sets all demonstrate its efﬁciency and scalability.

Acknowledgements

The second author especially thanks Weiwei Tu and Yuqiang Chen from 4Paradigm Inc. This research was supported in part by the Research Grants Council, Hong Kong, under Grant 614513, and by the University of Macau Grant SRG2015-00050-FST.

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An

algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311 4322, 2006.

Andilla, F. and Hamprecht, F. Sparse space-time deconvolution for calcium image analysis. In Advances in Neural Information Processing Systems, pp. 64 72, 2014.

Bertinetto, L., Henriques, J. F., Valmadre, J., Torr, P., and

Vedaldi, A. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pp.

523 531, 2016.

Bertsekas, D. and Tsitsiklis, J. Parallel and Distributed

Computation: Numerical Methods. Athena Scientiﬁc, 1997.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.

Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011.

Bristow, H., Eriksson, A., and Lucey, S. Fast convolutional

sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 391 398, 2013.

Chang, H., Han, J., Zhong, C., Snijders, A., and Mao,

J. Unsupervised transfer learning via multi-scale convolutional sparse coding for biomedical applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

Choudhury, B., Swanson, R., Heide, F., Wetzstein, G., and

Heidrich, W. Consensus convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4280 4288, 2017.

Cogliati, A., Duan, Z., and Wohlberg, B. Context-dependent

piano music transcription with convolutional sparse coding. IEEE/ACM Transactions on Audio Speech and Language Processing, 24(12):2218 2230, 2016.

Degraux, K., Kamilov, U. S., Boufounos, P. T., and Liu, D.

Online convolutional dictionary learning for multimodal imaging. In IEEE International Conference on Image Processing, pp. 1617 1621, 2017.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N.,

Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647 655, 2014.

Gu, S., Zuo, W., Xie, Q., Meng, D., Feng, X., and Zhang, L.

Convolutional sparse coding for image super-resolution. In International Conference on Computer Vision, pp. 1823 1831, 2015.

Heide, F., Heidrich, W., and Wetzstein, G. Fast and ﬂexible

convolutional sparse coding. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5135 5143, 2015.

Jas, M., La Tour, T. D., Simsekli, U., and Gramfort, A.

Learning the morphology of brain signals using alphastable convolutional sparse coding. In Advances in Neural Information Processing Systems, pp. 1099 1108, 2017.

Jia, X., De Brabandere, B., Tuytelaars, T., and Gool,

L. V. Dynamic ﬁlter networks. In Advances in Neural Information Processing Systems, pp. 667 675, 2016.

Kalantari, N. K., Wang, T., and Ramamoorthi, R. Learning-

based view synthesis for light ﬁeld cameras. ACM Transactions on Graphics, 35(6):193, 2016.

Kang, D., Dhar, D., and Chan, A. Incorporating side information by adaptive convolution. In Advances in Neural Information Processing Systems, pp. 3870 3880, 2017.

Kavukcuoglu, K., Sermanet, P., Boureau, Y., Gregor, K.,

Mathieu, M., and Le Cun, Y. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems, pp. 1090 1098, 2010.

Krizhevsky, A. and Hinton, G. Learning multiple layers of

features from tiny images. Technical report, University of Toronto, 2009.

Li, L., Huang, W., Gu, I. Y., and Tian, Q. Statistical modeling of complex backgrounds for foreground object detection. IEEE Transactions on Image Processing, 13 (11):1459 1472, 2004.

Liu, J., Garcia-Cardona, C., Wohlberg, B., and Yin, W.

Online convolutional dictionary learning. In IEEE International Conference on Image Processing, pp. 1707 1711, 2017.

Mallat, S. A Wavelet Tour of Signal Processing. Academic

Press, 1999.

Nilsback, M. and Zisserman, A. Automated ﬂower classiﬁcation over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722 729, 2008.

Pachitariu, M., Packer, A., Pettit, N., Dalgleish, H., Hausser,

M., and Sahani, M. Extracting regions of interest from

Online Convolutional Sparse Coding with Sample-Dependent Dictionary

biological images with convolutional sparse block coding. In Advances in Neural Information Processing Systems, pp. 1745 1753, 2013.

Papyan, V., Romano, Y., Sulam, J., and Elad, M.

Convolutional dictionary learning via local processing. In International Conference on Computer Vision, pp. 5296 5304, 2017.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations

and Trends in Optimization, 1(3):127 239, 2014.

Peter, S., Kirschbaum, E., Both, M., Campbell, L., Harvey,

B., Heins, C., Durstewitz, D., Diego, F., and Hamprecht, F. A. Sparse convolutional coding for neuronal assembly detection. In Advances in Neural Information Processing Systems, pp. 3678 3688, 2017.

Rigamonti, R., Sironi, A., Lepetit, V., and Fua, P. Learning

separable ﬁlters. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2754 2761, 2013.

Sironi, A., Tekin, B., Rigamonti, R., Lepetit, V., and Fua, P.

Learning separable ﬁlters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):94 106, 2015.

ˇSorel, M. and ˇSroubek, F. Fast convolutional sparse coding

using matrix inversion lemma. Digital Signal Processing, 55:44 51, 2016.

Wang, Y., Yin, W., and Zeng, J. Global convergence of

ADMM in nonconvex nonsmooth optimization. ar Xiv preprint ar Xiv:1511.06324, 2015.

Wang, Y., Yao, Q., Kwok, J. T., and Ni, L. M. Scalable

online convolutional sparse coding. IEEE Transactions on Image Processing, 2018.

Wohlberg, B. Efﬁcient algorithms for convolutional sparse

representations. IEEE Transactions on Image Processing, 25(1):301 315, 2016.

Yao, Q., Kwok, J., Gao, F., Chen, W., and Liu, T.-

Y. Efﬁcient inexact proximal gradient algorithm for nonconvex problems. In International Joint Conferences on Artiﬁcal Intelligence, pp. 3308 3314, 2017.

Yasuma, F., Mitsunaga, T., Iso, D., and Nayar, S. K.

Generalized assorted pixel camera: postcapture control of resolution, dynamic range, and spectrum. IEEE Transactions on Image Processing, 19(9):2241 2253,

Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R.

Deconvolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2528 2535, 2010.