# Learning by Stretching Deep Networks

Gaurav Pandey (GP88@CSA.IISC.ERNET.IN), Ambedkar Dukkipati (AD@CSA.IISC.ERNET.IN)
Department of Computer Science and Automation, Indian Institute of Science, Bangalore-560012, India

In recent years, deep architectures have gained a lot of prominence for learning complex AI tasks because of their capability to incorporate complex variations in data within the model. However, these models often need to be trained for a long time in order to obtain good results. In this paper, we propose a technique, called stretching, that allows the same models to perform considerably better with very little training. We show that learning can be done tractably, even when the weight matrix is stretched to infinity, for some specific models. We also study tractable algorithms for implementing stretching in deep convolutional architectures in an iterative manner and derive bounds for their convergence. Our experimental results suggest that the proposed stretched deep convolutional networks are capable of achieving good performance for many object recognition tasks. More importantly, for a fixed network architecture, one can achieve much better accuracy using stretching than by learning the weights using backpropagation.

1. Introduction

In recent years, there has been a growing amount of research on mapping the raw representation of the input (primarily images and speech signals) to a feature representation suitable for classification. This has led many researchers to wonder whether or not the representation learning stage should be kept independent of the classification stage. Most feature extraction algorithms, such as SIFT (Lowe, 2004) and SURF (Bay et al., 2006), keep the two stages separate and have been shown to obtain good performance for object recognition tasks. Furthermore, many feature learning algorithms introduced in the past decade, such as sparse coding (Lee et al., 2006), restricted Boltzmann machines (Hinton, 2002), deep Boltzmann machines (Salakhutdinov & Hinton, 2009), autoencoders (Bengio et al., 2007), and k-means (Coates et al., 2011), also learn features independently of the classification stage. Recent results have shown that using one or more layers of these feature learning algorithms is sufficient to obtain state-of-the-art results for many object recognition tasks (Bo et al., 2012; Coates et al., 2011).

On the other end of the spectrum, there are learning models that couple the two stages. Neural networks are the primary examples of such models. While neural networks were introduced three decades back, the lack of a proper way to initialize their weights and the extensive time and memory required for training have kept these models dormant for a long time. In recent years, the development of unsupervised feature learning algorithms, new weight-initialization methods and inexpensive GPUs have brought neural networks back into the mainstream of machine learning research. In fact, many state-of-the-art results for object recognition tasks have been obtained using deep convolutional neural networks (Ciresan et al., 2012; Zeiler & Fergus, 2013b).
The reason behind the exceptional performance of neural networks as compared to other classifiers is that, unlike other classifiers, representation learning is already embedded in the neural network architecture. Hence, if we can provide a proper feature representation to other classifiers, they may be able to perform as well as neural networks, as has been shown in many recent results (Bo et al., 2012; Coates et al., 2011; Pandey & Dukkipati, 2014). Towards this end, we introduce a method for computing a kernel (equivalent to representation learning), which can then be fed to a kernel-based classifier. The method for computing the kernel involves stretching the weight matrix learnt by an arbitrary feature learning algorithm.

Contributions: We introduce a technique called stretching, which uses a minimal amount of learning to achieve reasonable performance on classification tasks. The technique takes a feature learning algorithm as input and stretches the weight matrix learnt by the algorithm to generate new weight vectors.

1. We show that when the weight matrix has been stretched to infinity, learning can be done tractably with the help of kernels for a special case. The result utilizes the derivation of arc-cosine kernels given by Cho & Saul (2010). We refer to the corresponding architecture as a stretched deep network (SDN). In our experimental results, we show that for character recognition tasks, SDNs can achieve reasonable performance with minimal training.

2. In order to deal with realistic images of moderate size, we need to resort to a convolution + pooling based architecture (LeCun et al., 1998). We show the limitations of implementing infinite stretching in such an architecture.

3. We give an iterative method to compute an approximation to infinite stretching. Such an architecture is referred to as a convolutional stretched deep network, or convolutional SDN.

4. We show that the approximate kernel matrix indeed converges to the kernel matrix corresponding to infinite stretching, when the convolutional SDN has a single convolution + pooling layer.

5. In our experimental results, we show that convolutional SDNs are capable of achieving good performance on many well-known object recognition datasets.

2. Randomly Weighted Neural Networks and the Arc-cosine Kernel

It has long been known that shallow architectures with random features are capable of achieving reasonable performance on many machine learning tasks (Amit & Geman, 1997). Rahimi & Recht (2008) showed that training a shallow neural network by randomly choosing the non-linearities results in a network that performs no worse than a network where the weights are optimally chosen. This was the motivation for considering randomly weighted neural networks with an infinite number of neurons by Cho & Saul (2010), who showed that learning can be performed tractably for such models by using the kernel trick for many thresholding-based non-linearities. In particular, let the non-linearity under consideration be $\max(0, x)$, which is denoted by $(x)_+$. Let $x, y \in \mathbb{R}^D$ be two input instances, and let $h_x$ and $h_y$ be their corresponding feature representations in a single-layer neural network with random weights, i.e., $h_x = \frac{1}{\sqrt{L}}(W^T x)_+$ and $h_y = \frac{1}{\sqrt{L}}(W^T y)_+$, where $W \in \mathbb{R}^{D \times L}$ is a Gaussian random matrix with independent entries.
As $L \to \infty$, the inner product between the feature representations converges, and is given by

$$k(x, y) = \frac{1}{2\pi} \, \|x\| \, \|y\| \, \big(\sin\theta + (\pi - \theta)\cos\theta\big),$$

where $\theta$ is the angle between $x$ and $y$ (Cho & Saul, 2010). This kernel is known as the arc-cosine kernel. Feature learning can be incorporated into this kernel by sampling the columns of $W$ from a multivariate Gaussian distribution with mean $0$ and covariance matrix $\Sigma$, where $\Sigma$ has been learnt from the data. Such an approach has been considered by Pandey & Dukkipati (2014). The corresponding kernel is also known as the covariance arc-cosine kernel.

3. The Notion of Stretching

In representation learning, one often fixes the desired number of features to a finite quantity. Then, one of the possibly many unsupervised learning algorithms (for instance, RBMs, autoencoders, k-means) is used to learn the features from the data. The most commonly used feature learning algorithms are the restricted Boltzmann machine and autoencoders. In recent years, triangular k-means (Coates et al., 2011) has also been shown to be effective for representation learning from data. Irrespective of the algorithm used, the aim of feature learning is to find a set of weight vectors (centroids in k-means) that optimize some objective. The weight vectors have the same dimension as the input data.

The idea behind the introduced concept of stretching is to take the weight vectors learnt by a representation learning model and generate new weight vectors from the original ones. One possibility is to look at linear combinations of the learnt weight vectors. In our model, we assume that the scalars corresponding to these linear combinations have been sampled independently from the Gaussian distribution $N(0, 1)$. Hence, we formally define stretching for our model as follows:

Definition 3.1 (Matrix stretching). Let $A \in \mathbb{R}^{D \times M}$ be a matrix whose columns denote the weight vectors learnt by some feature learning algorithm. If no feature learning algorithm is used, $A$ can be assumed to be an identity matrix. Let $W \in \mathbb{R}^{M \times L}$, $L > M$, be a matrix whose entries have been sampled from the standard normal distribution. Then, the stretched matrix $A_{cst}$ is defined as $A_{cst} = \frac{1}{\sqrt{M}} A W$.

We illustrate the concept of column stretching in two dimensions in Figure 1.

Figure 1. Original weight vectors (in red) and the weight vectors obtained after stretching (in blue).

As can be observed readily from the figure, the new weight vectors appear to be generated from a multivariate Gaussian distribution whose covariance is controlled by the original points. We make this result more formal as follows:

Theorem 3.2. Let $A \in \mathbb{R}^{D \times M}$ be a fixed real-valued matrix and $A_{cst} \in \mathbb{R}^{D \times L}$ be the matrix obtained after applying column stretching to $A$. Then the columns of $A_{cst}$ are distributed according to a multivariate Gaussian distribution with mean $0$ and covariance matrix $\Sigma$, where $\Sigma$ is given by

$$\Sigma = \frac{1}{M} \sum_{m=1}^{M} a_m a_m^T = \frac{1}{M} A A^T. \qquad (1)$$

Here, $a_m$ denote the columns of $A$.
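To make the definition concrete, here is a minimal NumPy sketch of matrix stretching (our own illustration; the function name and the toy check are not from the paper). It draws the stretching matrix $W$ and verifies empirically that the stretched columns have covariance close to $\frac{1}{M} A A^T$, in line with Theorem 3.2.

```python
import numpy as np

def stretch_weights(A, L, rng=None):
    """Column-stretch a learnt weight matrix A (D x M) into A_cst (D x L).

    Each new column is a random linear combination of the columns of A, with
    coefficients drawn i.i.d. from N(0, 1); the 1/sqrt(M) scaling (our reading
    of Definition 3.1) makes the column covariance (1/M) * A @ A.T, as in Theorem 3.2.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, M = A.shape
    W = rng.standard_normal((M, L))      # stretching matrix with i.i.d. N(0,1) entries
    return (A @ W) / np.sqrt(M)          # stretched weight matrix, D x L

# Toy check: empirical covariance of the stretched columns vs (1/M) A A^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 5))          # 5 learnt weight vectors in 2-D
A_cst = stretch_weights(A, L=200000, rng=rng)
emp_cov = A_cst @ A_cst.T / A_cst.shape[1]
print(np.abs(emp_cov - A @ A.T / 5).max())   # small for large L
```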
4. Single Layer Stretched Network

A single-layer stretched architecture comprises a visible layer, a stretched feature layer and an output layer. Let $x$ be an instance and $A$ be a matrix whose columns are the weight vectors learnt by a feature learning algorithm. Let $W \in \mathbb{R}^{K \times L}$ be a random matrix whose entries have been sampled from the standard normal distribution. The encoding of a node in the stretched feature layer is then given by

$$h_j(x) = \frac{1}{\sqrt{L}} f(x, A w_j), \quad 1 \le j \le L, \qquad (2)$$

where $f$ is an activation function that depends upon the feature learning algorithm used. For instance, $f(x, w) = (x^T w)_+$ in the case of a ReLU RBM, and $f(x, w) = \sigma(x^T w)$ in the case of a sigmoid RBM. As is obvious from the above expression, we have simply replaced the original weight vectors by their linear combinations.

In order to evaluate the effect of stretching on the performance of the subsequent classifier, we compare the unstretched model with the stretched model for a ReLU RBM on the MNIST dataset, for increasing values of $L$. The classifier used at the output layer is a linear SVM. No fine-tuning is performed for either of the models. In order to account for the randomness introduced by the matrix $W$, we average the results over 10 iterations. The weight matrix used for the stretched and the unstretched model is the same and is obtained after training the model for 5 epochs (about 150 s). The results are plotted in Figure 2.

Figure 2. The plot depicts the improvement in classification performance on MNIST data as the weight matrix learnt by an RBM for the dataset is stretched to increase the number of features. The dotted red line indicates the classification performance achieved by an RBM without stretching.

It can be observed from the figure that stretching affects the performance of the subsequent classifier. Furthermore, the performance of the classifier improves as the number of units in the stretched layer increases. Hence, it is particularly interesting to study the case when the number of weight vectors in the stretched matrix $A_{cst}$ tends to infinity. This means that $A_{cst}$ consists of infinitely many linear combinations of the original weight vectors, where the scalars for these linear combinations are sampled independently from the standard normal distribution. In matrix notation, one can rewrite the above statement as $A_{cst} = A W$, where $A \in \mathbb{R}^{D \times M}$ is the original weight matrix, and $W \in \mathbb{R}^{M \times \infty}$ is the random matrix whose entries have been sampled independently from the standard normal distribution.

Learning can be made feasible for infinitely stretched networks by using the kernel trick for the special case when the activation function is of the form $f(x, w) = (x^T w)_+$. In particular, we note that learning a linear classifier is equivalent to learning a kernel machine with the linear kernel. Here, the linear kernel corresponds to an inner product in the feature space, where the feature representation of an instance $x$ is obtained by using equation (2). The next theorem allows us to compute the inner product in the feature space as $L \to \infty$.

Theorem 4.1. Let $A$ be the weight matrix learnt by a ReLU RBM. Let $W \in \mathbb{R}^{K \times L}$ be a random matrix whose entries have been sampled independently from the standard normal distribution. Let $x$ and $y$ be two instances whose feature representations are given by

$$h_\ell(x) = \frac{1}{\sqrt{L}} (x^T A w_\ell)_+, \quad 1 \le \ell \le L, \qquad h_\ell(y) = \frac{1}{\sqrt{L}} (y^T A w_\ell)_+, \quad 1 \le \ell \le L.$$

Let $h_x = (h_\ell(x))_{1 \le \ell \le L}^T$ and $h_y = (h_\ell(y))_{1 \le \ell \le L}^T$. Then the inner product between the feature representations of $x$ and $y$ as $L \to \infty$ is given by

$$\lim_{L \to \infty} h_x^T h_y = \frac{1}{2\pi} \, \|A^T x\| \, \|A^T y\| \, \big(\sin\theta_A + (\pi - \theta_A)\cos\theta_A\big), \qquad (3)$$

where $\theta_A = \cos^{-1}\left(\dfrac{x^T A A^T y}{\|A^T x\| \, \|A^T y\|}\right)$.

The proof of the above theorem is given in the supplementary material.
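The following sketch (our own illustration, assuming ReLU activations as in the theorem) compares the closed-form limit of equation (3) against a finite-$L$ Monte Carlo estimate built from explicitly stretched features; for large $L$ the two values should agree closely.

```python
import numpy as np

def stretched_kernel(x, y, A):
    """Closed-form inner product of infinitely stretched ReLU features, as in equation (3)."""
    ax, ay = A.T @ x, A.T @ y
    na, nb = np.linalg.norm(ax), np.linalg.norm(ay)
    cos_t = np.clip(ax @ ay / (na * nb), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return na * nb * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def finite_stretched_kernel(x, y, A, L, rng):
    """Monte-Carlo estimate with L stretched ReLU features, h_l = (1/sqrt(L)) (x^T A w_l)_+."""
    W = rng.standard_normal((A.shape[1], L))
    hx = np.maximum(0.0, x @ A @ W) / np.sqrt(L)
    hy = np.maximum(0.0, y @ A @ W) / np.sqrt(L)
    return hx @ hy

rng = np.random.default_rng(0)
x, y = rng.standard_normal(50), rng.standard_normal(50)
A = rng.standard_normal((50, 20))                 # weight matrix learnt by any feature learner
print(stretched_kernel(x, y, A))                  # exact limit
print(finite_stretched_kernel(x, y, A, L=200000, rng=rng))   # should be close for large L
```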
If we substitute $\Sigma = \frac{1}{M} A A^T$ in equation (3), we get

$$k_\Sigma(x, y) = \lim_{L \to \infty} h_x^T h_y = \frac{1}{2\pi} \, \|x\|_\Sigma \, \|y\|_\Sigma \, \big(\sin\theta_\Sigma + (\pi - \theta_\Sigma)\cos\theta_\Sigma\big),$$

where $\|x\|_\Sigma = \sqrt{x^T \Sigma x}$ and $\theta_\Sigma = \cos^{-1}\left(\dfrac{x^T \Sigma y}{\|x\|_\Sigma \, \|y\|_\Sigma}\right)$. This is a scaled version of the covariance arc-cosine kernel proposed by Pandey & Dukkipati (2014).

5. Incorporating Translational Invariance in SDNs

As we show in our experimental results, infinitely stretched networks allow us to obtain good performance for object recognition tasks when the images are all centred and have few pose variations, as in the MNIST dataset. However, for complex real-world tasks such as scene labelling, the above technique cannot be expected to perform equally well, since no prior information about the data (for instance, the information that objects can be present at any location in the image) is incorporated in the model.

The usual way to incorporate translational invariance in neural networks is to divide the image into overlapping patches, where each patch is a $d \times d$ square of pixels. The patches are flattened from matrices to vectors and the same weight matrix is then learnt from the flattened patches in either supervised or unsupervised mode. The features are then pooled over a region in the image, where the region may be fixed a priori (LeCun et al., 1998) or chosen stochastically (Zeiler & Fergus, 2013a). The above process is then repeated by treating the pooled feature maps from the previous layer as input. Such an architecture is called a convolutional neural network.

5.1. Convolutional Stretched Deep Networks

Assuming that one prefers to use average pooling and a single pooling layer, there is a straightforward method to incorporate translational invariance in infinitely stretched networks. Let $\{p_1, \ldots, p_I\}$ be the flattened patches in a region of an image $x$, and $\{q_1, \ldots, q_I\}$ be the flattened patches of the corresponding region in $y$. Let $A$ be the weight matrix learnt by some feature learning algorithm. If no feature learning algorithm is used, $A$ can be assumed to be an identity matrix. Then, the inner product between the stretched and pooled feature representations of $x$ and $y$ for the chosen region is given by

$$h_x^T h_y = \frac{1}{L I^2} \left( \sum_{i=1}^{I} (p_i^T A W)_+ \right) \left( \sum_{j=1}^{I} (q_j^T A W)_+ \right)^T.$$

As $L \to \infty$, the above inner product can be rewritten as

$$k(x, y) = \lim_{L \to \infty} h_x^T h_y = \lim_{L \to \infty} \frac{1}{L I^2} \sum_{\ell=1}^{L} \sum_{i=1}^{I} \sum_{j=1}^{I} (p_i^T A w_\ell)_+ (q_j^T A w_\ell)_+ = \frac{1}{I^2} \sum_{i=1}^{I} \sum_{j=1}^{I} k_\Sigma(p_i, q_j). \qquad (4)$$

Several important points can be noted about the manner in which the above kernel has been defined. Firstly, the above approach can be applied only if the pooling mechanism used is sum pooling or average pooling. Currently, we are not aware of any other approach to incorporate more general pooling mechanisms exactly in infinitely stretched networks. Secondly, the computation of a single entry of the pooled kernel matrix involves $O(I^2)$ kernel computations, where $I$ is the number of patches in the region. In general, the number of patches can be as high as the number of pixels in the region. Assuming a region of size $100 \times 100$, this means that the computation of one entry of the pooled kernel matrix may require $10^8$ kernel computations, clearly a prohibitively large number.
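As a concrete illustration of this cost, the short sketch below (ours; the helper names are hypothetical, not from the paper) computes one entry of the exact pooled kernel by summing $k_\Sigma$ over all patch pairs; the nested loop over patches is exactly the $O(I^2)$ term discussed above.

```python
import numpy as np

def cov_arccos_kernel(p, q, Sigma):
    """Covariance arc-cosine kernel k_Sigma(p, q) for two flattened patches."""
    norm_p, norm_q = np.sqrt(p @ Sigma @ p), np.sqrt(q @ Sigma @ q)
    cos_t = np.clip(p @ Sigma @ q / (norm_p * norm_q), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm_p * norm_q * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def exact_pooled_kernel(patches_x, patches_y, Sigma):
    """Exact pooled kernel entry k(x, y): average of k_Sigma over all I^2 patch pairs."""
    I = len(patches_x)
    total = sum(cov_arccos_kernel(p, q, Sigma)
                for p in patches_x for q in patches_y)   # O(I^2) kernel evaluations
    return total / I ** 2

# Toy usage with random "patches" and Sigma = (1/M) A A^T from a learnt A.
rng = np.random.default_rng(0)
A = rng.standard_normal((81, 16))                        # e.g. 9x9 patches, 16 learnt filters
Sigma = A @ A.T / A.shape[1]
patches_x = rng.standard_normal((25, 81))                # I = 25 patches per region
patches_y = rng.standard_normal((25, 81))
print(exact_pooled_kernel(patches_x, patches_y, Sigma))
```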
However, the main drawback of the above architecture is not its practical infeasibility, but an inherent limitation of the model itself. In a general convolutional architecture (LeCun et al., 1998), the first pooling layer pools the features over a small region, with successive layers pooling over larger and larger regions. The possible exception is the architecture of Coates et al. (2011), where the features are pooled over a large region in a single pooling layer. However, in the approach described above, once we obtain the kernel matrix after first-layer pooling, all information about the location of pixels in the original image is lost. Hence, there is no way to apply a second pooling layer to the model, and we are limited to a single pooling layer only.

5.2. Approximating Convolutional SDNs

Because of the limitations of using pooling in infinitely stretched networks, we need to resort to an approximation of the model. In Algorithm 1, we describe the architecture of an approximately stretched network with a single convolution layer and a pooling layer. The architecture addresses the first two issues associated with a convolutional SDN mentioned in Section 5.1. The issue of multiple pooling layers is addressed in Section 5.3. In order to convey the most important points of the algorithm, we have assumed that each image has a single pooling region in the first pooling layer. It is quite straightforward to extend the model to the case where each image has multiple pooling regions in the first pooling layer.

Algorithm 1: Iterative computation of the convolved kernel matrix
Input: A set of training instances
Output: The pooled kernel matrix
Initialize the kernel matrix $k_{ap}$ to all zeros. Learn the weight matrix $A$ from patches of images using an unsupervised feature learning algorithm. Repeat the following steps till convergence:
1. Sample the entries of the random matrix $W \in \mathbb{R}^{M \times L}$ independently from the standard normal distribution.
2. Compute the pooled feature representation of each instance using the random matrix $W$. In particular, the pooled feature representation of $x$ is given by $h_x = \frac{1}{I\sqrt{L}} \sum_{i=1}^{I} (W^T A^T p_i)_+$, where $p_i$ are the patches in $x$.
3. Update the approximate kernel matrix using $k_{ap}^{(t+1)}(x, y) = \frac{1}{t+1}\left(t \, k_{ap}^{(t)}(x, y) + h_x^T h_y\right)$.
Return the kernel matrix $k_{ap}$.

The proposed algorithm gives an iterative, memory-efficient method of computing the pooled kernel matrix. We show below that if the algorithm is run long enough, the resulting pooled kernel matrix indeed converges to the true pooled kernel matrix obtained in the previous section. Here, we have shown the result for average pooling. For max-pooling, we need to use the inequality $\mathrm{var}\!\left(\max_{i=1}^{I} X_i\right) \le \sum_{i=1}^{I} \mathrm{var}(X_i)$, where $\mathrm{var}(X)$ denotes the variance of the random variable $X$. Note that one cannot obtain the mean ($k(x, y)$ in equation (4)) in closed form for max-pooling. However, the result still remains valid.

Theorem 5.1. The pooled kernel matrix of the approximately stretched network converges in probability to the pooled kernel matrix of the infinitely stretched network. Furthermore, let $T$ be the number of iterations for which Algorithm 1 has been run. Then, for fixed instances $x$ and $y$, with probability at least $1 - \delta$,

$$|k_{ap}(x, y) - k(x, y)| \le \sqrt{\frac{1}{T L \delta}} \; \frac{1}{I^2} \sum_{i=1}^{I} \sum_{j=1}^{I} \|A^T p_i\|_2 \, \|A^T q_j\|_2, \qquad (5)$$

where $k(x, y)$ is the exact value of the kernel entry and $k_{ap}(x, y)$ is the approximate kernel value after $T$ iterations.

The proof of the above theorem is given in the supplementary material. The above result implies that, if the difference between a kernel entry in the approximate and exact kernel matrices after one iteration is $\epsilon$ with probability at least $1 - \delta$, then with the same probability we obtain a ten-fold reduction in estimation error by scaling $L$ ten times and repeating the experiment for ten iterations.
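A compact NumPy sketch of the iterative estimate of Algorithm 1 (our own vectorized rendering; variable names and default values are ours, not the paper's):

```python
import numpy as np

def approximate_pooled_kernel(patch_sets, A, L=1024, T=20, rng=None):
    """Iterative Monte-Carlo estimate of the pooled kernel matrix, in the spirit of Algorithm 1.

    patch_sets : list of arrays, one per image, each of shape (I, patch_dim)
    A          : learnt weight matrix of shape (patch_dim, M)
    L          : number of stretched features drawn per iteration
    T          : number of iterations (independent resamplings of W)
    """
    rng = np.random.default_rng() if rng is None else rng
    n, M = len(patch_sets), A.shape[1]
    k_ap = np.zeros((n, n))
    for t in range(T):
        W = rng.standard_normal((M, L))                       # fresh stretching matrix
        # Pooled feature of each image: average of ReLU responses over its patches.
        H = np.stack([np.maximum(0.0, P @ A @ W).mean(axis=0) / np.sqrt(L)
                      for P in patch_sets])                   # shape (n, L)
        k_ap = (t * k_ap + H @ H.T) / (t + 1)                 # running-average update
    return k_ap
```

Because successive iterations use independent draws of $W$, the outer loop can be distributed across machines and the partial Gram matrices averaged afterwards.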
By using the above approximation, one can incorporate any non-linearity in the model architecture. One disadvantage of the model is that one may need to run the algorithm multiple times to obtain the pooled kernel matrix. However, since the entries of the random matrix $W$ are sampled independently, it is trivial to parallelize the model across multiple nodes or adapt it to any distributed system. Furthermore, in our experiments, we found that 20 iterations were often enough to obtain good results for many object recognition tasks.

Figure 3. A convolutional SDN with two pooling layers. A convolutional SDN is equivalent to a convolutional neural network with infinitely many random filters in each convolutional layer. The dotted lines indicate infinitely many feature maps/filters.

5.3. Extension to multiple layers

Instead of using the output of the first pooling layer to compute the kernel matrix, one can feed it to another layer of convolution + pooling. Specifically, let $x$ be an image of size $m \times m$ and $U$ be a tensor of size $d \times d \times L_1$ whose entries are sampled from $N(0, 1)$. Let $u_1, \ldots, u_{L_1} \in \mathbb{R}^{d \times d}$ be the sub-tensors of $U$ obtained by fixing the last dimension of $U$. For the sake of clarity, we assume that the matrix $A$ learnt from flattened patches of $x$ is an identity matrix; it is straightforward to merge the matrix $A$ with the tensor $U$ when it is not.

In the first convolutional layer, $x$ is convolved with $u_1, \ldots, u_{L_1}$ (followed by rectification) to obtain $L_1$ feature maps of size $(m - d + 1) \times (m - d + 1)$, which are then pooled using any of the possible pooling methods. Let the resultant size of each feature map be $m_1 \times m_1$. Let us denote the feature maps at the output of the first pooling layer as $f_1, \ldots, f_{L_1}$. Hence, $f_i = \mathrm{pool}(x * u_i)$, where $*$ denotes the 2-D convolution operator and $\mathrm{pool}(\cdot)$ denotes the pooling operation.

Next, the $L_1$ feature maps at the output of the first pooling layer are divided into $L_2$ overlapping subsets, where each subset consists of $Q$ feature maps. One can represent each subset as a tensor of size $m_1 \times m_1 \times Q$. Let us denote them as $\hat{x}_1, \ldots, \hat{x}_{L_2}$. Let $V \in \mathbb{R}^{d \times d \times Q \times L_2}$ be a tensor whose entries have been sampled from $N(0, 1)$, and let $v_1, \ldots, v_{L_2} \in \mathbb{R}^{d \times d \times Q}$ denote the sub-tensors of $V$ obtained by fixing the last dimension. Each subset $\hat{x}_1, \ldots, \hat{x}_{L_2}$ is convolved (3-D convolution) with its corresponding sub-tensor to obtain $L_2$ feature maps, which are then pooled. We denote the resultant feature maps as $g_1, \ldots, g_{L_2}$, where $g_i = \mathrm{pool}(\hat{x}_i * v_i)$. One can add further layers in a similar manner. The output of the last pooling layer is then fed to a linear kernel. This entire procedure corresponds to one single iteration of kernel computation. A pictorial representation of an SDN is given in Figure 3. The entire algorithm for computing the kernel is informally described in Algorithm 2.

Algorithm 2: Iterative computation of the convolved kernel matrix for multiple stages of convolution + pooling
Input: A set of training instances
Output: The pooled kernel matrix
Initialize the kernel matrix $k_{ap}$ to all zeros. Repeat until the kernel matrix $k_{ap}$ converges:
1. Sample the entries of the tensors $U$, $V$, etc., from $N(0, 1)$.
2. Use the steps described in Section 5.3 to compute the feature maps at the output of the last pooling layer for each instance. Let them be denoted as $g_1, \ldots, g_{L_2}$, and flatten them into a single vector $g$.
3. Update the kernel matrix as $k_{ap}^{(t+1)}(x, y) = \frac{1}{t+1}\left(t \, k_{ap}^{(t)}(x, y) + g_x^T g_y\right)$.
Return the kernel matrix $k_{ap}$.

Finally, we would like to mention that in all our experiments using convolutional SDNs, we assume an infinite fully-connected random layer as the final layer while computing the final kernel matrix. Let $k_{ap}$ be the kernel matrix obtained after multiple stages of convolution + pooling and $k_{full}$ be the kernel matrix after the fully connected random layer. Furthermore, let $g_x$ and $g_y$ be the feature representations of two instances $x$ and $y$ after the last pooling stage. Then

$$k_{full}(x, y) = \frac{1}{2\pi} \, \|g_x\| \, \|g_y\| \, \big(\sin\theta + (\pi - \theta)\cos\theta\big) = \frac{1}{2\pi} \sqrt{k_{ap}(x, x)} \, \sqrt{k_{ap}(y, y)} \, \big(\sin\theta + (\pi - \theta)\cos\theta\big),$$

where $\theta$ is the angle between $g_x$ and $g_y$ and is given by

$$\theta = \cos^{-1}\left(\frac{k_{ap}(x, y)}{\sqrt{k_{ap}(x, x)} \, \sqrt{k_{ap}(y, y)}}\right).$$

Hence, the kernel matrix after the fully connected stage depends only upon the kernel matrix after the last pooling stage, and it can therefore be computed after Algorithm 2 stops.
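As a rough end-to-end illustration (ours; the layer sizes, the SciPy-based convolutions, non-overlapping average pooling and the random choice of feature-map subsets are simplifying assumptions, not the paper's exact configuration), the sketch below computes one random draw of convolutional-SDN features for a single image and the arc-cosine closure for the final fully connected layer. Averaging $g_x^T g_y$ over repeated draws, as in Algorithm 2, yields $k_{ap}$, after which `fully_connected_kernel` produces $k_{full}$.

```python
import numpy as np
from scipy.signal import convolve2d, convolve

def avg_pool(fm, s):
    """Non-overlapping s x s average pooling (remainder rows/columns are cropped)."""
    h, w = (fm.shape[0] // s) * s, (fm.shape[1] // s) * s
    return fm[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def sdn_features_one_iteration(x, L1=32, L2=64, Q=8, d=5, pool=2, rng=None):
    """One random draw of convolutional-SDN features for a single 2-D image x."""
    rng = np.random.default_rng() if rng is None else rng
    # Stage 1: L1 random d x d filters, rectification, average pooling.
    U = rng.standard_normal((L1, d, d))
    f = [avg_pool(np.maximum(0.0, convolve2d(x, U[i], mode="valid")), pool)
         for i in range(L1)]
    f = np.stack(f, axis=-1)                                   # (m1, m1, L1)
    # Stage 2: L2 random subsets of Q feature maps, 3-D convolution, rectify, pool.
    V = rng.standard_normal((L2, d, d, Q))
    g = []
    for j in range(L2):
        idx = rng.choice(L1, size=Q, replace=False)            # random subset of maps
        conv = convolve(f[:, :, idx], V[j], mode="valid")[:, :, 0]
        g.append(avg_pool(np.maximum(0.0, conv), pool))
    return np.concatenate([m.ravel() for m in g])              # flattened vector g

def fully_connected_kernel(k_ap):
    """Arc-cosine closure for the final infinite fully connected layer: k_full from k_ap."""
    diag = np.sqrt(np.diag(k_ap))
    cos_t = np.clip(k_ap / np.outer(diag, diag), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return np.outer(diag, diag) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# Usage: g = sdn_features_one_iteration(image) for each image; average g_x^T g_y over
# iterations to build k_ap, then apply fully_connected_kernel(k_ap) once at the end.
```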
6. Experimental results

We tested the performance of stretched networks for classification on various datasets. We have separated the results for non-convolutional stretched networks from those for convolutional SDNs. Furthermore, whenever we refer to convolutional SDNs, we mean the approximated convolutional SDN derived using Algorithm 1. The classifier used in all the experiments for SDNs is a kernel SVM.

6.1. Experimental results for SDNs

6.1.1. MNIST

MNIST (LeCun et al., 1998) is a standard dataset for character recognition with 50000 training and 10000 test samples of digits ranging from 0 to 9. It is a relatively simple dataset for the task of object recognition, and gives good performance with non-pooling architectures as well. Hence, we use this dataset to test the performance of SDNs against other well-known classifiers. Among architectures that don't incorporate translational invariance, the best results for MNIST have been obtained using deep Boltzmann machines (DBMs). However, unsupervised pre-training of deep Boltzmann machines is very costly. Even for the MNIST dataset, pre-training a DBM with 500 units in the first hidden layer and 1000 units in the second hidden layer takes nearly 2 days to achieve an accuracy of 99.05% (after fine-tuning with backpropagation and no dropout). In contrast, we obtained the same result by infinitely stretching the weight matrix learnt by a ReLU RBM with 1000 hidden units after unsupervised training for nearly 200 seconds (6 iterations). No fine-tuning is used for this model.

Table 1. Classification accuracies on the MNIST dataset using various algorithms. Here, SDN-RBM refers to a single-layer infinitely stretched network whose weight matrix has been learnt using an RBM. Unsupervised training of the RBM took 200 seconds, while learning the parameters of the kernel SVM took about 30 minutes. DBN stands for a deep belief network as described by Hinton et al. (2006). The numbers in parentheses indicate the number of hidden units in each layer.

| Algorithm | Accuracy | Time taken |
|---|---|---|
| SDN-RBM (1000) | 99.05% | < 1 hr |
| DBN-RBM (1000) | 98.35% | < 1 hr |
| DBN-RBM (1000-500-2000) | 98.8% | 1.5 hrs |
| DBM (no dropout) | 99.05% | 2 days |
| DBM (with dropout) | 99.21% | 2 days |

Table 2. Classification accuracies on the Caltech-101 dataset using various algorithms. The numbers in parentheses indicate the number of hidden units in each layer. The state of the art is achieved by an 8-layered convolutional DBN with 5 convolution layers and 3 pooling layers (Zeiler & Fergus, 2013b).

| Algorithm | Accuracy |
|---|---|
| Conv. SDN-RBM (64-256) | 74.3% |
| Conv. DBN (64-256) (not fine-tuned) | 64.1% |
| Conv. DBN (64-256) (fine-tuned) | 65.5% |
| Conv. DBN (8-layered) | 86.5% |
6.2. Experimental results for Convolutional SDNs

Next, we present the results obtained by using convolutional SDNs for standard image recognition tasks on images of moderate size.

6.2.1. Caltech-101 dataset

The Caltech-101 dataset (Fei-Fei et al., 2007) consists of pictures of objects belonging to 101 categories, with about 40 to 800 objects per category. For each object class, we used 30 samples for training and a maximum of 30 samples for testing. We use the same preprocessing as done by Jarrett et al. (2009). Furthermore, we average the results over 5 draws from the training set.

Effectively, we use a similar architecture to that used by Jarrett et al. (2009). Three stages of feature extraction are used; the first two stages are convolution + pooling stages, while the last stage is a fully connected layer. We randomly extract $9 \times 9$ patches from the images and learn a weight matrix with 64 weight vectors from these patches using a ReLU RBM. The weight matrix is then stretched by multiplication with a random matrix of size $64 \times 64$. We use average pooling with a $10 \times 10$ boxcar filter and $5 \times 5$ down-sampling. The output of the first stage of feature extraction is 64 feature maps of size $26 \times 26$. In the second stage, we randomly combine 30 feature maps from the previous layer using 256 kernels of size $9 \times 9$. Again, we use average pooling in this layer, with a $6 \times 6$ boxcar filter and a $4 \times 4$ down-sampling step. The output of the second layer of feature extraction is 256 feature maps of size $4 \times 4$. In both stages, we use the $\max(0, x)$ non-linearity and local contrast normalization. In the third stage, we map the output of the second stage (256 feature maps of size $4 \times 4$) to infinitely many feature maps using an arc-cosine kernel. We repeat the entire process for 20 iterations to construct the final kernel matrix, as discussed in Algorithm 2.

By iterating over the above architecture for 20 iterations, we obtain an accuracy of 74.3%. We plot the change in classification performance with the number of iterations for the Caltech-101 dataset in Figure 4 for the first 10 iterations. The result should be compared with the accuracy achieved by an unstretched architecture of the same size, which is 64.1%. The accuracy achieved by fine-tuning the same architecture is 65.5%, as mentioned in (Jarrett et al., 2009).

Figure 4. The plot depicts the improvement in classification performance for the Caltech-101 dataset with respect to the number of iterations of stretching.

6.2.2. STL-10 dataset

The STL-10 dataset (Coates et al., 2011) is an image recognition dataset with 10 classes and 500 training and 800 test images per class. We use the same preprocessing as in the previous case. Again, three layers of feature extraction are used, with the final layer being a fully connected layer. As in the previous case, we randomly extract $9 \times 9$ patches from the images and learn a weight matrix with 64 weight vectors from these patches using a ReLU RBM.

Table 3. Classification accuracies on the STL-10 dataset using various algorithms. The best result is obtained using the algorithm proposed in (Swersky et al., 2013).

| Algorithm | Accuracy |
|---|---|
| Conv. SDN-RBM (64-256) | 65.7% |
| Conv. DBN (64-256) (not fine-tuned) | 53.9% |
| Conv. DBN (64-256) (fine-tuned) | 55.3% |
| Multi-task BayesOpt | 70.1% |

Figure 5. The plot depicts the improvement in classification performance for the STL-10 dataset with respect to the number of iterations of stretching.
The first pooling layer uses an $8 \times 8$ boxcar filter with $3 \times 3$ down-sampling. The second pooling layer uses a $5 \times 5$ boxcar filter with $4 \times 4$ down-sampling. The rest of the architecture remains the same as in the previous case. The output of the second stage is mapped to infinitely many feature maps by using the arc-cosine kernel. After iterating over the above architecture for 20 iterations, we achieve an accuracy of 65.7%. The change in accuracy of an SDN with the number of iterations is given in Figure 5. The result should be compared with the accuracy achieved by an unstretched architecture of the same size, which is 53.9%. The accuracy achieved by fine-tuning the same architecture is 55.3%.

7. Conclusion

In this paper, we proposed the technique called stretching for deep networks, and gave exact as well as approximate algorithms to compute the kernel matrix for stretched deep networks. We showed that for convolutional SDNs with a single pooling layer, the approximate kernel entry converges to the true kernel entry. Results on benchmark datasets suggest that stretching a convolutional neural network can sometimes give much better performance than fine-tuning it using backpropagation.

References

Amit, Yali and Geman, Donald. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545-1588, 1997.

Bay, Herbert, Tuytelaars, Tinne, and Van Gool, Luc. SURF: Speeded up robust features. In Proceedings of the Ninth European Conference on Computer Vision, May 2006.

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.

Bo, Liefeng, Ren, Xiaofeng, and Fox, Dieter. Multipath sparse coding using hierarchical matching pursuit. In NIPS Workshop on Deep Learning, 2012.

Cho, Youngmin and Saul, Lawrence K. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678-2697, 2010.

Ciresan, Dan, Meier, Ueli, and Schmidhuber, Jürgen. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), pp. 3642-3649. IEEE, 2012.

Coates, Adam, Ng, Andrew Y, and Lee, Honglak. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

Fei-Fei, Li, Fergus, Rob, and Perona, Pietro. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.

Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision, pp. 2146-2153. IEEE, 2009.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lee, Honglak, Battle, Alexis, Raina, Rajat, and Ng, Andrew. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pp. 801-808, 2006.

Lowe, David G. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91-110, 2004.

Pandey, Gaurav and Dukkipati, Ambedkar. To go deep or wide in learning? In Seventeenth International Conference on Artificial Intelligence and Statistics, JMLR W&CP, volume 33, pp. 724-732, 2014.

Rahimi, Ali and Recht, Benjamin. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pp. 1313-1320, 2008.

Salakhutdinov, Ruslan and Hinton, Geoffrey E. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pp. 448-455, 2009.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 2004-2012, 2013.

Zeiler, Matthew D and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013a.

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013b.