Network Approximation using Tensor Sketching

Shiva Prasad Kasiviswanathan¹, Nina Narodytska², and Hongxia Jin³
¹Amazon AWS AI, USA   ²VMware Research, USA   ³Samsung Research America, USA
kasivisw@gmail.com, nnarodytska@vmware.com, hongxia.jin@samsung.com
(Work done while the author was at Samsung Research America.)

Abstract

Deep neural networks are powerful learning models that achieve state-of-the-art performance on many computer vision, speech, and language processing tasks. In this paper, we study a fundamental question that arises when designing deep network architectures: Given a target network architecture, can we design a smaller network architecture that approximates the operation of the target network? The question is, in part, motivated by the challenge of parameter reduction (compression) in modern deep neural networks, as the ever-increasing storage and memory requirements of these networks pose a problem in resource-constrained environments. In this work, we focus on deep convolutional neural network architectures, and propose a novel randomized tensor sketching technique that we utilize to develop a unified framework for approximating the operation of both the convolutional and fully connected layers. By applying the sketching technique along different tensor dimensions, we design changes to the convolutional and fully connected layers that substantially reduce the number of effective parameters in a network. We show that the resulting smaller network can be trained directly and has a classification accuracy that is comparable to the original network.

1 Introduction

Deep neural networks have become ubiquitous in machine learning, with applications ranging from computer vision to speech recognition and natural language processing. The recent successes of convolutional neural networks (CNNs) for computer vision applications have, in part, been enabled by recent advances in scaling up these networks, leading to networks with millions of parameters. As these networks keep growing in their number of parameters, reducing their storage and computational costs has become critical for meeting the requirements of practical applications: while it is possible to train and deploy deep convolutional neural networks on modern clusters, their storage, memory bandwidth, and computational requirements make them prohibitive for embedded mobile applications. On the other hand, computer vision applications are growing in importance for mobile platforms. This dilemma gives rise to the following natural question: Given a target network architecture, is it possible to design a new smaller network architecture (i.e., with fewer parameters) that approximates the original (target) network architecture in its operations on all inputs?

In this paper, we present an approach for answering this network approximation question using the idea of tensor sketching. Network approximation is a powerful construct because it allows one to replace the original network with the smaller one for both training and subsequent deployment [Denil et al., 2013; Chen et al., 2015; Cheng et al., 2015; Yang et al., 2015; Sindhwani et al., 2015; Chen et al., 2016; Tai et al., 2016; Garipov et al., 2016]. (For clarity, we distinguish between the terms network and model: network refers to the network architecture that describes the transformation applied to the input, whereas model refers to a trained network with fixed parameters obtained by training a network on some training set.) That is, it completely eliminates the need for ever realizing the original network, even during the initial training phase, which is a highly desirable property when working in storage- and computation-constrained environments.
While approximating any network (circuit) using a smaller network (circuit) is computationally a hard problem [Umans, 1998], we study the problem of network approximation on convolutional neural networks. To approximate a convolutional neural network NN, we focus on its parametrized layers (the convolutional and fully connected layers). Consider any such layer L in the network NN. Let $\phi : \Gamma \times \Theta \rightarrow \Omega$ denote the function (transformation) applied by this layer, where $\Theta$ represents the parameter space of the function (generally, a tensor space of some order), and $\Gamma$ and $\Omega$ represent the input and output spaces, respectively. Our general idea is to replace $\phi$ by a randomized function $\hat{\phi} : \Gamma \times \widehat{\Theta} \rightarrow \Omega$ such that for every $\theta \in \Theta$ there exists $\hat{\theta} \in \widehat{\Theta}$ such that for every input $\gamma \in \Gamma$, $\mathbb{E}[\hat{\phi}(\gamma; \hat{\theta})] = \phi(\gamma; \theta)$, where the expectation is over the randomness of the function $\hat{\phi}$. In other words, $\hat{\phi}(\gamma; \hat{\theta})$ is an unbiased estimator of $\phi(\gamma; \theta)$. Additionally, we establish theoretical bounds on the variance of this estimator. Ideally, we want the representation length of $\hat{\theta}$ to be much smaller than that of $\theta$. For the construction of $\hat{\phi}$, we introduce a novel randomized tensor sketching idea. The rough idea is to create multiple sketches of the tensor space $\Theta$ by performing random linear projections along different dimensions of $\Theta$, and then perform a simple combination of these sketches. This new operation $\hat{\phi}$ defines a new layer that approximates the functionality $\phi$ of the layer L. Since $\hat{\phi}$ and $\phi$ have the same input and output dimensionality, we can replace the layer L in the network NN with this new (sketch counterpart) layer. Doing so for all the convolutional and fully connected layers in NN, while maintaining the rest of the architecture, leads to a smaller network $\widehat{\mathrm{NN}}$, which approximates the network NN. To the best of our knowledge, ours is the first work that uses the idea of sketching the parameter space for the task of network approximation.

The next issue is: can we efficiently train the smaller network $\widehat{\mathrm{NN}}$? We show that, with some changes to the standard training procedure, the parameters (which now represent sketches) of the constructed smaller network can be learnt space-efficiently on any training set D. Compared to the original network, there is also a slight improvement in the running time needed for various operations in this smaller network. This allows us to train $\widehat{\mathrm{NN}}$ directly on D to get a reduced model $\widehat{\mathrm{NN}}_D$. (The memory footprint of the reduced model $\widehat{\mathrm{NN}}_D$ can be further reduced using various careful operations such as pruning, binarization, quantization, low-rank decomposition, etc. [Gong et al., 2014; Han et al., 2015; Han et al., 2016; Soulié et al., 2015; Wu et al., 2015; Guo et al., 2016; Kim et al., 2016; Wang et al., 2016; Hubara et al., 2016a; Hubara et al., 2016b; Li et al., 2016; Zhu et al., 2016], which is beyond the scope of this work.) Our experimental evaluations, on different datasets and architectures, corroborate the excellent performance of our approach by showing that it increases the limits of achievable parameter reduction while almost preserving the original model accuracy, compared to several existing approximation techniques.

2 Preliminaries

We denote $[n] = \{1, \dots, n\}$. Vectors are in column-wise fashion, denoted by boldface letters. For a vector $\mathbf{v}$, $\mathbf{v}^\top$ denotes its transpose and $\|\mathbf{v}\|$ its Euclidean norm. For a matrix $M$, $\|M\|_F$ denotes its Frobenius norm. We use random matrices to create sketches of the matrices/tensors involved in the fully connected/convolutional layers. In this paper, for simplicity, we use random scaled sign (Rademacher) matrices.
We note that other families of distributions, such as subsampled randomized Hadamard transforms, can probably lead to additional computational efficiency gains when used for sketching.

Definition 1. Let $Z \in \mathbb{R}^{k \times d}$ be a random sign matrix with independent entries that are $+1$ or $-1$ with probability $1/2$. We define a random scaled sign matrix as $U = Z/\sqrt{k}$.

Here, $k$ is a parameter that is adjustable in our algorithm. We generally assume $k \ll d$. Note that $\mathbb{E}[U^\top U] = I_d$, where $I_d$ is the $d \times d$ identity matrix. Also, by linearity of expectation, for any matrix $M$ with $d$ columns, we have $\mathbb{E}[M U^\top U] = M\,\mathbb{E}[U^\top U] = M$.

Notations. We denote matrices by uppercase letters and higher-dimensional tensors by Euler script letters. A real $p$th-order tensor $\mathcal{T} \in \bigotimes_{i=1}^{p} \mathbb{R}^{d_i}$ is a member of the tensor product of Euclidean spaces $\mathbb{R}^{d_i}$ for $i \in [p]$. The different dimensions of the tensor are referred to as modes. The $(i_1, \dots, i_p)$th entry of a tensor $\mathcal{T}$ is denoted by $\mathcal{T}_{i_1 i_2 \dots i_p}$. The mode-$n$ matrix product (for $n \in [p]$) of a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \dots \times d_p}$ with a matrix $M \in \mathbb{R}^{k \times d_n}$ is denoted by $\mathcal{T} \times_n M$. Elementwise, we have $(\mathcal{T} \times_n M)_{i_1 \dots i_{n-1}\, j\, i_{n+1} \dots i_p} = \sum_{i_n=1}^{d_n} \mathcal{T}_{i_1 i_2 \dots i_p}\, M_{j i_n}$. A fiber of $\mathcal{T}$ is obtained by fixing all but one of the indices of the tensor. A flattening of a tensor $\mathcal{T}$ along a mode (dimension) $n$ (denoted by $\mathrm{mat}_n$) is a matrix whose columns correspond to mode-$n$ fibers of $\mathcal{T}$.

Tensor Sketching. Our network approximation is based on the idea of tensor sketching. Data sketching ideas have been successfully used in designing many machine-learning algorithms, especially in the setting of streaming data; see, e.g., [Woodruff, 2014]. Generally, sketching is used to construct a compact representation of the data so that certain properties of the data are (approximately) preserved. Our usage of sketching is, however, slightly different: instead of sketching the input data, we apply sketching to the parameters of the function. Also, we want to design sketching techniques that work uniformly for both matrices and higher-order tensors. For this, we define a new tensor sketch operation.

Definition 2 (Mode-$n$ Sketch). Given a tensor $\mathcal{T} \in \bigotimes_{i=1}^{p} \mathbb{R}^{d_i}$, the mode-$n$ sketch of $\mathcal{T}$ with respect to a random scaled sign matrix $U_n \in \mathbb{R}^{k \times d_n}$ for $n \in [p]$ is defined as the tensor $\mathcal{S}_n = \mathcal{T} \times_n U_n$, where $\times_n$ denotes the mode-$n$ matrix product.

Since we generally pick $k \ll d_n$, the space needed for storing the sketch $\mathcal{S}_n$ is a factor $d_n/k$ smaller than that needed for storing $\mathcal{T}$. In the case of matrices, the sketches are created by pre- or post-multiplying the matrix with random scaled sign matrices of appropriate dimensions. For example, given a matrix $W \in \mathbb{R}^{d_1 \times d_2}$, we can construct the mode-1 sketch (resp. mode-2 sketch) of $W$ as $W \times_1 U_1 = U_1 W$ (resp. $W \times_2 U_2 = W U_2^\top$). Given a sketch $S_1 = W \times_1 U_1$ (resp. $S_2 = W \times_2 U_2$) of a matrix $W$ and another matrix $M \in \mathbb{R}^{d_2 \times d_3}$, it is natural to use $U_1^\top S_1 M$ (resp. $S_2 U_2 M$) as an estimator for the matrix product $WM$. It is easy to see that both these estimators are unbiased.
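The following minimal NumPy sketch (ours, not from the paper; all sizes are illustrative) shows Definitions 1 and 2 and the mode-2 estimator in action: averaging $S_2 U_2 M$ over independent draws of $U_2$ recovers $WM$, even though each draw only stores a $d_1 \times k$ sketch of $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, k = 64, 128, 32, 16

W = rng.standard_normal((d1, d2))
M = rng.standard_normal((d2, d3))
exact = W @ M

def scaled_sign(rows, cols):
    """Random scaled sign (Rademacher) matrix U = Z / sqrt(rows), Z_ij in {-1, +1}."""
    Z = rng.choice([-1.0, 1.0], size=(rows, cols))
    return Z / np.sqrt(rows)

# Average the mode-2 estimator S2 U2 M over many independent draws of U2;
# by unbiasedness, the average should approach W M.
trials = 2000
est_mean = np.zeros_like(exact)
for _ in range(trials):
    U2 = scaled_sign(k, d2)            # U2 in R^{k x d2}
    S2 = W @ U2.T                      # mode-2 sketch of W, stored as d1 x k
    est_mean += (S2 @ U2 @ M) / trials

rel_err = np.linalg.norm(est_mean - exact) / np.linalg.norm(exact)
print(f"relative error of the averaged estimate: {rel_err:.3f}")
```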
The second part of the following proposition analyzes the variance of these estimators; the result will motivate our construction of sketch-based layers in the next section. We omit the proof due to space limitations.

Proposition 2.1. Let $W \in \mathbb{R}^{d_1 \times d_2}$. Let $U_1 \in \mathbb{R}^{k \times d_1}$ and $U_2 \in \mathbb{R}^{k \times d_2}$ be two independent random scaled sign matrices. Let $S_1 = U_1 W\ (= W \times_1 U_1)$ and $S_2 = W U_2^\top\ (= W \times_2 U_2)$. Then for any matrix $M \in \mathbb{R}^{d_2 \times d_3}$:
1. $\mathbb{E}[U_1^\top S_1 M] = WM$ and $\mathbb{E}[S_2 U_2 M] = WM$.
2. $\mathbb{E}\big[\|U_1^\top S_1 M - WM\|_F^2\big] \le \frac{2 d_1 \|WM\|_F^2}{k}$ and $\mathbb{E}\big[\|S_2 U_2 M - WM\|_F^2\big] \le \frac{2 \|W\|_F^2\, \|M\|_F^2}{k}$.

Notice that the variance terms decrease as $1/k$. The variance bound can also be plugged into Chebyshev's inequality to get a probability bound. Also, the variance bounds are quantitatively different based on whether the sketch $S_1$ or $S_2$ is used. In particular, depending on $W$ and $M$, one of the variance bounds could be substantially smaller than the other; e.g., if the columns of $M$ are in the null space of $W$, then $WM$ is a zero matrix, so while one bound gives a tight zero variance, the other one does not.

3 Sketch-based Network Architecture

We now describe our idea of approximating a network using tensor sketching.

3.1 Sketching Convolutional Layers

A typical convolutional layer in a CNN transforms a 3-dimensional input tensor $\mathcal{I}_{in} \in \mathbb{R}^{h_1 \times w_1 \times d_2}$ into an output tensor $\mathcal{I}_{out} \in \mathbb{R}^{h_2 \times w_2 \times d_1}$ by convolving $\mathcal{I}_{in}$ with the kernel tensor $\mathcal{K} \in \mathbb{R}^{d_2 \times h \times w \times d_1}$, where $h_2$ and $w_2$ depend on $h, w, h_1, w_1$ and possibly other parameters such as stride, spatial extent, and zero padding [Goodfellow et al., 2016]. We use $*$ to denote the convolution operation, $\mathcal{I}_{out} = \mathcal{I}_{in} * \mathcal{K}$. The exact definition of the convolution operator $*$, which depends on these additional parameters, is not very important for us; we only rely on the fact that the convolution operation can be realized using a matrix multiplication, as we explain below. (In a commonly used setting, with a stride of 1 and zero-padding of 0, $h_2 = h_1 - h + 1$ and $w_2 = w_1 - w + 1$, and $\mathcal{I}_{out} \in \mathbb{R}^{(h_1-h+1) \times (w_1-w+1) \times d_1}$ is defined as $(\mathcal{I}_{out})_{xys} = \sum_{i=1}^{h} \sum_{j=1}^{w} \sum_{c=1}^{d_2} \mathcal{K}_{cijs}\, (\mathcal{I}_{in})_{(x+i-1)(y+j-1)c}$.) Also, a convolutional layer could optionally be followed by the application of some non-linear activation function (such as ReLU or tanh); these are generally parameter free and do not affect our construction.

We use the tensor sketch operation (Definition 2) to reduce either the dimensionality of the input feature map ($d_2$) or of the output feature map ($d_1$) in the kernel tensor $\mathcal{K}$. In practice, the dimensions of the individual filters ($h$ and $w$) are small integers, which we therefore do not further reduce. The motivation for sketching along different dimensions comes from our mathematical analysis of the variance bounds (Theorem 3.1): as in Proposition 2.1, based on the relationship between $\mathcal{I}_{in}$ and $\mathcal{K}$, the variance could be substantially smaller in one case or the other. Another trick that works as a simple boosting technique is to utilize multiple sketches, each associated with an independent random matrix. Formally, we define a SK-CONV layer as follows (see also Figure 1).

Definition 3. A SK-CONV layer is parametrized by a sequence of tensor-matrix pairs $(\mathcal{S}_{11}, U_{11}), \dots, (\mathcal{S}_{1\ell}, U_{1\ell}), (\mathcal{S}_{21}, U_{21}), \dots, (\mathcal{S}_{2\ell}, U_{2\ell})$, where for $i \in [\ell]$, $\mathcal{S}_{1i} \in \mathbb{R}^{d_2 \times h \times w \times k}$, $\mathcal{S}_{2i} \in \mathbb{R}^{k \times h \times w \times d_1}$, and $U_{1i} \in \mathbb{R}^{k \times d_1}$, $U_{2i} \in \mathbb{R}^{khw \times d_2 hw}$ are independent random scaled sign matrices (we define $U_{2i} \in \mathbb{R}^{khw \times d_2 hw}$, instead of $U_{2i} \in \mathbb{R}^{k \times d_2}$, to simplify the construction), which on input $\mathcal{I}_{in} \in \mathbb{R}^{h_1 \times w_1 \times d_2}$ constructs $\hat{\mathcal{I}}_{out}$ as follows:
$$\hat{\mathcal{I}}_{out} = \frac{1}{2\ell} \sum_{i=1}^{\ell} \mathcal{I}_{in} * (\mathcal{S}_{1i} \times_4 U_{1i}^\top) + \mathcal{I}_{in} * (\mathcal{S}_{2i} \otimes U_{2i}^\top), \qquad (1)$$
where $\mathcal{S}_{2i} \otimes U_{2i}^\top \in \mathbb{R}^{d_2 \times h \times w \times d_1}$ is defined as
$$(\mathcal{S}_{2i} \otimes U_{2i}^\top)_{xyzs} = \sum_{c=1}^{k} \sum_{i=1}^{h} \sum_{j=1}^{w} (\mathcal{S}_{2i})_{cijs}\, (U_{2i})_{(cij)(xyz)}.$$
Here $(\mathcal{S}_{2i} \otimes U_{2i}^\top)_{xyzs}$ is the $(x, y, z, s)$th entry, $(\mathcal{S}_{2i})_{cijs}$ is the $(c, i, j, s)$th entry, and $(U_{2i})_{(cij)(xyz)}$ is the $(cij, xyz)$th entry in $\mathcal{S}_{2i} \otimes U_{2i}^\top$, $\mathcal{S}_{2i}$, and $U_{2i}$, respectively. (Letting $\mathcal{O}_i = \mathcal{S}_{2i} \otimes U_{2i}^\top$, this operation can be equivalently defined through $\mathrm{mat}_4(\mathcal{O}_i) = U_{2i}^\top\, \mathrm{mat}_4(\mathcal{S}_{2i})$.)

Figure 1: A SK-CONV layer with parameters $(\mathcal{S}_{11}, U_{11}), \dots, (\mathcal{S}_{1\ell}, U_{1\ell}), (\mathcal{S}_{21}, U_{21}), \dots, (\mathcal{S}_{2\ell}, U_{2\ell})$.

Running multiple sketches in parallel on the same input and taking the average also results in more stable performance across different choices of the random matrices. The number of free parameters in all the $\mathcal{S}_{1i}$ and $\mathcal{S}_{2i}$ tensors put together equals $\ell hwk(d_1 + d_2)$. (The random matrices, once picked, are not changed during training or deployment.) Therefore, with a SK-CONV layer, we get a reduction in the number of parameters compared to a traditional convolutional layer (with $hwd_1d_2$ parameters) if $k\ell < d_1 d_2/(d_1 + d_2)$. With this reduction, the time for computing $\hat{\mathcal{I}}_{out}$, ignoring the dependence on $h$ and $w$, reduces from $O(h_2 w_2 d_1 d_2)$ (in a traditional CONV layer) to $O(h_2 w_2 \ell k (d_1 + d_2))$ (in a SK-CONV layer).

Implementing a SK-CONV Layer with Matrix Multiplications. We next discuss how to implement a SK-CONV layer using just matrix multiplications. The convolution operation can be reduced to a matrix multiplication, an idea that is exploited by many deep learning frameworks [Chetlur et al., 2014]. The idea is to reformulate the kernel tensor $\mathcal{K}$ by flattening it along the dimension representing the output feature map, which in our setting is represented along the fourth dimension of $\mathcal{K}$. The input tensor $\mathcal{I}_{in}$ is used to form a matrix $I_{in} \in \mathbb{R}^{h_2 w_2 \times d_2 hw}$. This construction is quite standard and we refer the reader to [Chetlur et al., 2014] for more details. It then follows that $I_{out}$, defined as $I_{in}\, \mathrm{mat}_4(\mathcal{K}) \in \mathbb{R}^{h_2 w_2 \times d_1}$, is a reshaping of the output tensor $\mathcal{I}_{out}$ (i.e., $I_{out} = \mathrm{mat}_3(\mathcal{I}_{out})$). Using this equivalence and simple algebraic observations ($\mathrm{mat}_4(\mathcal{S}_{1i} \times_4 U_{1i}^\top) = \mathrm{mat}_4(\mathcal{S}_{1i})\, U_{1i}$ and $\mathrm{mat}_4(\mathcal{S}_{2i} \otimes U_{2i}^\top) = U_{2i}^\top\, \mathrm{mat}_4(\mathcal{S}_{2i})$), we can re-express the operation in (1) as:
$$\hat{I}_{out} = \frac{1}{2\ell} \sum_{i=1}^{\ell} I_{in}\, \mathrm{mat}_4(\mathcal{S}_{1i})\, U_{1i} + I_{in}\, U_{2i}^\top\, \mathrm{mat}_4(\mathcal{S}_{2i}). \qquad (2)$$
In other words, $\hat{I}_{out}$ equals $\frac{1}{2\ell} \sum_{i=1}^{\ell} I_{in}\, (\mathrm{mat}_4(\mathcal{S}_{1i}) \times_2 U_{1i}^\top) + I_{in}\, (\mathrm{mat}_4(\mathcal{S}_{2i}) \times_1 U_{2i}^\top)$. We use this matrix representation $\hat{I}_{out}$ of $\hat{\mathcal{I}}_{out}$ in our implementation of a SK-CONV layer, both in the forward pass and when deriving the gradients during back-propagation.

Theoretical Guarantees of a SK-CONV Layer. Given a traditional convolutional layer with kernel tensor $\mathcal{K}$ and independent random scaled sign matrices $U_{11}, \dots, U_{1\ell}, U_{21}, \dots, U_{2\ell}$, we can form a corresponding SK-CONV layer by constructing tensors $\mathcal{S}_{11}, \dots, \mathcal{S}_{1\ell}, \mathcal{S}_{21}, \dots, \mathcal{S}_{2\ell}$ such that $\mathrm{mat}_4(\mathcal{S}_{1i}) = \mathrm{mat}_4(\mathcal{K})\, U_{1i}^\top$ and $\mathrm{mat}_4(\mathcal{S}_{2i}) = U_{2i}\, \mathrm{mat}_4(\mathcal{K})$ for $i \in [\ell]$. The next theorem, based on Proposition 2.1, analyzes the expectation and the variance of using these sketches as an estimator for $\mathcal{I}_{out} = \mathcal{I}_{in} * \mathcal{K}$ ($\equiv I_{in}\, \mathrm{mat}_4(\mathcal{K})$).

Theorem 3.1 (Theoretical Guarantees of a SK-CONV Layer). Let $\mathcal{K} \in \mathbb{R}^{d_2 \times h \times w \times d_1}$ be a kernel tensor. Let $U_{11}, \dots, U_{1\ell} \in \mathbb{R}^{k \times d_1}$ and $U_{21}, \dots, U_{2\ell} \in \mathbb{R}^{khw \times d_2 hw}$ be a set of independent random scaled sign matrices. Let $\mathcal{S}_{11}, \dots, \mathcal{S}_{1\ell}, \mathcal{S}_{21}, \dots, \mathcal{S}_{2\ell}$ be tensors defined as above. Then for any input tensor $\mathcal{I}_{in} \in \mathbb{R}^{h_1 \times w_1 \times d_2}$ with $\mathcal{I}_{out} = \mathcal{I}_{in} * \mathcal{K}$:
1. Unbiased Estimation: $\mathbb{E}[\hat{\mathcal{I}}_{out}] = \mathcal{I}_{out}$.
2. Variance Bound: $\mathbb{E}\big[\|\hat{\mathcal{I}}_{out} - \mathcal{I}_{out}\|_F^2\big] \le \frac{d_1\, \|\mathcal{I}_{in} * \mathcal{K}\|_F^2}{\ell k} + \frac{\|I_{in}\|_F^2\, \|\mathcal{K}\|_F^2}{\ell k h w}$.
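To make the matrix form concrete, here is a small NumPy sketch (ours; variable names and sizes are illustrative and not from the paper) of the SK-CONV forward pass in (2), assuming the im2col-style matrix $I_{in}$ of shape $h_2w_2 \times d_2hw$ has already been formed from the input tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, h, w = 96, 96, 3, 3        # output/input feature maps and filter size
h2w2 = 30 * 30                     # number of output spatial positions
k, ell = 16, 2                     # sketch width k and number of sketch pairs ell

I_in = rng.standard_normal((h2w2, d2 * h * w))   # im2col matrix of the input tensor

# Learnable sketch parameters, kept in flattened mat4 form ...
mat4_S1 = [rng.standard_normal((d2 * h * w, k)) for _ in range(ell)]
mat4_S2 = [rng.standard_normal((k * h * w, d1)) for _ in range(ell)]
# ... and fixed random scaled sign matrices (Definition 1).
U1 = [rng.choice([-1.0, 1.0], (k, d1)) / np.sqrt(k) for _ in range(ell)]
U2 = [rng.choice([-1.0, 1.0], (k * h * w, d2 * h * w)) / np.sqrt(k * h * w)
      for _ in range(ell)]

# Eq. (2): average of 2*ell sketched convolutions, each a chain of small matmuls.
I_out_hat = sum(I_in @ mat4_S1[i] @ U1[i] + I_in @ U2[i].T @ mat4_S2[i]
                for i in range(ell)) / (2 * ell)        # shape (h2*w2, d1)

# Parameter count: ell*h*w*k*(d1+d2) = 55,296 here vs. h*w*d1*d2 = 82,944 for the
# dense kernel; compression needs k*ell < d1*d2/(d1+d2) = 48 (here k*ell = 32).
print(I_out_hat.shape, ell * h * w * k * (d1 + d2), h * w * d1 * d2)
```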
Training a SK-CONV Layer. We now discuss a procedure for training a SK-CONV layer. Let $\mathrm{Loss}()$ denote some loss function for the network. For computational and space efficiency, our goal is to perform the training without ever needing to construct the complete kernel tensor $\mathcal{K}$. We focus on deriving the gradient of the loss with respect to the parameters of a SK-CONV layer, which can then be used for back-propagating gradients. We can again exploit the equivalence between the convolution operation and matrix multiplication. Consider the operation performed in the SK-CONV layer as defined in (2). Let $G = \frac{\partial\, \mathrm{Loss}}{\partial \hat{I}_{out}} \in \mathbb{R}^{h_2 w_2 \times d_1}$. For $i \in [\ell]$,
$$\frac{\partial\, \mathrm{Loss}}{\partial\, \mathrm{mat}_4(\mathcal{S}_{1i})} = \frac{I_{in}^\top G\, U_{1i}^\top}{2\ell}, \qquad \frac{\partial\, \mathrm{Loss}}{\partial\, \mathrm{mat}_4(\mathcal{S}_{2i})} = \frac{U_{2i}\, I_{in}^\top G}{2\ell},$$
and
$$\frac{\partial\, \mathrm{Loss}}{\partial I_{in}} = \sum_{i=1}^{\ell} \frac{G\, U_{1i}^\top\, \mathrm{mat}_4(\mathcal{S}_{1i})^\top}{2\ell} + \sum_{i=1}^{\ell} \frac{G\, \mathrm{mat}_4(\mathcal{S}_{2i})^\top\, U_{2i}}{2\ell}.$$
(The gradients computed with respect to $\mathrm{mat}_4(\mathcal{S}_{1i})$ and $\mathrm{mat}_4(\mathcal{S}_{2i})$ can be converted back into tensors by reversing the $\mathrm{mat}_4()$ operator.) Notice that all the required operations can be carried out without ever explicitly forming the complete $d_2 \times h \times w \times d_1$ kernel tensor.

3.2 Sketching Fully Connected Layers

Neurons in a fully connected (FC) layer have full connections to all activations in the previous layer. These layers apply a linear transformation of the input. Let $W \in \mathbb{R}^{d_1 \times d_2}$ represent a weight matrix and $\mathbf{b} \in \mathbb{R}^{d_1}$ represent a bias vector. The operation of the FC layer on an input $\mathbf{h}_{in}$ can be described as:
$$\mathbf{a} = W \mathbf{h}_{in} + \mathbf{b}. \qquad (3)$$
Typically, the FC layer is followed by the application of some non-linear activation function. As in the case of CONV layers, our construction is independent of the applied activation function, and we omit further discussion of these functions. Our idea is to use the tensor sketch operation (Definition 2) to sketch either the columns or the rows of the weight matrix.

Definition 4. A SK-FC layer is parametrized by a bias vector $\mathbf{b} \in \mathbb{R}^{d_1}$ and a sequence of matrix pairs $(S_{11}, U_{11}), \dots, (S_{1\ell}, U_{1\ell}), (S_{21}, U_{21}), \dots, (S_{2\ell}, U_{2\ell})$, where for $i \in [\ell]$, $S_{1i} \in \mathbb{R}^{k \times d_2}$, $S_{2i} \in \mathbb{R}^{d_1 \times k}$, and $U_{1i} \in \mathbb{R}^{k \times d_1}$, $U_{2i} \in \mathbb{R}^{k \times d_2}$ are independent random scaled sign matrices, which on input $\mathbf{h}_{in} \in \mathbb{R}^{d_2}$ constructs $\hat{\mathbf{a}}$ as follows:
$$\hat{\mathbf{a}} = \frac{1}{2\ell} \sum_{i=1}^{\ell} U_{1i}^\top S_{1i} \mathbf{h}_{in} + \frac{1}{2\ell} \sum_{i=1}^{\ell} S_{2i} U_{2i} \mathbf{h}_{in} + \mathbf{b}. \qquad (4)$$

Note that $\hat{\mathbf{a}}$ in the above definition could equivalently be represented as $\hat{\mathbf{a}} = \frac{1}{2\ell} \sum_{i=1}^{\ell} (S_{1i} \times_1 U_{1i}^\top)\, \mathbf{h}_{in} + \frac{1}{2\ell} \sum_{i=1}^{\ell} (S_{2i} \times_2 U_{2i}^\top)\, \mathbf{h}_{in} + \mathbf{b}$. The number of free parameters in all the $S_{1i}$ and $S_{2i}$ matrices put together is $\ell k (d_1 + d_2)$. Hence, compared to a traditional weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, we get a reduction in the number of parameters if $k\ell < d_1 d_2/(d_1 + d_2)$. Another advantage is that the time needed for computing the pre-activation value ($\hat{\mathbf{a}}$ in (4)) in a SK-FC layer is $O(\ell k (d_1 + d_2))$, which is smaller than the $O(d_1 d_2)$ time needed in the traditional FC setting if the values of $k$ and $\ell$ satisfy the above condition.

Theoretical Guarantees of a SK-FC Layer. Given a traditional FC layer with weight matrix $W$ (as in (3)) and independent random scaled sign matrices $U_{11}, \dots, U_{1\ell}, U_{21}, \dots, U_{2\ell}$, we can form a corresponding SK-FC layer by setting $S_{1i} = U_{1i} W$ and $S_{2i} = W U_{2i}^\top$. We now analyze the properties of this construction. The next theorem, based on Proposition 2.1, analyzes the expectation and the variance of using these sketches as an estimator for $\mathbf{a} = W \mathbf{h}_{in} + \mathbf{b}$ for a vector $\mathbf{h}_{in} \in \mathbb{R}^{d_2}$.

Theorem 3.2. Let $W \in \mathbb{R}^{d_1 \times d_2}$. Let $U_{11}, \dots, U_{1\ell} \in \mathbb{R}^{k \times d_1}$ and $U_{21}, \dots, U_{2\ell} \in \mathbb{R}^{k \times d_2}$ be a set of independent random scaled sign matrices. Let $S_{1i} = U_{1i} W\ (= W \times_1 U_{1i})$ and $S_{2i} = W U_{2i}^\top\ (= W \times_2 U_{2i})$ for $i \in [\ell]$. Then for any $\mathbf{h}_{in} \in \mathbb{R}^{d_2}$ and $\mathbf{b} \in \mathbb{R}^{d_1}$ with $\mathbf{a} = W \mathbf{h}_{in} + \mathbf{b}$:
1. Unbiased Estimation: $\mathbb{E}[\hat{\mathbf{a}}] = \mathbf{a}$.
2. Variance Bound: $\mathbb{E}\big[\|\hat{\mathbf{a}} - \mathbf{a}\|^2\big] \le \frac{d_1 \|W \mathbf{h}_{in}\|^2}{\ell k} + \frac{\|W\|_F^2\, \|\mathbf{h}_{in}\|^2}{\ell k}$.
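A minimal NumPy sketch of the SK-FC forward pass in (4) (ours; sizes are illustrative): the $d_1 \times d_2$ weight matrix is never formed, and the pre-activation is computed directly from the $\ell k(d_1+d_2)$ sketch parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, k, ell = 768, 768, 32, 2

h_in = rng.standard_normal(d2)
b = rng.standard_normal(d1)

# Learnable sketches (ell*k*(d1+d2) parameters in total) ...
S1 = [rng.standard_normal((k, d2)) for _ in range(ell)]
S2 = [rng.standard_normal((d1, k)) for _ in range(ell)]
# ... and fixed random scaled sign matrices.
U1 = [rng.choice([-1.0, 1.0], (k, d1)) / np.sqrt(k) for _ in range(ell)]
U2 = [rng.choice([-1.0, 1.0], (k, d2)) / np.sqrt(k) for _ in range(ell)]

# Eq. (4): a_hat = (1/2ell) sum_i U1i^T S1i h_in + (1/2ell) sum_i S2i U2i h_in + b,
# computed in O(ell*k*(d1+d2)) time via matrix-vector products only.
a_hat = sum(U1[i].T @ (S1[i] @ h_in) + S2[i] @ (U2[i] @ h_in)
            for i in range(ell)) / (2 * ell) + b
print(a_hat.shape)   # (768,)
```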
Training a SK-FC Layer. We discuss a procedure for training a network containing SK-FC layers. Let $\mathrm{Loss}()$ denote some loss function for the network. First consider a single pair $(S_2, U_2)$, and let $\mathbf{c} = S_2 U_2 \mathbf{h}_{in} + \mathbf{b}$ and $\mathbf{g} = \frac{\partial\, \mathrm{Loss}}{\partial \mathbf{c}}$. In this case, using the chain rule of calculus,
$$\frac{\partial\, \mathrm{Loss}}{\partial S_2} = \mathbf{g} \mathbf{h}_{in}^\top U_2^\top = (\mathbf{g} \mathbf{h}_{in}^\top) \times_2 U_2. \qquad (5)$$
Similarly, the gradient with respect to $\mathbf{h}_{in}$ is:
$$\frac{\partial\, \mathrm{Loss}}{\partial \mathbf{h}_{in}} = (S_2 U_2)^\top \mathbf{g} = (S_2 \times_2 U_2^\top)^\top \mathbf{g}. \qquad (6)$$
Now let $\mathbf{c} = U_1^\top S_1 \mathbf{h}_{in} + \mathbf{b} = (S_1^\top U_1)^\top \mathbf{h}_{in} + \mathbf{b}$, and again let $\mathbf{g} = \frac{\partial\, \mathrm{Loss}}{\partial \mathbf{c}}$. Applying the chain rule gives $\frac{\partial\, \mathrm{Loss}}{\partial S_1} = \sum_{i=1}^{d_1} \frac{\partial\, \mathrm{Loss}}{\partial c_i} \frac{\partial c_i}{\partial S_1}$, where $c_i$ denotes the $i$th entry of $\mathbf{c}$. Since $c_i = \mathbf{u}_{1i}^\top S_1 \mathbf{h}_{in} + b_i$, we can write $\frac{\partial c_i}{\partial S_1} = \mathbf{u}_{1i} \mathbf{h}_{in}^\top$, where $\mathbf{u}_{1i}$ is the $i$th column of $U_1$. Therefore, we get
$$\frac{\partial\, \mathrm{Loss}}{\partial S_1} = \sum_{i=1}^{d_1} g_i\, \mathbf{u}_{1i} \mathbf{h}_{in}^\top = U_1 \mathbf{g} \mathbf{h}_{in}^\top = (\mathbf{g} \mathbf{h}_{in}^\top) \times_1 U_1, \qquad (7)$$
where $g_i$ denotes the $i$th entry of $\mathbf{g}$. Finally, the gradient with respect to $\mathbf{h}_{in}$ in this case equals:
$$\frac{\partial\, \mathrm{Loss}}{\partial \mathbf{h}_{in}} = (S_1^\top U_1) \mathbf{g} = (S_1 \times_1 U_1^\top)^\top \mathbf{g}. \qquad (8)$$
Putting together (5), (6), (7), and (8) gives the necessary gradients for the SK-FC layer (where $\hat{\mathbf{a}}$ is defined using (4)). Let $\mathbf{g} = \frac{\partial\, \mathrm{Loss}}{\partial \hat{\mathbf{a}}}$. For $i \in [\ell]$,
$$\frac{\partial\, \mathrm{Loss}}{\partial S_{1i}} = \frac{U_{1i}\, \mathbf{g} \mathbf{h}_{in}^\top}{2\ell}, \qquad \frac{\partial\, \mathrm{Loss}}{\partial S_{2i}} = \frac{\mathbf{g} \mathbf{h}_{in}^\top U_{2i}^\top}{2\ell},$$
and
$$\frac{\partial\, \mathrm{Loss}}{\partial \mathbf{h}_{in}} = \sum_{i=1}^{\ell} \frac{S_{1i}^\top U_{1i} \mathbf{g}}{2\ell} + \sum_{i=1}^{\ell} \frac{U_{2i}^\top S_{2i}^\top \mathbf{g}}{2\ell}.$$
Note that all the above computations can be performed without ever explicitly forming the complete $d_1 \times d_2$ weight matrix.
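The gradient formulas above translate directly into a few small matrix-vector products. The following NumPy sketch (ours; it reuses the shapes of the forward-pass example, with $S_{1i} \in \mathbb{R}^{k \times d_2}$ and $S_{2i} \in \mathbb{R}^{d_1 \times k}$) is one way to implement the SK-FC backward pass.

```python
import numpy as np

def sk_fc_backward(g, h_in, S1, S2, U1, U2):
    """Gradients of the loss w.r.t. each S1i, S2i and the input h_in,
    given g = dLoss/da_hat for a SK-FC layer (Eq. (4))."""
    ell = len(S1)
    # dLoss/dS1i = U1i g h_in^T / (2*ell)                             -> (k, d2)
    dS1 = [np.outer(U1[i] @ g, h_in) / (2 * ell) for i in range(ell)]
    # dLoss/dS2i = g h_in^T U2i^T / (2*ell) = g (U2i h_in)^T / (2*ell) -> (d1, k)
    dS2 = [np.outer(g, U2[i] @ h_in) / (2 * ell) for i in range(ell)]
    # dLoss/dh_in = sum_i (S1i^T U1i g + U2i^T S2i^T g) / (2*ell)      -> (d2,)
    dh_in = sum(S1[i].T @ (U1[i] @ g) + U2[i].T @ (S2[i].T @ g)
                for i in range(ell)) / (2 * ell)
    return dS1, dS2, dh_in

# As in the forward pass, the d1 x d2 weight matrix is never materialized;
# all work is done on the small sketches and the fixed random sign matrices.
```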
3.3 Final Construction of $\widehat{\mathrm{NN}}$

Given a convolutional neural network NN, we construct $\widehat{\mathrm{NN}}$, an approximation of NN, by replacing the convolutional layers (resp. fully connected layers) with SK-CONV layers (resp. SK-FC layers). A nice feature of this construction is that, based on need, we can also choose to replace only some of the layers of NN with their sketch counterpart layers.

4 Comparison to Previous Work

Deep neural networks are typically over-parameterized, and there is significant redundancy in deep learning networks [Denil et al., 2013]. There have been several previous attempts to reduce the complexity of deep neural networks under a variety of contexts. Most relevant to our paper is a line of work on approximating both the fully connected and convolutional layers. Denil et al. [Denil et al., 2013] suggested an approach based on learning a low-rank factorization of the matrices involved within each layer of a CNN. Instead of learning both factors of a factorization during training, the authors suggest techniques for carefully constructing one of the factors (called the dictionary), while only learning the other one. Our sketching-based approach is related to low-rank factorization; however, using sketching we eliminate the overhead of carefully constructing the dictionary. Tai et al. [Tai et al., 2016] achieve parameter reduction using a tensor decomposition technique that is based on replacing the convolutional kernel with two consecutive kernels of lower rank. The issue with this approach is that, with the increased depth of the resulting network, training becomes more challenging, and the authors rely on batch normalization [Ioffe and Szegedy, 2015] to overcome this issue. In our approach, the depth of the reduced network remains equal to that of the original network, and the reduced network can be trained with or without batch normalization. Chen et al. [Chen et al., 2016] combine the hashing idea from [Chen et al., 2015] with the discrete cosine transform (DCT) to compress filters in a convolutional layer. Their architecture, called FreshNets, first converts filter weights into the frequency domain using the discrete cosine transform and then uses the hashing idea to randomly group the resulting frequency parameters into buckets. Our sketches are created using random projections, which is related to the hashing trick used in these results; however, our techniques are naturally attractive for convolutional neural networks as random projections are known to preserve spatial locality [Johnson and Lindenstrauss, 1984], a property that is not preserved by simple hashing. Also, in contrast to FreshNets, our architectures require just simple linear transformations for both fully connected and convolutional layers, and do not require special routines for the DCT, inverse DCT, etc.

There is a long line of work on reducing model memory size by post-processing a trained network (sometimes with further fine-tuning of the compressed model) [Gong et al., 2014; Han et al., 2015; Soulié et al., 2015; Wu et al., 2015; Guo et al., 2016; Kim et al., 2016; Wang et al., 2016; Hubara et al., 2016b; Zhu et al., 2016; Li et al., 2016]. Techniques such as pruning, binarization, quantization, low-rank decomposition, etc., are intermingled with training of a network on a dataset to construct a reduced model. These results do not achieve a direct network approximation, as the training happens on the original network. In practice, one can combine our approach with some of the above model post-processing techniques to further reduce the storage requirements of the trained model (which is beyond the scope of this paper).

5 Experimental Evaluation

In this section, we experimentally demonstrate the effectiveness of our proposed network approximation approach.

Metrics. We define the compression rate as the ratio between the number of parameters in the reduced (compressed) network architecture and the number of parameters in the original (uncompressed) network architecture. The top-1 error of a trained model is denoted by ERRTOP-1.

Datasets. We use 5 popular image datasets: CIFAR10, SVHN, STL10, ImageNet10 (a subset of the ImageNet1000 dataset), and Places2. Note that Places2 is a challenging dataset that was used in the ILSVRC 2016 Scene Classification challenge.

Network Architectures. We present our experiments on two different network architectures: Network-in-Network [Lin et al., 2014] (NinN) and GoogLeNet [Szegedy et al., 2015] (which we use for the Places2 dataset). The choice of architectures was made keeping in mind the limited computational resources at our disposal and a recent trend of moving away from fully connected layers in CNNs. A common observation is that reducing the number of parameters in convolutional layers seems to be a much more challenging problem than that for fully connected layers. NinN achieves a baseline top-1 error of 17.7%, 43.2%, 6.0%, and 27.1% on the CIFAR10, STL10, SVHN, and ImageNet10 datasets, respectively. Similarly, GoogLeNet achieves a baseline top-1 error of 32.3% on the Places2 dataset.

Figure 2: Top-1 error for NinN as we decrease the compression rate by compressing one convolutional layer at a time, each by a factor of 10.

Figure 3: Top-1 error for NinN+FC as we decrease the compression rate by compressing one convolutional layer at a time, each by a factor of 4, and the fully connected layer by a factor of 4. The size of the FC layer is about half of the total size of convolutional layers CONV2 to CONV8.

Baseline Techniques.
We compare our approach with four state-of-the-art techniques that approximate both the convolutional and the fully connected layers: the FreshNets technique, which uses hashing in the frequency domain to approximate the convolutional layers [Chen et al., 2016], the low-rank decomposition technique of [Denil et al., 2013] (LOWRANK1), and the tensor decomposition technique of [Tai et al., 2016] (LOWRANK2). While using FreshNets, we also use the HashedNets technique of feature hashing [Chen et al., 2015] for compressing the fully connected layers, as suggested by [Chen et al., 2016].

Results. Figure 2 shows the results of our first set of experiments. In this case, we use the NinN architecture. If a point is missing in the plots, then the corresponding network training failed. We expect the error to go up as we decrease the compression rate, i.e., increase the parameter reduction. We observe this general trend in almost all our plots, with minor fluctuations on the SVHN dataset. We make two main observations from these plots. First, our method was always able to get to a better compression rate compared to the other techniques, in that these comparative techniques started failing sooner as we kept decreasing the compression rate. For example, our approach consistently achieves a compression rate of 0.15 that none of the other techniques even get close to achieving. Second, our approach also almost always achieves better accuracy when compared to the other techniques. As explained in Section 4, our approach has some advantages over the compared techniques, especially in terms of its ability to approximate (compress) the convolutional layers.

Next, we add fully connected layers into the mix and consider results on both the convolutional and fully connected layers. To do so, we used a modified NinN architecture (denoted NinN+FC) in our experiments, where we replaced the last convolutional layer (CONV9) with a fully connected layer of size 768 × 768 followed by a classifier layer of size 768 × 10. In Figure 3, we present the results of these experiments. Our approach again outperforms the other techniques in terms of both accuracy and the maximum achievable compression rate. The results demonstrate the effectiveness of the proposed approach on both the convolutional and fully connected layers. An interesting observation from our experiments is that we can gain up to 4% or lose up to 2% of accuracy compared to the original network accuracy. The fact that sometimes our reduced network was able to gain a bit of accuracy over the original network suggests that our randomized technique also acts as an implicit regularizer during training.

To evaluate our approach on a large dataset, we ran additional experiments on the Places2 dataset (using a centered crop). Here we used the GoogLeNet architecture with batch normalization. Due to limited computational resources, we ran a single experiment where we compressed all but the first layer to achieve a compression rate of about 0.2. At this compression level, training for none of the competitor methods succeeded, whereas our approach gave a top-1 error of 36.4%. Note that the top-1 error of the original GoogLeNet on this dataset is 32.3%. This demonstrates that our approach manages to generate smaller networks that perform well even on large datasets.

References

[Chen et al., 2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[Chen et al., 2016] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing convolutional neural networks in the frequency domain. In KDD, 2016.
[Cheng et al., 2015] Yu Cheng, Felix X. Yu, Rogerio S. Feris, Sanjiv Kumar, Alok Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, pages 2857–2865, 2015.
[Chetlur et al., 2014] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv, 2014.
[Denil et al., 2013] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, pages 2148–2156, 2013.
[Garipov et al., 2016] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: Compressing convolutional and FC layers alike. arXiv, 2016.
[Gong et al., 2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv, 2014.
[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Book in preparation for MIT Press, 2016.
[Guo et al., 2016] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.
[Han et al., 2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
[Han et al., 2016] Song Han, Huizi Mao, and William J. Dally. A deep neural network compression pipeline: Pruning, quantization, Huffman encoding. In ICLR, 2016.
[Hubara et al., 2016a] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
[Hubara et al., 2016b] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[Johnson and Lindenstrauss, 1984] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[Kim et al., 2016] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
[Li et al., 2016] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[Lin et al., 2014] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014.
[Sindhwani et al., 2015] Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In NIPS, 2015.
[Soulié et al., 2015] Guillaume Soulié, Vincent Gripon, and Maëlys Robert. Compression of deep neural networks on the fly. arXiv, 2015.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[Tai et al., 2016] Cheng Tai, Tong Xiao, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. In ICLR, 2016.
[Umans, 1998] Christopher Umans. The minimum equivalent DNF problem and shortest implicants. In 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 556–563. IEEE, 1998.
[Wang et al., 2016] Yunhe Wang, Chang Xu, Shan You, Dacheng Tao, and Chao Xu. CNNpack: Packing convolutional neural networks in the frequency domain. In NIPS, 2016.
[Woodruff, 2014] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.
[Wu et al., 2015] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. arXiv, 2015.
[Yang et al., 2015] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In ICCV, 2015.
[Zhu et al., 2016] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.