# DECOUPLING THE LAYERS IN RESIDUAL NETWORKS

Ricky Fok, Aijun An, Zana Rashidi
Department of Electrical Engineering and Computer Science, York University
4700 Keele Street, Toronto, M3J 1P3, Canada
ricky.fok3@gmail.com, aan@cse.yorku.ca, zrashidi@cse.yorku.ca

Xiaogang Wang
Department of Mathematics and Statistics, York University
4700 Keele Street, Toronto, M3J 1P3, Canada
stevenw@mathstat.yorku.ca

ABSTRACT

We propose the Warped Residual Network (WarpNet), which uses a parallelizable warp operator for forward and backward propagation to distant layers and trains faster than the original residual network. We apply perturbation theory to residual networks and decouple the interactions between residual units. The resulting warp operator is a first order approximation of the output over multiple layers. The first order perturbation theory exhibits properties such as binomial path lengths and exponential gradient scaling found experimentally by Veit et al. (2016). We demonstrate through an extensive performance study that the proposed network achieves predictive performance comparable to the original residual network with the same number of parameters, while achieving a significant speed-up in total training time. As WarpNet performs model parallelism in residual network training, with weights distributed over different GPUs, it offers both a speed-up and the capability to train larger networks than the original residual networks.

1 INTRODUCTION

Deep convolutional neural networks (CNNs) have been used in image recognition tasks with great success. Since AlexNet (Krizhevsky et al., 2012), many other neural architectures have been proposed, each achieving state-of-the-art results at the time. Notable architectures include VGG (Simonyan & Zisserman, 2015), Inception (Szegedy et al., 2015) and residual networks (ResNet) (He et al., 2015).

Training a deep neural network is not an easy task. As the gradient at each layer depends multiplicatively on the gradients of higher layers, the gradients in earlier layers can vanish or explode, halting the training process. The vanishing gradient problem is significant for activation functions such as the sigmoid, whose gradient approaches zero exponentially away from the origin on both sides. The standard approach to combat vanishing gradients is to apply Batch Normalization (BN) (Ioffe & Szegedy, 2015) followed by the Rectified Linear Unit (ReLU) (Hahnloser et al., 2000) activation. More recently, skip connections (Srivastava et al., 2015) have been proposed to allow the outputs of earlier layers to propagate relatively unchanged. Using this methodology, the authors of Srivastava et al. (2015) were able to train extremely deep networks (hundreds of layers), and residual networks with about one thousand layers have been trained (He et al., 2015).

As the number of layers grows large, so does the training time. To evaluate the neural network's output, one needs to propagate the input of the network layer by layer in a procedure known as forward propagation. Likewise, during training, one needs to propagate the gradient of the loss function from the end of the network to update the model parameters, or weights, in each layer of the network using gradient descent. The complexity of forward and backward propagation is O(K), where K is the number of layers in the network.
To speed up the process, one may ask whether there exists a shallower network that accurately approximates a deep network, so that training time is reduced. In this work we show that there is indeed a neural network architecture that permits such an approximation: the ResNet. Residual networks typically consist of a long chain of residual units. Recent investigations suggest that ResNets behave as an ensemble of shallow networks (Veit et al., 2016). Empirical evidence supporting this claim includes the observation that randomly deactivating residual units during training (similar to drop-out (Srivastava et al., 2014)) appears to improve performance (Huang et al., 2016). These results imply that the output of a residual unit is just a small perturbation of its input.

In this work, we approximate the ResNet by a series expansion in this small perturbation. We find that the first term in the series expansion alone is sufficient to explain the binomial distribution of path lengths and the exponential gradient scaling experimentally observed by Veit et al. (2016). The approximation allows us to effectively estimate the output of subsequent layers using just the input of the first layer, yielding a modified forward propagation rule. We call the corresponding operator the warp operator. The backpropagation rule is obtained by differentiating the warp operator. We implemented a network using the warp operator and found that it trains faster on image classification tasks, with predictive accuracies comparable to those of the original ResNet.

Our contributions in this work include:

- We analytically investigate the properties of ResNets. In particular, we show that the first order term in the Taylor series expansion of the layer output across K residual units has a binomial number of terms, which are interpreted as the number of paths in Veit et al. (2016), and that for ReLU activations the second and higher order terms in the Taylor series vanish almost exactly.
- Based on the above analysis, we propose a novel architecture, WarpNet, which employs a warp operator as a parallelizable propagation rule across multiple layers at a time. WarpNet is an approximation to a ResNet with the same number of weights.
- We conduct experiments with WarpNet skipping over one and two residual units and show that it achieves predictive performance comparable to the original ResNet while providing a significant speed-up. WarpNet also compares favorably with data parallelism over mini-batches on ResNet. As opposed to data-parallelized ResNet, where nearly all the weights are copied to all GPUs, the weights in WarpNet are distributed over the GPUs, which enables the training of larger networks.

The organization of this paper is as follows. In Section 2 we analyze the properties of ResNets and show that the binomial path length arises from a Taylor expansion to first order. In Section 3 we describe the Warped Residual Network. In Section 4 we show that WarpNet can attain performance similar to the original ResNet while offering a speed-up.

2 PROPERTIES OF RESNETS

In this section we show that recent numerical results (Veit et al., 2016) are explained when perturbation theory is applied to ResNets. Consider the input $x_i$ of the $i$-th residual unit and its output $x_{i+1}$, where

$$x_{i+1} = h_i(x_i) + F_i(x_i, W_i). \qquad (1)$$

Typically, $h_i(x_i)$ is taken to be the identity mapping, $h_i(x_i) = x_i$. When the feature maps are down-sampled, $h_i$ is usually taken to be a 1×1 convolution layer with a stride of 2.
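For concreteness, the following minimal sketch (our own illustration, not the authors' code) implements the residual recursion of Equation (1) with identity skip connections, using small random affine-ReLU branches as stand-ins for the convolutional blocks; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, scale = 16, 4, 0.1   # feature size, number of residual units, perturbation scale (illustrative)

# Stand-ins for the residual branches F_i(x, W_i): a small affine map followed by ReLU.
weights = [scale * rng.standard_normal((D, D)) for _ in range(K)]
F = lambda x, W: np.maximum(W @ x, 0.0)

def forward(x, weights):
    """Forward propagation of Equation (1) with identity skip connections h_i(x) = x."""
    for W in weights:
        x = x + F(x, W)
    return x

x1 = rng.standard_normal(D)
print(forward(x1, weights)[:4])
```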
The function $F_i$ is a combination of convolution, normalization and non-linearity layers, so that $W_i$ collectively represents the weights of all layers in $F_i$. In this work we only consider the case where the skip connection is the identity, $h_i(x_i) = x_i$.

**Perturbative feature map flow.** First, we show that the interpretation of ResNets as an ensemble of subnetworks is accurate up to first order in $F$ with identity mapping. One can approximate the output of a chain of residual units by a series expansion. For instance, the output $x_3$ of two residual units is related to the input of the first unit as follows (we call the process in which $x_k$ is expressed in terms of $x_{k-1}$ an iteration; the following equations show two iterations):

$$\begin{aligned} x_3 &= x_2 + F_2(x_2, W^*_2) \\ &= x_1 + F_1(x_1, W^*_1) + F_2(x_1 + F_1(x_1, W^*_1), W^*_2) \\ &= x_1 + F_1(x_1, W^*_1) + F_2(x_1, W^*_2) + F'_2(x_1, W^*_2)\,F_1(x_1, W^*_1) + O(\epsilon^2), \end{aligned} \qquad (2)$$

where $F'_2(x_1, W^*_2)$ denotes the partial derivative of $F_2$ with respect to $x_1$, and $W^*_i$ denotes the weights at the loss minimum. A Taylor series expansion in powers of $F_1$ was performed on $F_2$ in the second line above (the Taylor expansion for multivariate functions is $f(x + a) = f(x) + a \cdot \nabla_x f(x) + \ldots$). The $O(\epsilon^2)$ term arises from the Taylor series expansion, representing higher order terms. Equation (2) can be interpreted as an ensemble sum of subnetworks.

Below we show that the second and higher order terms are negligible, that is, the first order Taylor series expansion is almost exact, when ReLU activations are used. The second order perturbation terms all contain the Hessian $F''(x)$. But after the network is trained, the only non-linear function in $F$, the ReLU, is non-linear only at the origin (batch normalization layers are non-linear during training, due to the scaling by the sample variance). Therefore all second order terms vanish almost exactly. The same argument applies to higher orders.

**Theorem 1 (Binomial path length).** Let the set of indices $\sigma_c = \{c(1), c(2), \ldots, c(k)\}$, $k \le K$, be obtained by choosing any subset of $S_K = \{1, 2, \ldots, K\}$ and ordering its elements such that $c(k) > c(k-1) > \cdots > c(1)$. The output across K residual units with ReLU non-linearity is

$$x_{K+1} = x_1 + \sum_{\sigma_c \in \mathcal{P}(S_K)\setminus\{\varnothing\}} \left[\prod_{i=2}^{|\sigma_c|} F'_{c(i)}(x_1, W^*_{c(i)})\right] F_{c(1)}(x_1, W^*_{c(1)}), \qquad (3)$$

where the sum is over all subsets $\sigma_c$ and $\mathcal{P}(S_K)$ denotes the power set of $S_K$. We have omitted the $O(\epsilon^2)$ term because the first order approximation is almost exact when ReLU is used, as discussed above. The right hand side of Equation (3) is interpreted as the sum over subnetworks, or paths, in the sense of Veit et al. (2016). The identity path corresponding to $\sigma_c = \varnothing$ gives $x_1$ in the first term. If there is only one element in $\sigma_c$, so that its cardinality is $|\sigma_c| = 1$, the product in brackets is absent and only terms proportional to $F_{c(1)}$ appear in the sum, where $c(1) \in \{1, \ldots, K\}$. We provide the proof of Equation (3) in Appendix A. Solely for simplicity of presentation, the equation can be reduced by setting all weights to be the same, $F_{c(i)} = F$ and $W^*_{c(i)} = W^*$ for all $i$:

$$x_{K+1} = x_1 + \sum_{k=1}^{K} \binom{K}{k} \left(F'(x_1, W^*)\right)^{k-1} F(x_1, W^*). \qquad (4)$$

The binomial coefficients appear because the number of subsets of $S_K$ with cardinality $k$ is $\binom{K}{k}$. Note that the implementations of our proposed method (described in Section 3) do not use this simplification.

**Exponential gradient scaling.** Similarly, one observes that the gradient is the sum of contributions from all subnetworks, including the identity network. The magnitudes of subnetwork gradients for a 110-layer ResNet have been measured by Veit et al. (2016). If one takes $F$ to have ReLU non-linearity, then $F''(x, W^*) = 0$ except at the origin. The non-trivial gradient contribution from subnetworks of depth $k$ can then be expressed almost exactly as $\binom{K}{k}\,(F'(x, W^*))^k$. This validates the numerical results that the gradient norm decreases exponentially with subnetwork depth, as reported in Veit et al. (2016). Their experimental results indicate that the average gradient norm for each subnetwork of depth $k$ is given by $\lVert F'(x, W^*)\rVert^{k}$.
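To make the first order expansion concrete, the short sketch below (our own illustration, with toy affine-ReLU branches and a finite-difference Jacobian-vector product) compares the exact output of two stacked residual units with the right hand side of Equation (2); nothing in it belongs to the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16          # feature dimensionality (illustrative)
scale = 0.1     # keeps F a small perturbation of the identity

# Toy residual branches F_i(x) = ReLU(A_i x + b_i), standing in for the conv-BN-ReLU blocks.
A1, b1 = scale * rng.standard_normal((D, D)), scale * rng.standard_normal(D)
A2, b2 = scale * rng.standard_normal((D, D)), scale * rng.standard_normal(D)
F1 = lambda x: np.maximum(A1 @ x + b1, 0.0)
F2 = lambda x: np.maximum(A2 @ x + b2, 0.0)

def jvp(f, x, v, eps=1e-6):
    """Finite-difference Jacobian-vector product f'(x) v."""
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

x1 = rng.standard_normal(D)

# Exact output of two stacked residual units: x2 = x1 + F1(x1), x3 = x2 + F2(x2).
x2 = x1 + F1(x1)
x3_exact = x2 + F2(x2)

# First order expansion of Equation (2): x3 ~ x1 + F1(x1) + F2(x1) + F2'(x1) F1(x1).
x3_first_order = x1 + F1(x1) + F2(x1) + jvp(F2, x1, F1(x1))

rel_err = np.linalg.norm(x3_exact - x3_first_order) / np.linalg.norm(x3_exact)
print(f"relative error of the first order expansion: {rel_err:.2e}")
```

With ReLU branches the error vanishes exactly whenever the perturbation $F_1(x_1)$ does not flip the sign of any pre-activation inside $F_2$, which is the "non-linear only at the origin" argument above.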
All aforementioned properties apply only after the ResNets are trained. However, if an approximation is made in the network, it should still give similar results after training. We show in the following sections that our network attains performance similar to the original ResNet, validating our approximation.

3 WARPED RESIDUAL NETWORK

The Warped Residual Network (WarpNet) is an approximation to the residual network in which K consecutive residual units are compressed into one warp layer. The computation in a warp layer is different from that in a conventional neural network: it uses a warp operator to compute the output of the layer (i.e., $x_{K+1}$) directly from the input (i.e., $x_1$), as shown in Equation (4). The number of weights in a warp layer is the same as that of K consecutive residual units in the original residual network; for instance, the weights $W_1, W_2, \ldots, W_K$ are all present in a warp layer. But these weights can be used and updated in parallel due to the use of the warp operator. Below we describe the forward and backward propagation rules used in the Warped Residual Network.

3.1 FORWARD PROPAGATION ACROSS WARP OPERATORS

This section shows the propagation rules of the Warped Residual Network using the warp operator $T_{\mathrm{warp}}$. The expression for $T_{\mathrm{warp}}$ is derived from Equation (3), that is, by using the Taylor series expansion to first order:

$$x_{K+1} = T^{(K)}_{\mathrm{warp}}(x_1, W_1, \ldots, W_K) = x_1 + \sum_{\sigma_c \in \mathcal{P}(S_K)\setminus\{\varnothing\}} \left[\prod_{i=2}^{|\sigma_c|} F'_{c(i)}(x_1, W_{c(i)})\right] F_{c(1)}(x_1, W_{c(1)}).$$

Note that $T_{\mathrm{warp}}$ can be calculated in a parallelizable manner for all K. This is shown in Figure 1 for K = 2, where (dropping the negligible $O(\epsilon^2)$ terms from Equation (2))

$$x_3 = T^{K=2}_{\mathrm{warp}}(x_1, W_1, W_2) = x_1 + F_1(x_1, W_1) + F_2(x_1, W_2) + F'_2(x_1, W_2)\,F_1(x_1, W_1), \qquad (5)$$

and $W_i$ corresponds to the weights of the $i$-th residual unit in the original ResNet. The formula for the K = 3 case is shown in Appendix A.

3.2 WARPED BACK-PROPAGATION

Now we derive the backpropagation rules. Suppose that the upstream gradient $\partial L/\partial x_5$ is known and we wish to compute $\partial L/\partial W_1$ for gradient descent. We first backpropagate the gradient from $x_5$ down to $x_3$. With $x_5 = T_{\mathrm{warp}}(x_3, W_3, W_4)$, the backpropagated gradient is

$$\frac{\partial L}{\partial x_3} = \frac{\partial L}{\partial x_5}\,\frac{\partial x_5}{\partial x_3} = \frac{\partial L}{\partial x_5}\left[I + \frac{\partial F_3(x_3, W_3)}{\partial x_3} + \frac{\partial F_4(x_3, W_4)}{\partial x_3} + \frac{\partial F_4(x_3, W_4)}{\partial x_3}\,\frac{\partial F_3(x_3, W_3)}{\partial x_3}\right],$$

where $I$ is the identity matrix and we have set the derivative of $F'_4$ to zero for ReLU non-linearities. Note that we have removed all BN layers from $F'_4$ in our implementation. One sees that the same kind of parallelism in the warp operator is also present in backpropagation. Now we can evaluate the weight gradient for the updates,

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial x_3}\left[\frac{\partial F_1(x_1, W_1)}{\partial W_1} + \frac{\partial F_2(x_1, W_2)}{\partial x_1}\,\frac{\partial F_1(x_1, W_1)}{\partial W_1}\right].$$

Similarly, the update rule for $W_2$ is

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial x_3}\left[\frac{\partial F_2(x_1, W_2)}{\partial W_2} + \frac{\partial^2 F_2(x_1, W_2)}{\partial x_1\,\partial W_2}\,F_1(x_1, W_1)\right],$$

and rules for all other weights in WarpNet can be obtained in the same way. The weights $W_1$ and $W_2$ can be updated in parallel and independently. The derivative $\partial F_2(x_1, W_2)/\partial x_1$ (in $\partial L/\partial W_1$) is already computed in the forward pass, so it could be saved and reused. Furthermore, derivatives other than $F'_3$ needed in $\partial L/\partial x_3$ can also be computed in the forward pass. For higher warp factors K, only the derivative $F'_{K+1}$ is not available after the forward pass.
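To make the parallelism concrete, the sketch below (our own illustration; a thread pool stands in for the per-GPU assignment used later in the experiments, and the Jacobian-vector product is taken by finite differences rather than the framework's autodiff) evaluates the three branches of the K = 2 warp operator of Equation (5) concurrently, since each branch depends only on $x_1$.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
D = 16
A1, A2 = 0.1 * rng.standard_normal((2, D, D))
F1 = lambda x: np.maximum(A1 @ x, 0.0)   # branch of residual unit 1 (toy stand-in)
F2 = lambda x: np.maximum(A2 @ x, 0.0)   # branch of residual unit 2 (toy stand-in)

def jvp(f, x, v, eps=1e-6):
    # Finite-difference Jacobian-vector product f'(x) v (stand-in for forward-mode autodiff).
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

def warp_k2(x1):
    """Equation (5): x3 = x1 + F1(x1) + F2(x1) + F2'(x1) F1(x1), branches dispatched concurrently."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        b1 = pool.submit(F1, x1)                         # worker 1: F1(x1)
        b2 = pool.submit(F2, x1)                         # worker 2: F2(x1)
        b3 = pool.submit(lambda: jvp(F2, x1, F1(x1)))    # worker 3: F2'(x1) F1(x1), recomputing F1 locally
        return x1 + b1.result() + b2.result() + b3.result()

x1 = rng.standard_normal(D)
print(warp_k2(x1)[:4])
```

The third worker recomputes $F_1(x_1)$ on its own, which mirrors the GPU assignment $[F_1], [F_2], [F'_2 F_1]$ reported in the experiments below.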
Figure 1: Forward propagation in WarpNet. Left: the original residual network M, where each block corresponds to a residual unit. Middle: the corresponding Warped Network, with indices determined by the architecture of M. Right: the parallelism allowed in the forward pass of WarpNet. Note that $F_2$ shares the same weights with $F'_2$, as they come from the same residual unit in the original network M.

4 EXPERIMENTS

In this section we discuss our implementation of the WarpNet architecture and the experimental results. To ensure the validity of the series expansion, we replace the 1×1 convolution layers on skip connections by an average pooling layer and a concatenation layer before the residual unit, which reduce the spatial dimensions of the feature maps and multiply their channels. In this way all skip connections are identity mappings. We adopt a wide residual architecture (WRN) (Zagoruyko & Komodakis, 2016). The convolution blocks $F$ comprise the following layers, from input to output: BN-Conv-BN-ReLU-Conv-BN (Han et al., 2016). The architecture of the Warped Residual Network is shown in Table 1. The layers $[T^{(K)}_{\mathrm{warp}}] \times N_{\mathrm{warp}}$ represent forward propagating $N_{\mathrm{warp}}$ times, such that $x_{i+K} = T^{(K)}_{\mathrm{warp}}(x_i, W_i, W_{i+1}, \ldots, W_{i+K-1})$, where the indices $i$ correspond to the indices in the original residual network.

Using TensorFlow, we implemented WarpNet with various parameters $k_w$, K and $N_{\mathrm{warp}}$, where $k_w$ is the widening factor (Zagoruyko & Komodakis, 2016) and K is the warp factor, following the scheme shown in Figure 1. We employ TensorFlow's automatic differentiation for backpropagation, where the gradients are calculated by sweeping through the network with the chain rule. Although the gradients computed in the forward pass could be re-used in the backward pass, we do not do so in our experiments and leave this potential further speed-up to future work. Even so, the experimental results indicate that WarpNet can be trained faster than WRN with comparable predictive accuracy.

For the case K = 2, we found that the computation bottleneck arises from the BN layers in $F'_2$, because the gradient of a BN layer contains an averaging operation that is expensive to compute. In our final implementation we removed all BN layers in $F'_2$ from our network. This is a departure from our series approximation, but it turns out the network still trains well, because the normalization layers are still being trained in $F_1$ and $F_2$. To further improve the speed-up, we replace the $F_1$ block in the derivative term $F'_2 F_1$ with the input $x_1$, so that the term becomes $F'_2 x_1$. Similar approximations are made in the cases where K > 2. We have conducted extensive experiments with this modification and found that it yields similar predictive accuracy while improving the speed-up. In the following, we refer to this modification of WarpNet as WarpNet1 and to the one with $F'_2 F_1$ as WarpNet2. For K = 3 we replace all $F'_j F_i$ by $F'_j x_1$ in WarpNet1. We also drop the term $F'_3 F'_2 F_1$ in computing $x_4$ in both versions of WarpNet, due to the limited number of GPUs available in our experiments.
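The two variants used in the experiments can be summarized as follows (our own sketch; `jvp` is a finite-difference stand-in for the framework's forward-mode derivative, and the toy ReLU branches replace the actual conv-BN blocks):

```python
import numpy as np

def jvp(f, x, v, eps=1e-6):
    # Finite-difference Jacobian-vector product f'(x) v (stand-in for autodiff).
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

def warpnet2_k2(x1, F1, F2):
    # WarpNet2: the series form of Equation (5), x1 + F1(x1) + F2(x1) + F2'(x1) F1(x1).
    return x1 + F1(x1) + F2(x1) + jvp(F2, x1, F1(x1))

def warpnet1_k2(x1, F1, F2):
    # WarpNet1: the faster modification where the derivative branch acts on x1 itself.
    return x1 + F1(x1) + F2(x1) + jvp(F2, x1, x1)

# Toy comparison with small random ReLU branches in place of the conv-BN blocks.
rng = np.random.default_rng(3)
D = 16
A1, A2 = 0.1 * rng.standard_normal((2, D, D))
F1 = lambda x: np.maximum(A1 @ x, 0.0)
F2 = lambda x: np.maximum(A2 @ x, 0.0)
x1 = rng.standard_normal(D)
print(warpnet1_k2(x1, F1, F2)[:4])
print(warpnet2_k2(x1, F1, F2)[:4])
```

As noted above, WarpNet1 departs from the strict first order expansion, but the experiments below indicate it usually matches or improves accuracy while making the derivative branch cheaper to compute.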
To investigate the speed-up provided by WarpNet and its predictive performance under various approximations of the warp operator, we define the relative speed-up RS compared to the corresponding wide residual network (WRN) as

$$RS = \frac{t_{\mathrm{res}}}{t_{\mathrm{warp}}},$$

where $t_{\mathrm{warp}}$ is the total time to process a batch for WarpNet during training, and $t_{\mathrm{res}}$ is that for the baseline WRN.

Table 1: Network architecture of WarpNet.

| H × W | C | Conv blocks |
|---|---|---|
| 32 × 32 | 3 | Input |
| 32 × 32 | 16 | Conv-BN-ReLU |
| 32 × 32 | 16$k_w$ | Stage 1: Concatenate, $[T^{(K)}_{\mathrm{warp}}] \times N_{\mathrm{warp}}$ |
| 16 × 16 | 32$k_w$ | Stage 2: Avg pool (2×2)-Concatenate, $[T^{(K)}_{\mathrm{warp}}] \times N_{\mathrm{warp}}$ |
| 8 × 8 | 64$k_w$ | Stage 3: Avg pool (2×2)-Concatenate, $[T^{(K)}_{\mathrm{warp}}] \times N_{\mathrm{warp}}$ |
| 1 × 1 | 64$k_w$ | BN-ReLU-Average pooling (8×8), Fully connected (# of classes) |

4.1 RESULTS ON CIFAR-10 AND CIFAR-100

For the CIFAR-10 and CIFAR-100 data sets, we trained for 80000 iterations, or 204 epochs, with a training batch size of 128. The initial learning rate is 0.1 and drops by a factor of 0.1 at epochs 60, 120, and 160, with a weight decay of 0.0005. We use common data augmentation techniques, namely whitening, flipping and cropping. We study the performance of WarpNet with K = 2 and K = 3. The results, averaged over two runs each, are shown in Tables 2 and 3. The first column of the tables lists the methods, including the two versions of WarpNet with the different modifications we have made to the warp operator. A wide ResNet (WRN) is obtained by replacing $[T^{(K)}_{\mathrm{warp}}] \times N_{\mathrm{warp}}$ with $K \times N_{\mathrm{warp}}$ residual units in Table 1. The total number of convolution layers (represented by n in WRN-n-$k_w$) is $6KN_{\mathrm{warp}} + 1$, where the factor of 6 arises from two convolution layers in each residual unit and 3 stages in the network, plus 1 convolution layer at the beginning. The number of layers in WRN is always odd, as we do not use the 1×1 convolution layer across stages.

We see that in most cases WarpNet achieves similar, if not better, validation errors than the corresponding wide ResNet while offering a speed-up. The experiments also show that the modification replacing $F'F$ by $F'x_1$, where $x_1$ is the input of the warp operator, achieves better accuracy most of the time while improving the speed-up. We observe that increasing from K = 2 to K = 3, using only one more GPU, significantly improves the speed-up with only a slight drop in validation accuracy compared to the K = 2 case. We have also performed experiments on the speed-up as the widening factor $k_w$ increases, and found that the speed-up increases as WarpNet gets wider. For $k_w$ = 4, 8 and 16, the speed-up in total time for K = 2 is 35%, 40% and 42%, respectively. The speed-up also increases with the warp factor K: for K = 3 using the $F'x$ modification, the speed-ups are 44%, 48% and 50%, respectively.

Table 2: Validation error and relative speed-up (in parentheses) of WarpNet with K = 2. Brackets [ ] under GPU assignment indicate that a GPU is used to compute (and store the weights needed by) the enclosed operation. 3 GPUs are used in this experiment. The corresponding WRN with the same number of parameters is listed in the last row of each block.

| Data set, $N_{\mathrm{warp}}$ | Network | GPU assignment | $k_w$ = 4 | $k_w$ = 6 |
|---|---|---|---|---|
| CIFAR-10, $N_{\mathrm{warp}}$ = 3 | WarpNet1-37-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 x_1]$ | 5.18 (33%) | 4.79 (36%) |
| | WarpNet2-37-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 F_1]$ | 5.15 (23%) | 4.92 (26%) |
| | WRN-37-$k_w$ | $[F_1, F_2]$ | 4.97 (0%) | 5.01 (0%) |
| CIFAR-10, $N_{\mathrm{warp}}$ = 6 | WarpNet1-73-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 x_1]$ | 4.51 (34%) | 4.76 (36%) |
| | WarpNet2-73-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 F_1]$ | 4.86 (24%) | 4.91 (26%) |
| | WRN-73-$k_w$ | $[F_1, F_2]$ | 4.80 (0%) | 4.71 (0%) |
| CIFAR-100, $N_{\mathrm{warp}}$ = 3 | WarpNet1-37-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 x_1]$ | 22.4 (33%) | 21.6 (36%) |
| | WarpNet2-37-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 F_1]$ | 22.8 (23%) | 22.0 (26%) |
| | WRN-37-$k_w$ | $[F_1, F_2]$ | 22.8 (0%) | 21.6 (0%) |
| CIFAR-100, $N_{\mathrm{warp}}$ = 6 | WarpNet1-73-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 x_1]$ | 21.4 (34%) | 21.1 (36%) |
| | WarpNet2-73-$k_w$ | $[F_1]$, $[F_2]$, $[F'_2 F_1]$ | 21.7 (24%) | 21.4 (26%) |
| | WRN-73-$k_w$ | $[F_1, F_2]$ | 21.8 (0%) | 21.9 (0%) |

Table 3: Validation error and relative speed-up (in parentheses) of WarpNet with K = 3. Brackets [ ] under GPU assignment indicate that a GPU is used to compute (and store the weights needed by) the enclosed operation. 4 GPUs are used in this experiment. The corresponding WRN with the same number of parameters is listed in the last row of each block.

| Data set, $N_{\mathrm{warp}}$ | Network | GPU assignment | $k_w$ = 4 | $k_w$ = 6 |
|---|---|---|---|---|
| CIFAR-10, $N_{\mathrm{warp}}$ = 2 | WarpNet1-37-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 x_1, F'_3 x_1]$ | 5.39 (43%) | 5.35 (47%) |
| | WarpNet2-37-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 F_1, F'_3 F_1, F'_3 F_2]$ | 5.52 (34%) | 5.22 (33%) |
| | WRN-37-$k_w$ | $[F_1, F_2, F_3]$ | 4.97 (0%) | 5.01 (0%) |
| CIFAR-10, $N_{\mathrm{warp}}$ = 4 | WarpNet1-73-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 x_1, F'_3 x_1]$ | 4.66 (45%) | 4.64 (46%) |
| | WarpNet2-73-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 F_1, F'_3 F_1, F'_3 F_2]$ | 4.81 (34%) | 4.77 (34%) |
| | WRN-73-$k_w$ | $[F_1, F_2, F_3]$ | 4.80 (0%) | 4.71 (0%) |
| CIFAR-100, $N_{\mathrm{warp}}$ = 2 | WarpNet1-37-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 x_1, F'_3 x_1]$ | 22.8 (43%) | 22.2 (47%) |
| | WarpNet2-37-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 F_1, F'_3 F_1, F'_3 F_2]$ | 23.0 (34%) | 23.1 (33%) |
| | WRN-37-$k_w$ | $[F_1, F_2, F_3]$ | 22.8 (0%) | 21.6 (0%) |
| CIFAR-100, $N_{\mathrm{warp}}$ = 4 | WarpNet1-73-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 x_1, F'_3 x_1]$ | 22.0 (45%) | 21.4 (46%) |
| | WarpNet2-73-$k_w$ | $[F_1]$, $[F_2]$, $[F_3]$, $[F'_2 F_1, F'_3 F_1, F'_3 F_2]$ | 21.9 (34%) | 21.2 (35%) |
| | WRN-73-$k_w$ | $[F_1, F_2, F_3]$ | 21.8 (0%) | 21.9 (0%) |
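As a quick consistency check on the WRN-n-$k_w$ naming used in the tables above (a worked example of ours, not from the paper), the depth formula $n = 6KN_{\mathrm{warp}} + 1$ reproduces the network labels:

```python
# Depth formula from Section 4.1: n = 6 * K * N_warp + 1 convolution layers.
def n_layers(K, N_warp):
    return 6 * K * N_warp + 1

for K, N_warp in [(2, 3), (2, 6), (3, 2), (3, 4), (3, 6)]:
    print(f"K = {K}, N_warp = {N_warp} -> n = {n_layers(K, N_warp)}")
# Output: 37, 73, 37, 73, 109, matching the WarpNet-37, WarpNet-73 and WarpNet-109 labels.
```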
4.2 RESULTS ON IMAGENET

We also tested WarpNet on a down-sampled (32×32) ImageNet data set (Chrabaszcz & Hutter, 2017). The data set contains 1000 classes, with 1281167 training images and 50000 validation images (50 images per class). The training batch size is 512, and the initial learning rate is 0.4 and drops by a factor of 0.1 every 30 epochs. The weight decay is set to 0.0001. We use the overall best performing warp operator from the CIFAR experiments, namely the one containing $F'x$. The results are shown in Table 4 and Figure 2.

First, we show directly that for a given ResNet there exists a WarpNet that obtains a higher validation accuracy with a shorter training time. We increase K from 2 to 3 and keep everything else fixed; this corresponds to WarpNet-109-2, which has more residual units than WRN-73-2. We observed that WarpNet-109-2 trains 12% faster than WRN-73-2 while achieving a better validation accuracy. Second, WarpNet can achieve close to the benchmark validation error of 18.9% obtained with WRN-28-10 in (Chrabaszcz & Hutter, 2017). Note that we were not able to train the corresponding WRN-73-4 on this data set, as the model requires too much memory on a single GPU. This shows that the weight distribution of WarpNet across GPUs allows a bigger network to be trained. Remarkably, the validation error curves for WRN-73-2 and its approximation WarpNet-73-2 (K = 2, $N_{\mathrm{warp}}$ = 6) lie almost exactly on top of each other. This suggests that our implementation of WarpNet is a good approximation of the corresponding WRN throughout training.

Table 4: Top-5 validation errors (%) on ImageNet32×32 and, in parentheses, the relative speed-up compared to WRN-73-2, for $k_w$ = 2. The warp operator containing $F'x$ is used for WarpNet.

| Network | # of parameters | ImageNet32×32 |
|---|---|---|
| WRN-73-2 | 4.8M | 23.0 (0%) |
| WarpNet1-73-2 (K = 2, $N_{\mathrm{warp}}$ = 6) | 4.8M | 23.0 (27%) |
| WarpNet1-73-2 (K = 3, $N_{\mathrm{warp}}$ = 4) | 4.8M | 23.5 (41%) |
| WarpNet1-109-2 (K = 3, $N_{\mathrm{warp}}$ = 6) | 7.1M | 22.1 (12%) |
| WarpNet1-73-4 (K = 2, $N_{\mathrm{warp}}$ = 6) | 18.9M | 19.5 |
| WRN-28-10 (Chrabaszcz & Hutter, 2017) | 37.1M | 18.9 |

Figure 2: Top-5 validation error on the down-sampled ImageNet data set for $k_w$ = 2.

4.3 COMPARISON WITH DATA PARALLELISM

WarpNet offers model parallelism for ResNet learning, in which different sets of weights are learned in parallel.
In comparison, a popular way to parallelize deep learning is to split the batch in each training iteration into subsets and let a different GPU compute gradients for all weights based on a different subset; synchronization can then be done, e.g., by averaging the gradients from all GPUs and updating the weights based on the average. We refer to such methods as data parallelism methods. Below we compare WarpNet with a data parallelism method on 2 or 4 GPUs on CIFAR-10, for which we divide each batch into 2 or 4 mini-batches, respectively; synchronization is done right after all GPUs finish their job on their mini-batch, to avoid harming the accuracy. Table 5 shows the average result over 2 runs for each method. All methods see the same volume of data during training, which means that the number of epochs is the same for all methods. We chose the warp operators containing $F'x$ in this experiment, that is, WarpNet1, whose operations are specified in the first rows of each block in Tables 2 and 3. We use the GPU assignment $[F_1]$, $[F_2, F'_2 x_1]$, $[F_3, F'_3 x_1]$ for the case with 3 GPUs.

The results show that WarpNet is more accurate than data parallelism in both the 2-GPU and 4-GPU cases. When 3 or 4 GPUs are used, WarpNet is also much faster than data-parallelized ResNet with 4 GPUs. We believe this is because the data parallelism method needs to store all the weights of the model on every GPU, and its speed is limited by the need to update all the weights across all GPUs at synchronization time. In comparison, WarpNet splits the weights among GPUs, and each GPU only maintains and updates a subset of the weights. Such weight distribution requires less GPU memory, which allows WarpNet to train larger networks. Furthermore, data parallelism can be applied to WarpNet as well to potentially speed it up further, which is a topic beyond the scope of this paper.

Table 5: Comparison with data parallelism.

| Network | # of GPUs | Validation error (%) | Relative speed-up |
|---|---|---|---|
| WRN-73-4 | 1 | 4.80 | 0% |
| WRN-73-4 | 2 | 4.86 | 29.6% |
| WRN-73-4 | 4 | 4.67 | 32.4% |
| WarpNet1-73-4, K = 2, $N_{\mathrm{warp}}$ = 6 | 2 | 4.51 | 23.0% |
| WarpNet1-73-4, K = 3, $N_{\mathrm{warp}}$ = 4 | 3 | 4.66 | 44.1% |
| WarpNet1-73-4, K = 3, $N_{\mathrm{warp}}$ = 4 | 4 | 4.66 | 44.6% |

5 CONCLUDING REMARKS

In this paper, we proposed the Warped Residual Network (WarpNet), which arises from the first order Taylor series expansion with ReLU non-linearity. We showed analytically that the first order expansion is sufficient to explain the ensemble behavior of residual networks (Veit et al., 2016). The structure of the Taylor series approximation allows WarpNet to train consecutive residual units in parallel while ensuring that its performance is similar to the corresponding ResNet. The weights of different residual units are distributed over the various GPUs, which enables the training of bigger networks than ResNets given limited GPU memory. Experimental results show that WarpNet can provide a significant speed-up over wide ResNets with similar, if not better, predictive accuracy.
We also show that WarpNet outperforms a data parallelism method on ResNet, achieving better predictive accuracy and a much better speed-up when more than 2 GPUs are used.

ACKNOWLEDGMENTS

We thank Wenxin Xu for providing his code for ResNet at https://github.com/wenxinxu/resnet_in_tensorflow.

REFERENCES

P. Chrabaszcz, I. Loshchilov, and F. Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. 2017. https://arxiv.org/abs/1707.08819.

R. Hahnloser et al. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947-951, 2000.

D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. 2016. https://arxiv.org/abs/1610.02915.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, 2015. https://arxiv.org/abs/1512.03385.

G. Huang et al. Deep networks with stochastic depth. CoRR, 2016. https://arxiv.org/abs/1603.09382.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 448-456, 2015.

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

N. Srivastava et al. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. 2015. https://arxiv.org/abs/1505.00387.

C. Szegedy et al. Going deeper with convolutions. CVPR, 2015.

A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. 2016. https://arxiv.org/abs/1605.06431.

S. Zagoruyko and N. Komodakis. Wide residual networks. 2016. https://arxiv.org/abs/1605.07146.

A BINOMIAL PATH LENGTH

In this appendix we explicitly work out the expressions for $x_3$ and $x_4$ using the Taylor expansion, and show that in the general case the number of paths of length $k$ corresponds to the binomial number of terms of power $k$ in $F$ and $F'$ in the first order Taylor expansion. Terms of order $O(\epsilon^2)$ are omitted throughout this section. The expression for $x_3$ is

$$x_3 = x_2 + F_2(x_2, W^*_2) = x_1 + F_1(x_1, W^*_1) + F_2(x_1 + F_1(x_1, W^*_1), W^*_2).$$

Taylor expanding the last term in powers of $F_1$ gives

$$x_3 = x_1 + F_1(x_1, W^*_1) + F_2(x_1, W^*_2) + \left.\frac{\partial F_2(x, W^*_2)}{\partial x}\right|_{x=x_1} F_1(x_1, W^*_1) = x_1 + F_1(x_1, W^*_1) + F_2(x_1, W^*_2) + F'_2(x_1, W^*_2)\,F_1(x_1, W^*_1), \qquad (7)$$

where in the last equality we simplified the notation for the partial derivative, with $\partial/\partial x = (\partial/\partial x_1, \ldots, \partial/\partial x_D)$ and $D$ the dimensionality of $x$. Counting the powers of $F$ and $F'$ reveals that there are (1, 2, 1) terms of power 0, 1 and 2, respectively. The same (1, 2, 1) coefficients can also be obtained by setting all weights to be the same, $W^*_i = W^*_j$ for all $i, j$. Similarly,

$$x_4 = x_3 + F_3(x_3, W^*_3) = x_2 + F_2(x_2, W^*_2) + F_3(x_2 + F_2(x_2, W^*_2), W^*_3) = x_2 + F_2(x_2, W^*_2) + F_3(x_2, W^*_3) + F'_3(x_2, W^*_3)\,F_2(x_2, W^*_2). \qquad (8)$$

This is similar to $x_3$ but with the indices on the right hand side increased by one.
One more iteration of the Taylor expansion gives $x_4$ in terms of $x_1$,

$$\begin{aligned} x_4 = {} & x_1 + F_1(x_1, W^*_1) + F_2(x_1, W^*_2) + F_3(x_1, W^*_3) \\ & + F'_2(x_1, W^*_2)\,F_1(x_1, W^*_1) + F'_3(x_1, W^*_3)\,F_1(x_1, W^*_1) + F'_3(x_1, W^*_3)\,F_2(x_1, W^*_2) \\ & + F'_3(x_1, W^*_3)\,F'_2(x_1, W^*_2)\,F_1(x_1, W^*_1), \end{aligned} \qquad (9)$$

where we have organized terms of the same power in $F$ and $F'$ into the same row. We also assume that ReLU is used, so that $F''_3 = 0$ almost exactly. We say that a term in the first order expansion has power $k$ if it is proportional to $(F')^{k-1}F$ (the identity term $x_1$ is assigned power 0). Then there are (1, 3, 3, 1) terms of power $k \in \{0, 1, 2, 3\}$, respectively. A pattern begins to emerge: the number of terms of each power satisfies $\binom{K}{k}$, where K is the number of units skipped over, i.e., K = 3 for expressing $x_4$ in terms of $x_1$ above.

Now we show that the number of terms in the first order expansion is the binomial coefficient for all $k$. We aim to derive a recursion relation between successive iterations of index reduction. We define an index reduction as an operation that reduces the index of the output $x_i$ by one. For instance, the residual unit formula $x_i = x_{i-1} + F_{i-1}$ is an index reduction, where the index is reduced from $i$ to $i-1$. Note that this operation generates a term of power 1, $F_{i-1}$, from a power 0 term $x_i$. The first order Taylor expansion generates a term of one additional power with a derivative,

$$F_i(x_i) = F_i(x_{i-1} + F_{i-1}(x_{i-1})) = F_i(x_{i-1}) + F'_i(x_{i-1})\,F_{i-1}(x_{i-1}),$$

where an index reduction is used in the first equality and the Taylor expansion in the second. The dependence of $F$ on the weights and the higher order corrections are omitted to avoid clutter. We see that the combination of an index reduction and a Taylor expansion generates terms of powers $k$ and $k+1$ with index $i-1$ from a term of power $k$ with index $i$. Let $C(K, k)$ be the number of terms of power $k$ after $K$ index reduction operations. For instance, K = 3 corresponds to expressing $x_4$ in terms of $x_1$ as in Equation (9), with $C(3, 1) = C(3, 2) = 3$ and $C(3, 0) = C(3, 3) = 1$.

A.1 FIRST PROOF OF THEOREM 1, RECURSION RELATIONS

We now relate the number of terms of power $k+1$ after $K+1$ index reductions to the counts after $K$ index reductions. Consider the terms corresponding to $K+1$ reductions with power $k+1$. There are two sources of such terms. First, those generated by an additional index reduction after $K$ operations together with the zeroth order Taylor expansion of terms of power $k+1$; there are $C(K, k+1)$ such terms. Second, those generated by the first order Taylor expansion of terms of power $k$; there are $C(K, k)$ such terms. Therefore the total number of terms with power $k+1$ after $K+1$ index reductions is

$$C(K+1, k+1) = C(K, k+1) + C(K, k).$$

This is precisely the recursion formula satisfied by the binomial coefficients. We have explicitly shown above that for K = 2 and K = 3 the coefficients are binomial coefficients. Therefore the number of terms at any K and power $k$ is the binomial coefficient, $C(K, k) = \binom{K}{k}$.

Note that the indices in $[\prod_{i=2}^{k} F'_{c(i)}]F_{c(1)}$ must be ordered such that $c(k) > c(k-1) > \cdots > c(1)$, where the indices $c(i)$ are given by any subset of the integers $S_K = \{1, 2, \ldots, K\}$. Of course, the number of unordered subsets of cardinality $k$ from a set of cardinality $K$ is $\binom{K}{k}$. To write down a term of power $k$ explicitly in the first order Taylor expansion, we first choose an unordered subset of $k$ indices from $S_K$ and then order the indices to form $\sigma_c = \{c(k), \ldots, c(1)\}$.
Then the output after K residual units with input $x_1$ is the sum over all these subsets,

$$x_{K+1} = \sum_{\sigma_c \in \mathcal{P}(S_K)} \left[\prod_{i=2}^{|\sigma_c|} F'_{c(i)}(x_1, W^*_{c(i)})\right] F_{c(1)}(x_1, W^*_{c(1)}), \qquad (10)$$

where $\mathcal{P}(S_K)$ denotes the power set of $S_K$. Note that when $\sigma_c$ is empty, the corresponding term on the right hand side is the identity mapping $x_1$. This is the same as Equation (3). Setting all weights to be the same gives the form in Equation (4).

A.2 SECOND PROOF OF THEOREM 1, BERNOULLI PROCESS

The series of index reduction operations can be identified with a Bernoulli process with parameters K and p = 0.5. Each term in Equation (3) arises from a realization of the Bernoulli process, and summing the terms from all possible realizations results in Equation (3). Recall that to express $x_{K+1}$ in terms of $x_1$ as in Equation (3), we need K index reduction operations. Let $X_{K:1} := \{X_K, X_{K-1}, \ldots, X_1\}$ be a Bernoulli process with $X_i \sim \mathrm{Bernoulli}(p = 0.5)$. The realization $X_i = 0$ represents the power of a term remaining the same after an index reduction, and $X_i = 1$ denotes an increase of the power of a term by one. For example, consider K = 2; the terms corresponding to the realizations of the Bernoulli process $X_{2:1} = \{X_2, X_1\}$ are

| Realization $\{X_2, X_1\}$ | Term |
|---|---|
| $\{0, 0\}$ | $x_1$ |
| $\{1, 0\}$ | $F_2$ |
| $\{0, 1\}$ | $F_1$ |
| $\{1, 1\}$ | $F'_2 F_1$ |

One sees that $x_3$ can be obtained by summing over the terms corresponding to all realizations of $X_{2:1}$. This generalizes to $X_{K:1}$ for $x_{K+1}$. The probability of a term having power $k$ is $2^{-K}\binom{K}{k}$. Since the total number of terms is $2^K$, the number of terms having power $k$ is the binomial coefficient $\binom{K}{k}$. If we let $\sigma_c$ be the term corresponding to a realization of $X_{K:1}$, then consecutive Taylor expansions correspond to summing over all $\sigma_c$, and Equation (3) follows.
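As a closing illustration (our own sketch, not from the paper), the following enumerates all subsets $\sigma_c \in \mathcal{P}(S_K)$ for a toy ReLU network, checks that the number of subsets of cardinality $k$ is $\binom{K}{k}$, and compares the ensemble sum of Equation (3) against the exact output of $K$ stacked residual units. Jacobian-vector products are taken by finite differences; all names and sizes are illustrative. The relative error is small, and would vanish exactly if no ReLU pre-activation changed sign along the chain.

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(4)
K, D, scale = 4, 8, 0.05
Ws = scale * rng.standard_normal((K, D, D))
Fs = [lambda x, W=W: np.maximum(W @ x, 0.0) for W in Ws]   # residual branches F_1..F_K

def jvp(f, x, v, eps=1e-6):
    # Finite-difference Jacobian-vector product f'(x) v.
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

x1 = rng.standard_normal(D)

# Exact forward propagation through K residual units.
x_exact = x1.copy()
for F in Fs:
    x_exact = x_exact + F(x_exact)

# Ensemble sum of Equation (3): one term per ordered subset sigma_c, first order in each F.
x_series = x1.copy()
for k in range(1, K + 1):
    subsets = list(combinations(range(K), k))          # c(1) < c(2) < ... < c(k)
    assert len(subsets) == comb(K, k)                  # binomial path count of Theorem 1
    for sigma in subsets:
        term = Fs[sigma[0]](x1)                        # F_{c(1)}(x1)
        for i in sigma[1:]:
            term = jvp(Fs[i], x1, term)                # left-multiply by F'_{c(i)}(x1)
        x_series = x_series + term

print("relative error of Equation (3):",
      np.linalg.norm(x_exact - x_series) / np.linalg.norm(x_exact))
```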