# Discovering Neural Wirings

Mitchell Wortsman (1,2), Ali Farhadi (1,2,3), Mohammad Rastegari (1,3)

(1) PRIOR @ Allen Institute for AI, (2) University of Washington, (3) XNOR.AI

mitchnw@cs.washington.edu, {ali, mohammad}@xnor.ai

The success of neural networks has driven a shift in focus from feature engineering to architecture engineering. However, successful networks today are constructed using a small and manually defined set of building blocks. Even in methods of neural architecture search (NAS) the network connectivity patterns are largely constrained. In this work we propose a method for discovering neural wirings. We relax the typical notion of layers and instead enable channels to form connections independent of each other. This allows for a much larger space of possible networks. The wiring of our network is not fixed during training: as we learn the network parameters we also learn the structure itself. Our experiments demonstrate that our learned connectivity outperforms hand engineered and randomly wired networks. By learning the connectivity of MobileNet V1 [12] we boost the ImageNet accuracy by 10% at 41M FLOPs. Moreover, we show that our method generalizes to recurrent and continuous time networks. Our work may also be regarded as unifying core aspects of the neural architecture search problem with sparse neural network learning. As NAS becomes more fine grained, finding a good architecture is akin to finding a sparse subnetwork of the complete graph. Accordingly, DNW provides an effective mechanism for discovering sparse subnetworks of predefined architectures in a single training run. Though we only ever use a small percentage of the weights during the forward pass, we still play the so-called initialization lottery [8] with a combinatorial number of subnetworks. Code and pretrained models are available at https://github.com/allenai/dnw while additional visualizations may be found at https://mitchellnw.github.io/blog/2019/dnw/.

## 1 Introduction

Deep neural networks have shifted the prevailing paradigm from feature engineering to feature learning. The architecture of deep neural networks, however, must still be hand designed in a process known as architecture engineering. A myriad of recent efforts attempt to automate the process of architecture design by searching among a set of smaller well-known building blocks [30, 34, 37, 19, 2, 20]. While methods of search range from reinforcement learning to gradient based approaches [34, 20], the space of possible connectivity patterns is still largely constrained. NAS methods explore wirings between predefined blocks, and [28] learns the recurrent structure of CNNs. We believe that more efficient solutions may arrive from searching the space of wirings at a more fine grained level, i.e. at the level of single channels.

In this work, we consider an unconstrained set of possible wirings by allowing channels to form connections independent of each other. This enables us to discover a wide variety of operations (e.g. depthwise separable convolutions [12], channel shuffle and split [36], and more). Formally, we treat the network as a large neural graph where each node processes a single channel.

One key challenge lies in searching the space of all possible wirings: the number of possible sub-graphs is combinatorial in nature. When considering thousands of nodes, traditional search methods are either prohibitive or offer only approximate solutions.
In this paper we introduce a simple and efficient algorithm for discovering neural wirings (DNW). Our method searches the space of all possible wirings with a simple modification of the backwards pass.

Figure 1: Dynamic Neural Graph: A 3-layer perceptron (left) can be expressed by a dynamic neural graph with 3 time steps (right).

Recent work in randomly wired neural networks [35] aims to explore the space of novel neural network wirings. Intriguingly, they show that constructing neural networks with random graph algorithms often outperforms a manually engineered architecture. However, these wirings are fixed at training.

Our method for discovering neural wirings is as follows. First, we consider the sole constraint that the total number of edges in the neural graph is fixed to be $k$. Initially we randomly assign a weight to each edge. We then choose the weighted edges with the highest magnitude and refer to the remaining edges as hallucinated. As we train, we modify the weights of all edges according to a specified update rule. Accordingly, a hallucinated edge may strengthen to the point that it replaces a real edge. We tailor the update rule so that when swapping does occur, it is beneficial.

We consider the application of DNW for static and dynamic neural graphs. In the static regime each node has a single output and the graphical structure is acyclic. In the case of a dynamic neural graph we allow the state of a node to vary with time. Dynamic neural graphs may contain cycles and express popular sequential models such as LSTMs [11]. As dynamic neural graphs are strictly more expressive than static neural graphs, they can also express feed-forward networks (as in Figure 1).

Our work may also be regarded as a unification between the problem of neural architecture search and sparse neural network learning. As NAS becomes less restrictive and more fine grained, finding a good architecture is akin to finding a sparse sub-network of the complete graph. Accordingly, DNW provides an effective mechanism for discovering sparse networks in a single training run.

The Lottery Ticket Hypothesis [8, 9] demonstrates that dense feed-forward neural networks contain so-called winning tickets. These winning tickets are sparse subnetworks which, when reset to their initialization and trained in isolation, reach an accuracy comparable to their dense counterparts. This hypothesis articulates an advantage of overparameterization during training: having more parameters increases the chance of winning the initialization lottery. We leverage this idea to train a sparse neural network without retraining or fine-tuning. Though we only ever use a small percentage of the weights during the forward pass, we still play the lottery with a combinatorial number of sub-networks.

We demonstrate the efficacy of DNW on small and large scale datasets, and for feed-forward, recurrent, continuous, and sparse networks. Notably, we augment MobileNet V1 [12] with DNW to achieve a 10% improvement on ImageNet [5] over the hand engineered MobileNet V1 at 41M FLOPs¹.

## 2 Discovering Neural Wirings

In this section we describe our method for jointly discovering the structure and learning the parameters of a neural network. We first consider the algorithm in a familiar setting, a feed-forward neural network, which we abstract as a static neural graph.
We then present a more expressive dynamic neural graph which extends to discrete and continuous time and generalizes feed-forward, recurrent, and continuous time neural networks.

¹We follow [36, 22] and define FLOPs as the number of multiply-adds.

Figure 2: An example of a dynamic (left) and static (right) neural graph (node operations include 3×3 conv2d with stride 1 and stride 2; the output is zero-padded). Details in Section 2.3.

### 2.1 Static Neural Graph

A static neural graph is a directed acyclic graph $G = (V, E)$ consisting of nodes $V$ and edges $E \subseteq V \times V$. The state of a node $v \in V$ is given by the random variable $Z_v$. At each node $v$ we apply a function $f_{\theta_v}$ and with each edge $(u, v)$ we associate a weight $w_{uv}$. In the case of a multi-layer perceptron, $f$ is simply a parameter-free non-linear activation like ReLU [17]. For any set $A \subseteq V$ we let $Z_A$ denote $(Z_v)_{v \in A}$, and so $Z_V$ is the state of all nodes in the network.

$V$ contains a subset of input nodes $V_0$ with no parents and output nodes $V_E$ with no children. The input data $X \sim p_x$ flows into the network through $V_0$ as $Z_{V_0} = g_\phi(X)$ for a function $g$ which may have parameters $\phi$. Similarly, the output of the network $\hat{Y}$ is given by $\hat{Y} = h_\psi(Z_{V_E})$. The state of each node may then be written as

$$Z_v = \begin{cases} f_{\theta_v}\!\left( \sum_{(u,v) \in E} w_{uv} Z_u \right) & v \notin V_0 \\ g_\phi(X)_v & v \in V_0. \end{cases} \qquad (1)$$

For brevity, we let $I_v$ denote the "input" to node $v$, where $I_v$ may be expressed as

$$I_v = \sum_{(u,v) \in E} w_{uv} Z_u. \qquad (2)$$

In this work we consider the case where the input and output of each node is a two-dimensional matrix, commonly referred to as a channel. Each node performs a non-linear activation followed by normalization and convolution (which may be strided to reduce the spatial resolution). As in [35], we no longer conform to the traditional notion of "layers" in a deep network.

The combination of a separate $3 \times 3$ convolution for each channel (depthwise convolution) followed by a $1 \times 1$ convolution (pointwise convolution) is often referred to as a depthwise separable convolution, and is essential in efficient network design [12, 22]. With a static neural graph this process may be interpreted equivalently as a $3 \times 3$ convolution at each node followed by information flow on a complete bipartite graph.

### 2.2 Discovering a k-Edge Neural Graph

We now outline our method for discovering the edges of a static neural graph subject to the constraint that the total number of edges must not exceed $k$. We consider a set of real edges $E$ and a set of hallucinated edges $E_{\mathrm{hal}} = (V \times V) \setminus E$. The real edge set is comprised of the $k$ edges which have the largest magnitude weights. As we allow the magnitude of the weights in both sets to change throughout training, the edges in $E_{\mathrm{hal}}$ may replace those in $E$.

Consider a hallucinated edge $(u, v) \notin E$. If the gradient is pushing $I_v$ in a direction which aligns with $Z_u$, then our update rule strengthens the magnitude of the weight $w_{uv}$. If this alignment happens consistently then $w_{uv}$ will eventually be strong enough to enter the real edge set $E$. As the total number of edges is conserved, when $(u, v)$ enters the edge set $E$ another edge is removed and placed in $E_{\mathrm{hal}}$.

Algorithm 1 DNW-Train($V$, $V_0$, $V_E$, $g_\phi$, $h_\psi$, $\{f_{\theta_v}\}_{v \in V}$, $p_{xy}$, $k$, $\mathcal{L}$)

1: for each pair of nodes $(u, v)$ such that $u < v$ do ▷ Initialize
2:  Initialize $w_{uv}$ by independently sampling from a uniform distribution.
3: for each training iteration do
4:  Sample a mini-batch of data and labels $(X, Y) = \{(X_i, Y_i)\}$ using $p_{xy}$ ▷ Sample data
5:  $E \leftarrow \{(u, v) : |w_{uv}| \geq \tau\}$ where $\tau$ is chosen so that $|E| = k$ ▷ Choose edges
6:  $Z_v = f_{\theta_v}\!\left( \sum_{(u,v) \in E} w_{uv} Z_u \right)$ for $v \notin V_0$ and $Z_v = g_\phi(X)_v$ for $v \in V_0$ ▷ Forward pass
7:  $\hat{Y} = h_\psi(\{Z_v\}_{v \in V_E})$ ▷ Compute output
8:  Update $\phi$, $\{\theta_v\}_{v \in V}$, $\psi$ via SGD & Backprop [26] using loss $\mathcal{L}$
9:  for each pair of nodes $(u, v)$ such that $u < v$ do ▷ Update edge weights
10:  $w_{uv} \leftarrow w_{uv} + \alpha \left\langle Z_u, -\partial \mathcal{L} / \partial I_v \right\rangle$

Recall that $I_v = \sum_{(u,v) \in E} w_{uv} Z_u$.
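As a concrete illustration of the edge choice in line 5 of Algorithm 1 and the update rule of line 10, below is a minimal PyTorch sketch under simplifying assumptions of our own: each node carries a scalar state rather than a channel, the graph is a single bipartite block, and the names `ChooseTopEdges` and `DNWLayer` are hypothetical rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class ChooseTopEdges(torch.autograd.Function):
    """Keep the k highest-magnitude edges on the forward pass. On the backward
    pass the gradient is passed straight through, so every edge (real or
    hallucinated) receives an update, while gradients only flow *through*
    the real edges that were actually used."""

    @staticmethod
    def forward(ctx, weight, k):
        threshold = weight.abs().flatten().topk(k).values[-1]  # k-th largest |w|
        mask = (weight.abs() >= threshold).to(weight.dtype)
        return weight * mask  # hallucinated edges are zeroed out

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: gradient reaches all edges


class DNWLayer(nn.Module):
    """A toy static neural graph between two blocks of scalar-state nodes."""

    def __init__(self, n_in, n_out, k):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_in, n_out).uniform_(-0.1, 0.1))
        self.k = k  # the edge budget |E| = k

    def forward(self, z):
        w = ChooseTopEdges.apply(self.weight, self.k)  # real edges only
        return z @ w  # I_v = sum over (u,v) in E of w_uv * Z_u
```

Training `weight` with plain SGD then reproduces line 10 of Algorithm 1: the gradient received by any edge $(u, v)$, real or hallucinated, is $\langle Z_u, \partial \mathcal{L} / \partial I_v \rangle$, so the SGD step is exactly $w_{uv} \leftarrow w_{uv} + \alpha \langle Z_u, -\partial \mathcal{L} / \partial I_v \rangle$.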
Figure 3: Gradient flow: On the forward pass we use only the real edges. On the backwards pass we allow the gradient to flow to, but not through, the hallucinated edges (as in Algorithm 1).

This procedure is detailed by Algorithm 1, where $V$ is the node set, $V_0$ and $V_E$ are the input and output node sets, $g_\phi$, $h_\psi$ and $\{f_{\theta_v}\}_{v \in V}$ are the input, output, and node functions, $p_{xy}$ is the data distribution, $k$ is the number of edges in the graph and $\mathcal{L}$ is the loss. In practice we may also include a momentum and weight decay² term in the weight update rule (line 10 in Algorithm 1). In fact, the weight update rule looks nearly identical to that of traditional SGD & Backprop but for one key difference: we allow the gradient to flow to edges which did not exist during the forward pass. Importantly, we do not allow the gradient to flow through these edges, and so the rest of the parameters update as in traditional SGD & Backprop. This gradient flow is illustrated in Figure 3.

²Weight decay [18] may in fact be very helpful for eliminating dead ends.

Under certain conditions we formally show that swapping an edge from $E_{\mathrm{hal}}$ to $E$ decreases the loss $\mathcal{L}$. We first consider the simple case where the hallucinated edge $(i, k)$ replaces $(j, k) \in E$. In Section C we discuss the proof of a more general case. We let $\tilde{w}$ denote the weight $w$ after the weight update rule $\tilde{w}_{uv} = w_{uv} + \alpha \langle Z_u, -\partial \mathcal{L} / \partial I_v \rangle$. We assume that $\alpha$ is small enough so that $\mathrm{sign}(\tilde{w}) = \mathrm{sign}(w)$.

Claim: Assume $\mathcal{L}$ is Lipschitz continuous. There exists a learning rate $\alpha^* > 0$ such that for $\alpha \in (0, \alpha^*)$ the process of swapping $(i, k)$ for $(j, k)$ will decrease the loss on the mini-batch when the states of the nodes are fixed and $|w_{ik}| < |w_{jk}|$ but $|\tilde{w}_{ik}| > |\tilde{w}_{jk}|$.

Proof. Let $A$ be the value of $I_k$ after the update rule if $(j, k)$ is replaced with $(i, k)$. Let $B$ be the state of $I_k$ after the update rule if we do not allow for swapping. $A$ and $B$ are then given by

$$A = \tilde{w}_{ik} Z_i + \sum_{(u,k) \in E,\ u \neq i,j} \tilde{w}_{uk} Z_u, \qquad B = \tilde{w}_{jk} Z_j + \sum_{(u,k) \in E,\ u \neq i,j} \tilde{w}_{uk} Z_u. \qquad (3)$$

Additionally, let $g = -\partial \mathcal{L} / \partial I_k$ be the direction in which the loss most steeply descends with respect to $I_k$. By Lemma 1 (Section D of the Appendix) it suffices to show that moving $I_k$ towards $A$ is more aligned with $g$ than moving $I_k$ towards $B$. Formally, we wish to show that

$$\langle A - I_k,\, g \rangle \geq \langle B - I_k,\, g \rangle \qquad (4)$$

which simplifies to

$$\tilde{w}_{ik} \langle Z_i, g \rangle \geq \tilde{w}_{jk} \langle Z_j, g \rangle \qquad (5)$$

$$\iff \tilde{w}_{ik} \left( \tilde{w}_{ik} - w_{ik} \right) \geq \tilde{w}_{jk} \left( \tilde{w}_{jk} - w_{jk} \right). \qquad (6)$$

In the case where $\tilde{w}_{ik}$ and $(\tilde{w}_{ik} - w_{ik})$ have the same sign but $\tilde{w}_{jk}$ and $(\tilde{w}_{jk} - w_{jk})$ have different signs, the inequality immediately holds. This corresponds to the case where $w_{ik}$ increases in magnitude but $w_{jk}$ decreases in magnitude. The opposite scenario ($w_{ik}$ decreases in magnitude but $w_{jk}$ increases) is impossible since $|w_{ik}| < |w_{jk}|$ but $|\tilde{w}_{ik}| > |\tilde{w}_{jk}|$.

We now consider the scenario where both sides of inequality (6) are positive. Simplifying further we obtain

$$\tilde{w}_{ik}^2 - \tilde{w}_{jk}^2 \geq -\left( \tilde{w}_{jk} w_{jk} - \tilde{w}_{ik} w_{ik} \right) \qquad (7)$$

and are now able to identify a range for $\alpha$ such that the inequality above is satisfied. By assumption the left hand side is greater than 0, and $\mathrm{sign}(\tilde{w}) = \mathrm{sign}(w)$ so $\tilde{w} w = |\tilde{w}||w|$. Accordingly, it suffices to show that the right hand side is at most 0, i.e.

$$|\tilde{w}_{jk}||w_{jk}| - |\tilde{w}_{ik}||w_{ik}| \geq 0. \qquad (8)$$

If we let $\epsilon = |w_{jk}| - |w_{ik}|$ and $\alpha^* = \sup\left\{ \alpha : |\tilde{w}_{ik}| \leq |\tilde{w}_{jk}| + \epsilon |\tilde{w}_{jk}| |w_{ik}|^{-1} \right\}$, then for $\alpha \in (0, \alpha^*)$

$$|\tilde{w}_{jk}||w_{jk}| - |\tilde{w}_{ik}||w_{ik}| \geq |\tilde{w}_{jk}| \left( |w_{ik}| + \epsilon \right) - \left( |\tilde{w}_{jk}| + \epsilon |\tilde{w}_{jk}| |w_{ik}|^{-1} \right) |w_{ik}| = 0$$

and the inequality (7) is satisfied. Here we are implicitly using our assumption that the gradient is bounded, as we may tune $\alpha$ to control the magnitude of $|\tilde{w}_{ik}| - |\tilde{w}_{jk}|$. In the case where $\alpha^* = \inf\{\alpha : |\tilde{w}_{ik}| > |\tilde{w}_{jk}|\}$ the right hand side of equation 7 becomes 0 while the left hand side is $> 0$. □
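As a concrete numerical check of the claim (our own example, not from the paper), take $w_{ik} = 0.1$ and $w_{jk} = 0.2$, and suppose the update yields $\tilde{w}_{ik} = 0.4$ and $\tilde{w}_{jk} = 0.25$, so that $|w_{ik}| < |w_{jk}|$ while $|\tilde{w}_{ik}| > |\tilde{w}_{jk}|$. Inequality (6) then reads

$$\tilde{w}_{ik}(\tilde{w}_{ik} - w_{ik}) = 0.4 \cdot 0.3 = 0.12 \;\geq\; 0.0125 = 0.25 \cdot 0.05 = \tilde{w}_{jk}(\tilde{w}_{jk} - w_{jk}),$$

so swapping $(i, k)$ in for $(j, k)$ moves $I_k$ further along the descent direction $g$, and the loss on the mini-batch decreases.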
In Section E of the Appendix we discuss the effect of $\theta_v$ on $w_{uv}$. In Section F of the Appendix, we show that the update rule is equivalently a straight-through estimator [1].

### 2.3 Dynamic Neural Graph

We now consider a more general setting where the state of each node $Z_v(t)$ may vary through time. We refer to this model as a dynamic neural graph. The initial conditions of a dynamic neural graph are given by

$$Z_v(0) = \begin{cases} g_\phi(X)_v & v \in V_0 \\ 0 & v \in V \setminus V_0 \end{cases} \qquad (10)$$

where $V_0$ is a designated set of input nodes, which may now have parents.

Discrete Time Dynamics: For a discrete time neural graph we consider times $t \in \{0, 1, \ldots, L\}$. The dynamics are then given by

$$Z_v(t + 1) = f_{\theta_v}\!\left( \sum_{(u,v) \in E} w_{uv} Z_u(t),\ t \right) \qquad (11)$$

and the network output is $\hat{Y} = h_\psi(Z_{V_E}(L))$. We may express equation 11 more succinctly as

$$Z_V(t + 1) = f_\theta\left( A_G Z_V(t),\ t \right) \qquad (12)$$

where $Z_V(t) = (Z_v(t))_{v \in V}$, $f_\theta(z, t) = (f_{\theta_v}(z_v, t))_{v \in V}$, and $A_G$ is the weighted adjacency matrix for graph $G$. Equation 12 suggests the following interpretation: at each time step we send information through the edges using $A_G$, then apply a function at each node.

Continuous Time Dynamics: As in [3], we consider the case where $t$ may take on a continuous range of values. We then arrive at dynamics given by

$$\frac{d Z_V(t)}{dt} = f_\theta\left( A_G Z_V(t),\ t \right). \qquad (13)$$

Interestingly, if $V_0$ is a strict subset of $V$ we uncover an Augmented Neural ODE [7].

The discrete time case is unifying in the sense that it may also express any static neural graph. In Figure 1 we illustrate that an MLP may also be expressed by a discrete time neural graph. Additionally, the discrete time dynamics are able to capture sequential models such as LSTMs [11], as long as we allow input to flow into $V_0$ at any time. In continuous time it is not immediately obvious how to incorporate strided convolutions. One approach is to keep the same spatial resolution throughout and pad with zeros after applying strided convolutions. This design is illustrated by Figure 2.

We may also apply Algorithm 1 to learn the structure of dynamic neural graphs (a minimal sketch is given below). One may use backpropagation through time [33] and the adjoint-sensitivity method [3] for optimization in the discrete and continuous time settings respectively. In Section 3.1, we demonstrate empirically that our method performs better than a random graph, though we do not formally justify the application of our algorithm in this setting.
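To make equation 12 concrete, here is a minimal PyTorch sketch of a discrete time neural graph that reuses the `ChooseTopEdges` function from the Section 2.2 sketch. It makes simplifying assumptions of our own: each node carries a scalar state rather than a channel, a `LayerNorm` plus ReLU stands in for the per-node operation, and `DiscreteTimeNeuralGraph` is a hypothetical name, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DiscreteTimeNeuralGraph(nn.Module):
    """Toy discrete time neural graph: Z_V(t+1) = f_theta(A_G Z_V(t), t)."""

    def __init__(self, n_nodes, k, steps):
        super().__init__()
        # A dense weighted adjacency matrix over all possible edges;
        # unlike the static case, cycles are allowed.
        self.adj = nn.Parameter(
            torch.empty(n_nodes, n_nodes).uniform_(-0.1, 0.1))
        self.k = k          # edge budget |E| = k
        self.steps = steps  # number of time steps L
        self.norm = nn.LayerNorm(n_nodes)  # stands in for the per-node op

    def forward(self, z):
        # z has shape (batch, n_nodes) and holds the initial state Z_V(0).
        for _ in range(self.steps):
            a = ChooseTopEdges.apply(self.adj, self.k)  # real edges only
            z = torch.relu(self.norm(z @ a))            # apply f at each node
        return z
```

Because the loop is unrolled, standard autograd performs backpropagation through time [33], and the straight-through edge choice again lets every entry of the adjacency matrix receive an update.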
### 2.4 Implementation Details for Large Scale Experiments

For large scale experiments we do not consider the dynamic case as optimization is too expensive. Accordingly, we now present our method for constructing a large and efficient static neural graph. With this model we may jointly learn the structure of the graph along with the parameters on ImageNet [5]. As illustrated by Table 5, our model closely follows the structure of MobileNet V1 [12], and so we refer to it as MobileNet V1-DNW. We consider a separate neural graph for each spatial resolution; the output of graph $G_i$ is the input of graph $G_{i+1}$. For width multiplier [12] $d$ and spatial resolution $s \times s$ we constrain MobileNet V1-DNW to have the same number of edges at resolution $s \times s$ as the corresponding MobileNet V1 (×d). We use a slightly smaller width multiplier to obtain a model with similar FLOPs as we do not explicitly reduce the number of depthwise convolutions in MobileNet V1-DNW. However, we do find that neurons often die (have no output) and we may then skip the depthwise convolution during inference. Note that if we interpret a pointwise convolution with $c_1$ input channels and $c_2$ output channels as a complete bipartite graph then the number of edges is simply $c_1 \cdot c_2$.

We also constrain the longest path in graph $G$ to be equivalent to the number of layers of the corresponding MobileNet V1. We do so by partitioning the nodes $V$ into blocks $B = \{B_0, \ldots, B_{L-1}\}$ where $B_0$ is the set of input nodes $V_0$, $B_{L-1}$ is the set of output nodes $V_E$, and we only allow edges between nodes in $B_i$ and $B_j$ if $i < j$. The longest path in a graph with $L$ blocks is then $L - 1$. Splitting the graph into blocks also improves efficiency as we may operate on one block at a time. The structure of MobileNet V1 may be recovered by considering a complete bipartite graph between adjacent blocks.

The operation $f_{\theta_v}$ at each non-output node is a batch norm [14] (2 parameters), ReLU [17], $3 \times 3$ convolution (9 parameters) triplet. There are no operations at the output nodes. When the spatial resolution decreases in MobileNet V1 we change the convolutional stride of the input nodes to 2. In models denoted MobileNet V1-DNW-Small (×d) we also limit the last fully connected (FC) layer to have the same number of edges as the FC layer of MobileNet V1 (×d). In the normal setting of MobileNet V1-DNW we do not modify the last FC layer.

## 3 Experiments

In this section we demonstrate the effectiveness of DNW for image classification in small and large scale settings. We begin by comparing our method with a random wiring on a small scale dataset and model. This allows us to experiment in static, discrete time, and continuous settings. Next we explore the use of DNW at scale with experiments on ImageNet [5] and compare DNW with other methods of discovering network structures. Finally we use our algorithm to effectively train sparse neural networks without retraining or fine-tuning.

Table 1: Testing a tiny (41k parameters) classifier on CIFAR-10 [16] in static and dynamic settings, shown as mean and standard deviation (std) over 5 runs.

| Model | Accuracy |
|---|---|
| Static (RG) | 76.1 ± 0.5% |
| Static (DNW) | 80.9 ± 0.6% |
| Discrete Time (RG) | 77.3 ± 0.7% |
| Discrete Time (DNW) | 82.3 ± 0.6% |
| Continuous (RG) | 78.5 ± 1.2% |
| Continuous (DNW) | 83.1 ± 0.3% |

Table 2: Other methods for discovering wirings (using the architecture described in Table 5) tested on CIFAR-10, shown as mean and std over 5 runs. Models marked * first require the complete graph to be trained.

| Model | Accuracy |
|---|---|
| MobileNet V1 (×0.25) | 86.3 ± 0.2% |
| MobileNet V1-RG (×0.225) | 87.2 ± 0.1% |
| No Update Rule | 86.7 ± 0.5% |
| L1 + Anneal* | 84.3 ± 0.6% |
| TD (0.95)* | 89.2 ± 0.4% |
| Lottery Ticket (one-shot)* | 87.9 ± 0.3% |
| Fine Tune (α = 0.1)* | 89.4 ± 0.2% |
| Fine Tune (α = 0.01)* | 89.7 ± 0.1% |
| Fine Tune (α = 0.001)* | 88.7 ± 0.2% |
| MobileNet V1-DNW (×0.225) | 89.7 ± 0.2% |

Throughout this section we let RG denote our primary baseline: a randomly wired graph. To construct a randomly wired graph with $k$ edges we assign a uniform random weight to each edge, then pick the $k$ edges with the largest magnitude weights. As shown in [35], random graphs often outperform manually designed networks.

### 3.1 Small Scale Experiments for Static and Dynamic Neural Graphs

We begin by training tiny classifiers for the CIFAR-10 dataset [16]. Our initial aim is not to achieve state of the art performance but instead to explore DNW in the static, discrete, and continuous time settings. As illustrated by Table 1, our method outperforms a random graph by a large margin.
The image is first downsampled³, then each channel is given as input to a node in the neural graph. The static graph uses 5 blocks and the discrete time graph uses 5 time steps. For the continuous case we backprop through the operation of an adaptive ODE solver⁴. The models have 41k parameters. At each node we perform Instance Normalization [32], ReLU, and a $3 \times 3$ single channel convolution.

³We use two $3 \times 3$ strided convolutions. The first is standard while the second is depthwise separable.

⁴We use a 5th order Runge-Kutta method [29] as implemented by [3] (from $t = 0$ to $1$ with tolerance 0.001).

### 3.2 ImageNet Classification

For large scale experiments on ImageNet [5] we are limited to exploring DNW in the static case (recurrent and continuous time networks are more expensive to optimize due to lack of parallelization). Although our network follows the simple structure of MobileNet V1 [12], we are able to achieve higher accuracy than modern networks which are more advanced and optimized. Notably, MobileNet V2 [27] extends MobileNet V1 by adding residual connections and linear bottlenecks, and ShuffleNet [36, 22] introduces channel splits and channel shuffles.

The results of the large scale experiments may be found in Table 3. As standard, we have divided the results of Table 3 to consider models which have similar FLOPs. In the more sparse case (∼41M FLOPs) we are able to use DNW to boost the performance of MobileNet V1 by 10%. Though random graphs perform extremely well, we still observe a 7% boost in performance. In each experiment we train for 250 epochs using cosine annealing as the learning rate scheduler with initial learning rate 0.1, as in [35]. Models using random graphs have considerably more FLOPs as nearly all depthwise convolutions must be performed. DNW allows neurons to die and we may therefore skip many operations.

Table 3: ImageNet experiments (see Section 2.4 for more details). Models marked * use the implementations of [22]. Models marked ** additionally use different image resolutions so that the FLOPs are comparable (see Table 8 in [22] for more details).

| Model | Params | FLOPs | Accuracy |
|---|---|---|---|
| MobileNet V1 (×0.25) [12] | 0.5M | 41M | 50.6% |
| X-4 MobileNet V1 [25] | – | > 50M | 54.0% |
| MobileNet V2 (×0.15)** [27] | – | 39M | 44.9% |
| MobileNet V2 (×0.4)** | – | 43M | 56.6% |
| DenseNet (×0.5)* [13] | – | 42M | 41.1% |
| Xception (×0.5)* [4] | – | 40M | 55.1% |
| ShuffleNet V1 (×0.5, g = 3) [36] | – | 38M | 56.8% |
| ShuffleNet V2 (×0.5) [22] | 1.4M | 41M | 60.3% |
| MobileNet V1-RG (×0.225) | 1.2M | 55.7M | 53.3% |
| MobileNet V1-DNW-Small (×0.15) | 0.24M | 22.1M | 50.3% |
| MobileNet V1-DNW-Small (×0.225) | 0.4M | 41.2M | 59.9% |
| MobileNet V1-DNW (×0.225) | 1.1M | 42.1M | 60.9% |
| MnasNet-search1 [30] | 1.9M | 65M | 64.9% |
| MobileNet V1-DNW (×0.3) | 1.3M | 66.7M | 65.0% |
| MobileNet V1 (×0.5) | 1.3M | 149M | 63.7% |
| MobileNet V2 (×0.6)** | – | 141M | 66.6% |
| MobileNet V2 (×0.75)** | – | 145M | 67.9% |
| DenseNet (×1)* | – | 142M | 54.8% |
| Xception (×1)* | – | 145M | 65.9% |
| ShuffleNet V1 (×1, g = 3) | – | 140M | 67.4% |
| ShuffleNet V2 (×1) | 2.3M | 146M | 69.4% |
| MobileNet V1-RG (×0.49) | 1.8M | 170M | 64.1% |
| MobileNet V1-DNW (×0.49) | 1.8M | 154M | 70.4% |

### 3.3 Related Methods

We compare DNW with various methods for discovering neural wirings. In Table 2 we use the structure of MobileNet V1-DNW but try other methods which find $k$-edge sub-networks. The experiments in Table 2 are conducted using CIFAR-10 [16]. We train for 160 epochs using cosine annealing as the learning rate scheduler with initial learning rate $\alpha = 0.1$ unless otherwise noted.
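As a concrete sketch of this training setup (our own illustration: the stand-in model, momentum, and weight decay values are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in; e.g., a network built from DNW layers
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)  # assumed values
epochs = 160  # 250 for the ImageNet experiments above
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one epoch of standard training with `optimizer` goes here ...
    scheduler.step()  # anneal the learning rate from 0.1 towards 0
```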
The Lottery Ticket Hypothesis: The authors of [8, 9] offer an intriguing hypothesis: sparse subnetworks may be trained in isolation when reset to their initialization. However, their method for finding so-called winning tickets is quite expensive as it requires training the full graph from scratch. We compare with one-shot pruning from [9]. One-shot pruning is more comparable in training FLOPs than iterative pruning [8], though both methods are more expensive in training FLOPs than DNW. After training the full network $G_{\mathrm{full}}$ (i.e. no edges pruned), the optimal sub-network $G_k$ with $k$ edges is chosen by taking the weights with the highest magnitude. In the row denoted Lottery Ticket we retrain $G_k$ using the initialization of $G_{\mathrm{full}}$. We also initialize $G_k$ with the weights of $G_{\mathrm{full}}$ after training, denoted Fine Tune (we try different initial learning rates $\alpha$). Though these experiments perform comparably with DNW, their training is more expensive as the full graph must initially be trained.

Exploring Randomly Wired Networks for Image Recognition: The authors of [35] explore a more diverse set of connectivity patterns through the lens of "randomly wired neural networks." They achieve impressive performance on ImageNet [5] using random graph algorithms to generate the structure of a neural network. Their network connectivity, however, is fixed during training. Throughout this section we use a random graph (denoted RG) as our primary baseline since, as in [35], we have seen that random graphs outperform hand-designed networks.

No Update Rule: In this ablation on DNW we do not apply the update rule to the hallucinated edges. An edge may only leave the hallucinated edge set if the magnitude of a real edge sufficiently decreases. This experiment demonstrates the importance of the update rule.

L1 + Anneal: We experiment with a simple pruning technique: start with a fully connected graph and remove edges by magnitude throughout training until only $k$ remain. We found that accuracy was much better if we added an L1 regularization term. A sketch of this baseline is given below.
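Below is a minimal sketch of this baseline under stated assumptions: the linear annealing schedule and the regularization strength `lam` are our own illustrative choices, and `l1_anneal_step` is a hypothetical helper rather than the exact procedure behind Table 2.

```python
import torch

def l1_anneal_step(weight, step, total_steps, k_final, lam=1e-5):
    """One training-step helper for the 'L1 + Anneal' baseline: shrink the
    edge budget towards k_final, prune by magnitude, and return an L1
    penalty to be added to the loss."""
    n = weight.numel()
    # Linearly anneal the number of kept edges (a hypothetical schedule).
    k = int(n - (n - k_final) * min(step / total_steps, 1.0))
    threshold = weight.abs().flatten().topk(k).values[-1]
    with torch.no_grad():
        mask = (weight.abs() >= threshold).to(weight.dtype)
        weight.mul_(mask)  # zero out edges below the current budget
    return lam * weight.abs().sum()  # L1 term pushes weights towards zero
```

The returned penalty is added to the task loss at each iteration; after the final step only `k_final` edges remain nonzero.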
Targeted Dropout: The authors of [10] present a simple and effective method for training a network which is robust to subsequent pruning. Their method outperforms variational dropout [23] and $L_0$ pruning [21]. We compare with Weight Dropout/Pruning from [10], which we denote as TD. Section B of the Appendix contains more information, experimental details, and hyperparameter trials for the Targeted Dropout experiments, though we provide the best result in Table 2.

Neural Architecture Search: As illustrated by Table 3, our network (with a very simple MobileNet V1 like structure) is able to achieve comparable accuracy to an expensive method which performs neural architecture search using reinforcement learning [30].

### 3.4 Training Sparse Neural Networks

We may apply our algorithm for discovering neural wirings to the task of training sparse neural networks. Importantly, our method requires no fine-tuning or retraining to discover a sparse subnetwork: the sparsity is maintained throughout training. This perspective was guided by the work of Dettmers and Zettlemoyer [6], though we would like to highlight some differences. Their work enables faster training, whereas our backwards pass is still dense. Moreover, their work allows for a redistribution of parameters across layers whereas we consider a fixed sparsity per layer.

Our algorithm for training a sparse neural network is similar to Algorithm 1, though we implicitly treat each convolution as a separate graph where each parameter is an edge. For each convolutional layer, on the forwards pass we use the top k% of the parameters chosen by magnitude. On the backwards pass we allow the gradient to flow to, but not through, all weights that were zeroed out on the forwards pass. All weights receive gradients as if they existed on the forwards pass, regardless of whether they were zeroed out. As in [6] we leave the biases and batch norm dense. A sketch of a sparse convolutional layer in this style is given below Table 4.

We compare with the results in Appendix C of [6], as we also use a tuned version of ResNet-50 that uses modern optimization techniques such as cosine learning rate scheduling and warmup⁵. We train for 100 epochs and showcase our results in Table 4.

⁵We adapt the code from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/RN50v1.5, using the exact same hyperparameters but training for 100 epochs.

Table 4: Training a tuned version of ResNet-50 on ImageNet with modern optimization techniques, as in Appendix C of [6]. For All Layers Sparse, every layer has a fixed sparsity. In contrast, we leave the very first convolution dense for First Layer Dense; the parameters in the first layer constitute only 0.04% of the total network.

| Method | Weights (%) | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|---|
| Sparse Networks from Scratch [6] | 10% | 72.9% | 91.5% |
| Ours - All Layers Sparse | 10% | 74.0% | 92.0% |
| Ours - First Layer Dense | 10% | 75.0% | 92.5% |
| Sparse Networks from Scratch [6] | 20% | 74.9% | 92.5% |
| Ours - All Layers Sparse | 20% | 76.2% | 93.0% |
| Ours - First Layer Dense | 20% | 76.6% | 93.4% |
| Sparse Networks from Scratch [6] | 30% | 75.9% | 92.9% |
| Ours - All Layers Sparse | 30% | 76.9% | 93.4% |
| Ours - First Layer Dense | 30% | 77.1% | 93.5% |
| Sparse Networks from Scratch [6] | 100% | 77.0% | 93.5% |
| Ours - Dense Baseline | 100% | 77.5% | 93.7% |
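Below is a minimal sketch of such a layer, reusing the straight-through `ChooseTopEdges` function from the Section 2.2 sketch. `SparseConv2d` and its `sparsity` argument are illustrative names of our own, not the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SparseConv2d(nn.Conv2d):
    """Convolution that uses only the top k% of weights (by magnitude) on the
    forward pass; gradients flow to, but not through, the zeroed weights."""

    def __init__(self, *args, sparsity=0.9, **kwargs):
        super().__init__(*args, **kwargs)
        self.keep = 1.0 - sparsity  # fraction of weights used on the forward pass

    def forward(self, x):
        k = max(1, int(self.keep * self.weight.numel()))
        w = ChooseTopEdges.apply(self.weight, k)  # straight-through top-k mask
        return F.conv2d(x, w, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Replacing each convolution of a network with such a layer, while leaving biases and batch norm dense as described above, gives a fixed per-layer sparsity that is maintained throughout training.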
## 4 Conclusion

We present a novel method for discovering neural wirings. With a simple algorithm we demonstrate a significant boost in accuracy over randomly wired networks. We benefit from overparameterization during training even when the resulting model is sparse. Just as in [35], our networks are free from the typical constraints of NAS. This work suggests exciting directions for more complex and efficient methods of discovering neural wirings.

## Acknowledgments

We thank Sarah Pratt, Mark Yatskar and the Beaker team. We also thank Tim Dettmers for his assistance and guidance in the experiments regarding sparse networks. This work is in part supported by DARPA N66001-19-2-4031, NSF IIS-165205, NSF IIS-1637479, NSF IIS-1703166, the Sloan Fellowship, the NVIDIA Artificial Intelligence Lab, the Allen Institute for Artificial Intelligence, and the AI2 Fellowship for AI. Computations on beaker.org were supported in part by credits from Google Cloud.

## References

[1] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[2] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.

[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.

[4] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807, 2017.

[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[6] Tim Dettmers and Luke S. Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

[7] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. arXiv preprint arXiv:1904.01681, 2019.

[8] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

[9] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket hypothesis at scale. arXiv preprint arXiv:1903.01611, 2019.

[10] Aidan N. Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. Learning sparse networks using targeted dropout, 2019.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.

[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[13] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261–2269, 2017.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2012.

[18] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS, 1991.

[19] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, pages 19–34, 2018.

[20] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2019.

[21] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2018.

[22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018.

[23] Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.

[24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[25] Ameya Prabhu, Girish Varma, and Anoop M. Namboodiri. Deep expander networks: Efficient deep networks from graph theory. In ECCV, 2017.

[26] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[27] Mark B. Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[28] Pedro H. P. Savarese and Michael Maire. Learning implicitly recurrent CNNs through parameter sharing. arXiv preprint arXiv:1902.09701, 2019.

[29] Lawrence F. Shampine. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):135–150, January 1986.

[30] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

[31] Yuandong Tian, Tina Jiang, Qucheng Gong, and Ari S. Morcos. Luck matters: Understanding training dynamics of deep ReLU networks. arXiv preprint arXiv:1905.13405, 2019.

[32] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[33] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, October 1990.

[34] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.

[35] Saining Xie, Alexander Kirillov, Ross B. Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.

[36] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.

[37] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.