# Tensor Switching Networks

Chuan-Yung Tsai*, Andrew Saxe*, David Cox (*equal contribution)
Center for Brain Science, Harvard University, Cambridge, MA 02138
{chuanyungtsai,asaxe,davidcox}@fas.harvard.edu

Abstract

We present a novel neural network algorithm, the Tensor Switching (TS) network, which generalizes the Rectified Linear Unit (ReLU) nonlinearity to tensor-valued hidden units. The TS network copies its entire input vector to different locations in an expanded representation, with the location determined by its hidden unit activity. In this way, even a simple linear readout from the TS representation can implement a highly expressive deep-network-like function. The TS network hence avoids the vanishing gradient problem by construction, at the cost of larger representation size. We develop several methods to train the TS network, including equivalent kernels for infinitely wide and deep TS networks, a one-pass linear learning algorithm, and two backpropagation-inspired representation learning algorithms. Our experimental results demonstrate that the TS network is indeed more expressive and consistently learns faster than standard ReLU networks.

1 Introduction

Deep networks [1, 2] continue to post impressive successes in a wide range of tasks, and the Rectified Linear Unit (ReLU) [3, 4] is arguably the most widely used simple nonlinearity. In this work we develop a novel deep learning algorithm, the Tensor Switching (TS) network, which generalizes the ReLU such that each hidden unit conveys a tensor, instead of a scalar, yielding a more expressive model. Like the ReLU network, the TS network is a linear function of its input, conditioned on the activation pattern of its hidden units. By separating the decision to activate from the analysis performed when active, even a linear classifier can reach back across all layers to the input of the TS network, implementing a deep-network-like function while avoiding the vanishing gradient problem [5], which can otherwise significantly slow down learning in deep networks. The trade-off is the representation size.

We exploit the properties of TS networks to develop several methods suitable for learning in different scaling regimes: their equivalent kernels for SVMs on small to medium datasets, a one-pass linear learning algorithm which visits each data point only once for use with very large but simpler datasets, and two backpropagation-inspired representation learning algorithms for more generic use. Our experimental results show that TS networks are indeed more expressive and consistently learn faster than standard ReLU networks.

Related work is briefly summarized as follows. With respect to improving the nonlinearities, the idea of severing activation and analysis weights (or having multiple sets of weights) in each hidden layer has been studied in [6, 7, 8]. Reordering activation and analysis is proposed by [9]. On tackling the vanishing gradient problem, tensor methods are used by [10] to train single-hidden-layer networks. Convex learning and inference in various deep architectures can also be found in [11, 12, 13]. Finally, conditional linearity of deep ReLU networks is also used by [14], mainly to analyze their performance. In comparison, the TS network does not simply reorder or sever activation and analysis within each hidden layer.
Instead, it is a cross-layer generalization of these concepts, which can be applied with most recent deep learning architectures [15, 9], not only to increase their expressiveness, but also to help avoid the vanishing gradient problem (see Sec. 2.3).

Figure 1: (Left) A single-hidden-layer standard (i.e. Scalar Switching) ReLU network, whose hidden activity $X_1$ feeds a linear readout $y$. (Right) A single-hidden-layer Tensor Switching ReLU network, whose expanded representation $Z_1$ feeds a linear readout $y$; each hidden unit conveys a vector of activities: inactive units (top-most unit) convey a vector of zeros, while active units (bottom two units) convey a copy of their input.

2 Tensor Switching Networks

In the following we first construct the definition of shallow (single-hidden-layer) TS networks, then generalize the definition to deep TS networks, and finally describe their qualitative properties. For simplicity, we only show fully-connected architectures using the ReLU nonlinearity. However, other popular nonlinearities, e.g. max pooling and maxout [16], in addition to ReLU, are also supported in both fully-connected and convolutional architectures.

2.1 Shallow TS Networks

The TS-ReLU network is a generalization of standard ReLU networks that permits each hidden unit to convey an entire tensor of activity (see Fig. 1). To describe it, we build up from the standard ReLU network. Consider a ReLU layer with weight matrix $W_1 \in \mathbb{R}^{n_1 \times n_0}$ responding to an input vector $X_0 \in \mathbb{R}^{n_0}$. The resulting hidden activity $X_1 \in \mathbb{R}^{n_1}$ of this layer is $X_1 = \max(0_{n_1}, W_1 X_0) = H(W_1 X_0) \odot (W_1 X_0)$, where $H$ is the Heaviside step function and $\odot$ denotes elementwise product. The rightmost expression splits apart each hidden unit's decision to activate, represented by the term $H(W_1 X_0)$, from the information (i.e. result of analysis) it conveys when active, denoted by $W_1 X_0$. We then go one step further to rewrite $X_1$ as

$$X_1 = \big(\underbrace{H(W_1 X_0) \otimes X_0}_{Z_1} \odot\, W_1\big) \bullet 1_{n_0}, \quad (1)$$

where we have made use of the following tensor operations: the vector-tensor cross product $C = A \otimes B \Leftrightarrow c_{i,j,k,\dots} = a_i\, b_{j,k,\dots}$, the tensor-matrix Hadamard product $C = A \odot B \Leftrightarrow c_{\dots,j,i} = a_{\dots,j,i}\, b_{j,i}$, and the tensor summative reduction $C = A \bullet 1_n \Leftrightarrow c_{\dots,k,j} = \sum_{i=1}^{n} a_{\dots,k,j,i}$. In (1), the input vector $X_0$ is first expanded into a new matrix representation $Z_1 \in \mathbb{R}^{n_1 \times n_0}$ with one row per hidden unit. If a hidden unit is active, the input vector $X_0$ is copied to the corresponding row. Otherwise, the row is filled with zeros. Finally, this expanded representation $Z_1$ is collapsed back by projection onto $W_1$.

The central idea behind the TS-ReLU network is to learn a linear classifier directly from the rich, expanded representation $Z_1$, rather than collapsing it back to the lower dimensional $X_1$. That is, in a standard ReLU network, the hidden layer activity $X_1$ is sent through a linear classifier $f_X(W_X X_1)$ trained to minimize some loss function $\mathcal{L}_X(f_X)$. In the TS-ReLU network, by contrast, the expanded representation $Z_1$ is sent to a linear classifier $f_Z(W_Z\, \mathrm{vec}(Z_1))$ with loss function $\mathcal{L}_Z(f_Z)$. Each TS-ReLU neuron thus transmits a vector of activities (a row of $Z_1$), compared to a standard ReLU neuron that transmits a single scalar (see Fig. 1). Because of this difference, in the following we call the standard ReLU network a Scalar Switching ReLU (SS-ReLU) network.
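To make the shallow construction concrete, here is a minimal NumPy sketch of the expansion in (1) (an illustration, not the authors' implementation; the layer sizes and variable names are arbitrary). The final assertion checks that contracting $Z_1$ against $W_1$ recovers the ordinary ReLU activity, as stated above.

```python
import numpy as np

def shallow_ts_expand(x0, W1):
    """Shallow TS-ReLU expansion: copy the input into the rows of Z1 that
    correspond to active hidden units (the underbraced term in Eq. (1))."""
    gates = (W1 @ x0 > 0).astype(x0.dtype)   # H(W1 x0), shape (n1,)
    return gates[:, None] * x0[None, :]      # Z1, shape (n1, n0): row i is x0 if unit i is active, else zeros

rng = np.random.default_rng(0)
n0, n1 = 8, 4                                # hypothetical layer sizes
x0 = rng.standard_normal(n0)
W1 = rng.standard_normal((n1, n0))

Z1 = shallow_ts_expand(x0, W1)
X1 = (Z1 * W1).sum(axis=1)                   # contraction against W1 collapses Z1 back to the SS-ReLU activity
assert np.allclose(X1, np.maximum(0.0, W1 @ x0))
```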
2.2 Deep TS Networks

The construction given above generalizes readily to deeper networks. Define a nonlinear expansion operation as $X \boxtimes W = H(WX) \otimes X$ and a linear contraction operation as $Z \boxdot W = (Z \odot W) \bullet 1_n$, such that (1) becomes $X_l = ((X_{l-1} \boxtimes W_l) \odot W_l) \bullet 1_{n_{l-1}} = X_{l-1} \boxtimes W_l \boxdot W_l$ for a given layer $l$ with $X_l \in \mathbb{R}^{n_l}$ and $W_l \in \mathbb{R}^{n_l \times n_{l-1}}$. A deep SS-ReLU network with $L$ layers may then be expressed as a sequence of alternating expansion and contraction steps,

$$X_L = X_0 \boxtimes W_1 \boxdot W_1 \cdots \boxtimes W_L \boxdot W_L. \quad (2)$$

To obtain the deep TS-ReLU network, we further define the ternary expansion operation $Z \boxtimes_X W = H(WX) \otimes Z$, such that the decision to activate is based on the SS-ReLU variables $X$, but the entire tensor $Z$ is transmitted when the associated hidden unit is active. Let $Z_0 = X_0$. The $l$-th layer activity tensor of a TS network can then be written as $Z_l = H(W_l X_{l-1}) \otimes Z_{l-1} = Z_{l-1} \boxtimes_{X_{l-1}} W_l \in \mathbb{R}^{n_l \times n_{l-1} \times \cdots \times n_0}$. Thus compared to a deep SS-ReLU network, a deep TS-ReLU network simply omits the contraction stages,

$$Z_L = Z_0 \boxtimes_{X_0} W_1 \cdots \boxtimes_{X_{L-1}} W_L. \quad (3)$$

Because there are no contraction steps, the order of $Z_l \in \mathbb{R}^{n_l \times n_{l-1} \times \cdots \times n_0}$ grows with depth, adding an additional dimension for each layer. One interpretation of this scheme is that, if a hidden unit at layer $l$ is active, the entire tensor $Z_{l-1}$ is copied to the appropriate position in $Z_l$;^1 otherwise a tensor of zeros is copied. Another equivalent interpretation is that the input vector $X_0$ is copied to a given position $Z_l(i, j, \dots, k, :)$ only if hidden units $i, j, \dots, k$ at layers $l, l-1, \dots, 1$ respectively are all active. Otherwise, $Z_l(i, j, \dots, k, :) = 0_{n_0}$. Hence activity propagation in the deep TS-ReLU network preserves the layered structure of a deep SS-ReLU network, in which a chain of hidden units across layers must activate for activity to propagate from input to output.

2.3 Properties

The TS network decouples a hidden unit's decision to activate (as encoded by the activation weights $\{W_l\}$) from the analysis performed on the input when the unit is active (as encoded by the analysis weights $W_Z$). This distinguishing feature leads to the following three properties.

Cross-layer analysis. Since the TS representation preserves the layered structure of a deep network and offers direct access to the entire input (parcellated by the activated hidden units), a simple linear readout can effectively reach back across layers to the input and thus implicitly learns analysis weights for all layers at one time in $W_Z$. Therefore it avoids the vanishing gradient problem by construction.^2

Error-correcting analysis. As activation and analysis are severed, a careful selection of the analysis weights can clean up a certain amount of inexactitude in the choice to activate, e.g. from noisy or even random activation weights. For the SS network, by contrast, bad activation also implies bad analysis.

Fine-grained analysis. To see this, we consider single-hidden-layer TS and SS networks with just one hidden unit. The TS unit, when active, conveys the entire input vector, and hence any full-rank linear map from input to output may be implemented. The SS unit, when active, conveys just a single scalar, and hence can only implement a rank-1 linear map between input and output. By choosing the right analysis weights, a TS network can always implement an SS network,^3 but not vice versa. As such, it clearly has greater modeling capacity for a fixed number of hidden units.

Although the TS representation is highly expressive, it comes at the cost of an exponential increase in the size of its representation with depth, i.e. $\prod_l n_l$. This renders TS networks of substantial width and depth very challenging (except as kernels).
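Likewise, a minimal NumPy sketch of the recursion in (3) (again illustrative rather than the authors' code, with arbitrary widths) makes the growth of the representation size with depth explicit:

```python
import numpy as np

def deep_ts_expand(x0, weights):
    """Deep TS-ReLU expansion (Eq. (3)): grow Z by one axis per layer, with the
    switching driven by the SS-ReLU activities X of a standard network."""
    X, Z = x0, x0
    for W in weights:                                            # W_l has shape (n_l, n_{l-1})
        gates = (W @ X > 0).astype(x0.dtype)                     # H(W_l X_{l-1}), shape (n_l,)
        Z = gates.reshape((-1,) + (1,) * Z.ndim) * Z[None, ...]  # copy Z_{l-1} where active, zeros elsewhere
        X = np.maximum(0.0, W @ X)                               # SS-ReLU recursion supplies the switching
    return Z

rng = np.random.default_rng(0)
widths = [8, 4, 3]                                               # hypothetical n_0, n_1, n_2
Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
Z2 = deep_ts_expand(rng.standard_normal(widths[0]), Ws)
print(Z2.shape)                                                  # (3, 4, 8): the size grows as prod_l n_l
```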
But as we will show, the expressiveness permits TS networks to perform fairly well without having to be extremely wide and deep, and often noticeably better than SS networks of the same sizes. Also, TS networks of useful sizes can still be implemented with reasonable computing resources, especially when combined with the techniques in Sec. 4.3.

3 Equivalent Kernels

In this section we derive equivalent kernels for TS-ReLU networks with arbitrary depth and an infinite number of hidden units at each layer, with the aim of providing theoretical insight into how TS-ReLU is analytically different from SS-ReLU. These kernels represent the extreme of infinite (but unlearned) features, and might be used in SVMs on datasets of small to medium size.

^1 For convolutional networks using max pooling, the convolutional-window-sized input patch winning the max pooling is copied. In other words, different nonlinearities only change the way the input is switched.
^2 It is in spirit similar to models with skip connections to the output [17, 18], although not exactly reducible to them.
^3 Therefore TS networks are also universal function approximators [19].

Figure 2: Equivalent kernels (Linear, SS-ReLU and TS-ReLU for L = 1, 2, 3) as a function of the angle between unit-length vectors x and y. The deep SS-ReLU kernel converges to 1 everywhere as $L \to \infty$, while the deep TS-ReLU kernel converges to 1 at the origin and 0 everywhere else.

Consider a single-hidden-layer TS-ReLU network with $n_1$ hidden units in which each element of the activation weight matrix $W_1 \in \mathbb{R}^{n_1 \times n_0}$ is i.i.d. zero-mean Gaussian with arbitrary standard deviation $\sigma$. The infinite-width random TS-ReLU kernel between two vectors $x, y \in \mathbb{R}^{n_0}$ is the dot product between their expanded representations (scaled by $\sqrt{2/n_1}$ for convenience) in the limit of infinite hidden units,

$$k_1^{TS}(x, y) = \lim_{n_1 \to \infty} \mathrm{vec}\big(\sqrt{2/n_1}\; x \boxtimes W_1\big) \cdot \mathrm{vec}\big(\sqrt{2/n_1}\; y \boxtimes W_1\big) = 2\, \mathbb{E}\big[H(w \cdot x)\, H(w \cdot y)\big]\; x \cdot y,$$

where $w \sim \mathcal{N}(0, \sigma^2 I)$ is an $n_0$-dimensional random Gaussian vector. The expectation is the probability that a randomly chosen vector $w$ lies within 90 degrees of both $x$ and $y$. Because $w$ is drawn from an isotropic Gaussian, if $x$ and $y$ differ by an angle $\theta$, then only the fraction $\frac{\pi - \theta}{2\pi}$ of randomly drawn $w$ will be within 90 degrees of both, yielding the equivalent kernel of a single-hidden-layer infinite-width random TS-ReLU network given in (5):^4

$$k_1^{SS}(x, y) = k^{SS}(\theta)\; x \cdot y = \Big(1 + \frac{\tan\theta - \theta}{\pi}\Big)\, x \cdot y, \quad (4)$$

$$k_1^{TS}(x, y) = k^{TS}(\theta)\; x \cdot y = \Big(1 - \frac{\theta}{\pi}\Big)\, x \cdot y. \quad (5)$$

Figure 2 compares (5) against the linear kernel and the single-hidden-layer infinite-width random SS-ReLU kernel (4) from [20] (see Linear, TS L = 1 and SS L = 1). It has two important qualitative features. First, it has a discontinuous derivative at $\theta = 0$, and hence a much sharper peak than the other kernels.^5 Intuitively this means that a very close match counts for much more than a moderately close match. Second, unlike the SS-ReLU kernel, which is non-negative everywhere, the TS-ReLU kernel still has a negative lobe, though it is substantially reduced relative to the linear kernel. Intuitively this means that being dissimilar to a support vector can provide evidence against a particular classification, but this negative evidence is much weaker than in a standard linear kernel.
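As a sanity check on the single-layer kernels, the sketch below (illustrative only, not the authors' code) implements the two base profiles as written in (4) and (5) and compares the TS case against a Monte Carlo estimate built from random Gaussian activation weights; the vectors and the angle are arbitrary choices.

```python
import numpy as np

def k_ss(theta):
    """Base SS-ReLU kernel profile k^SS(theta) of Eq. (4); numerically unstable near pi/2 (cf. footnote 6)."""
    return 1.0 + (np.tan(theta) - theta) / np.pi

def k_ts(theta):
    """Base TS-ReLU kernel profile k^TS(theta) of Eq. (5)."""
    return 1.0 - theta / np.pi

def ts_kernel_mc(x, y, n1=200_000, seed=0):
    """Monte Carlo estimate of the single-layer TS-ReLU kernel with random Gaussian activation weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n1, x.size))
    p_both = ((W @ x > 0) & (W @ y > 0)).mean()   # estimate of E[H(w.x) H(w.y)]
    return 2.0 * p_both * (x @ y)

theta = 1.0                                       # angle (radians) between the two unit vectors below
x, y = np.array([1.0, 0.0]), np.array([np.cos(theta), np.sin(theta)])
print(k_ts(theta) * (x @ y), ts_kernel_mc(x, y))  # closed form vs. Monte Carlo: should roughly agree
print(k_ss(theta) * (x @ y))                      # SS-ReLU value at the same angle, for comparison
```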
To derive kernels for deeper TS-ReLU networks, we need to consider the deeper SS-ReLU kernels as well, since the TS network's activation and analysis are severed, and its activation depends on its SS-ReLU counterpart. Based upon the recursive formulation from [20], we first define the zeroth-layer kernel $k_0^{\ast}(x, y) = x \cdot y$ and the generalized angle $\theta_l^{\ast} = \cos^{-1}\big(k_l^{\ast}(x, y) / \sqrt{k_l^{\ast}(x, x)\, k_l^{\ast}(y, y)}\big)$, where $\ast$ denotes SS or TS. Then we can easily get $k_{l+1}^{SS}(x, y) = k^{SS}(\theta_l^{SS})\, k_l^{SS}(x, y)$^6 and $k_{l+1}^{TS}(x, y) = k^{TS}(\theta_l^{SS})\, k_l^{TS}(x, y)$, where $k^{\ast}$ follows (4) or (5) accordingly.

Figure 2 also plots the deep TS-ReLU and SS-ReLU kernels as a function of depth. The shape of these kernels reveals sharply divergent behavior between the TS and SS networks. As depth increases, the equivalent kernel of the TS network falls off ever more rapidly as the angle between input vectors increases. This means that vectors must be an ever closer match to retain a high kernel value. As argued earlier, this highlights the ability of the TS network to pick up on and amplify small differences between inputs, resulting in a quasi-nearest-neighbor behavior. In contrast, the equivalent kernel of the SS network tends to one as depth increases. Thus, rather than amplifying small differences, it collapses them with depth, such that even very dissimilar vectors receive high kernel values.

^4 This proof is succinct using a geometric view, while a longer proof can be found in the Supplementary Material. As the kernel is directly defined as a dot product between feature vectors, it is naturally a valid kernel.
^5 Interestingly, a similar kernel is also observed by [21] for models with explicit skip connections.
^6 We write (4) and $k_l^{SS}$ differently from [20] for cleaner comparisons against TS-ReLU kernels. However, they are numerically unstable expressions and are not used in our experiments to replace the original ones in [20].

Figure 3: Inverted backpropagation learning flowchart, where solid arrows denote signal flow, dashed arrows denote pseudo-gradient flow, and = denotes equivalence. (Top row) The SS pathway. (Bottom row) The TS and auxiliary pathways, where the $Z_l$'s are related by nonlinear expansions and the $A_l$'s are related by linear contractions. The resulting $A_L$ is equivalent to the alternating expansion and contraction in the SS pathway that yields $X_L$.

4 Learning Algorithms

In the following we present three learning algorithms suitable for different scenarios. One-pass ridge regression in Sec. 4.1 learns only the linear readout (i.e. analysis weights $W_Z$), leaving the hidden-layer representations (i.e. activation weights $\{W_l\}$) random; hence it is convex and exactly solvable. Inverted backpropagation in Sec. 4.2 learns both analysis and activation weights. Linear rotation-compression in Sec. 4.3 also learns both sets of weights, but learns the activation weights in an indirect way.

4.1 Linear Readout Learning via One-pass Ridge Regression

In this scheme, we leverage the intuition that precision in the decision for a hidden unit to activate is less important than carefully tuned analysis weights, which can in part compensate for poorly tuned activation weights. We randomly draw and fix the activation weights $\{W_l\}$, and then solve for the analysis weights $W_Z$ using ridge regression, which can be done in a single pass through the dataset. First, each data point $p = 1, \dots, P$ is expanded into its tensor representation $Z_L^p$ and then accumulated into the correlation matrices $C_{ZZ} = \sum_p \mathrm{vec}(Z_L^p)\, \mathrm{vec}(Z_L^p)^{\top}$ and $C_{yZ} = \sum_p y^p\, \mathrm{vec}(Z_L^p)^{\top}$. After all data points are processed once, the analysis weights are determined as $W_Z = C_{yZ}\,(C_{ZZ} + \lambda I)^{-1}$, where $\lambda$ is an L2 regularization parameter.
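A minimal sketch of this one-pass scheme follows (an illustration, not the authors' implementation): expand stands in for the chosen TS expansion, e.g. the deep_ts_expand sketch above, and data/labels are assumed to be arrays of input vectors and one-hot targets.

```python
import numpy as np

def one_pass_ridge(expand, data, labels, lam=1e-3):
    """One-pass ridge regression for the TS linear readout W_Z: accumulate the
    correlation matrices over a single sweep through the data, then solve."""
    d = expand(data[0]).size                     # size of vec(Z_L)
    C_zz = np.zeros((d, d))
    C_yz = np.zeros((labels.shape[1], d))
    for x, y in zip(data, labels):               # each data point is visited exactly once
        z = expand(x).ravel()                    # vec(Z_L^p)
        C_zz += np.outer(z, z)
        C_yz += np.outer(y, z)
    return C_yz @ np.linalg.inv(C_zz + lam * np.eye(d))   # W_Z = C_yZ (C_ZZ + lambda I)^{-1}

# Hypothetical usage with the deep_ts_expand sketch above:
#   W_Z = one_pass_ridge(lambda x: deep_ts_expand(x, Ws), X_train, Y_train)
```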
Unlike a standard SS network, which in this setting would only be able to select a linear readout from the top hidden layer to the final classification decision, the TS network offers direct access to entire input vectors, parcellated by the hidden units they activate. In this way, even a linear readout can effectively reach back across layers to the input, implementing a complex function not representable by an SS network with random filters. However, this scheme requires high memory usage, on the order of $O\big(\prod_{l=0}^{L} n_l^2\big)$ for storing $C_{ZZ}$, and an even higher computation cost^7 for solving $W_Z$, which makes deep architectures (i.e. $L > 1$) impractical. Therefore, this scheme may best suit online learning applications which allow only one-time access to data, but do not require a deep classifier.

^7 Nonetheless this is a one-time cost and can still be advantageous over other slowly converging algorithms.

4.2 Representation Learning via Inverted Backpropagation

The ridge regression learning uses random activation weights and only learns analysis weights. Here we provide a gradient-based procedure to learn both sets of weights. Learning the analysis weights (i.e. the final linear layer) $W_Z$ simply requires $\frac{\partial \mathcal{L}_Z}{\partial W_Z}$, which is generally easy to compute. However, since the activation weights $W_l$ in the TS network only appear inside the Heaviside step function $H$, with zero (or undefined) derivative, the gradient $\frac{\partial \mathcal{L}_Z}{\partial W_l}$ is also zero. To bypass this, we introduce a sequence of auxiliary variables $A_l$ defined by $A_0 = Z_L$ and the recursion $A_l = A_{l-1} \boxdot W_l \in \mathbb{R}^{n_L \times n_{L-1} \times \cdots \times n_l}$. We then derive the pseudo gradient using the proposed inverted backpropagation as

$$\widehat{\frac{\partial \mathcal{L}_Z}{\partial W_l}} = \frac{\partial \mathcal{L}_Z}{\partial A_0} \Big(\frac{\partial A_1}{\partial A_0}\Big)^{\dagger} \cdots \Big(\frac{\partial A_l}{\partial A_{l-1}}\Big)^{\dagger} \frac{\partial A_l}{\partial W_l}, \quad (6)$$

where $\dagger$ denotes the Moore-Penrose pseudoinverse. Because the $A_l$'s are related via the linear contraction operator, these derivatives are non-zero and easy to compute. We find this works sufficiently well as a non-zero proxy for $\frac{\partial \mathcal{L}_Z}{\partial W_l}$.

Our motivation with this scheme is to recover the learning behavior in SS networks. To see this, first note that $A_L = A_0 \boxdot W_1 \cdots \boxdot W_L = X_L$ (see Fig. 3). This reflects the fact that the TS and SS networks are linear once the active set of hidden units is known, such that the order of expansion and contraction steps has no effect on the final output. Hence the linear contraction steps, which alternate with expansion steps in (2), can instead be gathered at the end after all expansion steps. The gradient in the SS network is then

$$\frac{\partial \mathcal{L}_X}{\partial W_l} = \underbrace{\frac{\partial \mathcal{L}_X}{\partial A_L} \frac{\partial A_L}{\partial A_{L-1}} \cdots \frac{\partial A_{l+1}}{\partial A_l} \cdots \frac{\partial A_1}{\partial A_0}}_{\frac{\partial \mathcal{L}_X}{\partial A_0}} \Big(\frac{\partial A_1}{\partial A_0}\Big)^{\dagger} \cdots \Big(\frac{\partial A_l}{\partial A_{l-1}}\Big)^{\dagger} \frac{\partial A_l}{\partial W_l}. \quad (7)$$

Replacing $\frac{\partial \mathcal{L}_X}{\partial A_0}$ in (7) with $\frac{\partial \mathcal{L}_Z}{\partial A_0}$, such that the expanded representation may influence the inverted gradient, we recover (6). Compared to one-pass ridge regression, this scheme controls the memory and time complexities at $O(\prod_l n_l)$, which makes training of a moderately sized TS network on modern computing resources feasible. The ability to train activation weights also relaxes the assumption that analysis weights can clean up inexact activations caused by using random weights.
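Purely as an illustration of the idea (not the authors' implementation), the sketch below instantiates a pseudo-gradient in the spirit of (6) for a single hidden layer, where $A_0 = Z_1$ and $A_1 = Z_1 \boxdot W_1$; the readout gradient g_Z1 (i.e. $\partial \mathcal{L}_Z / \partial Z_1$) is assumed to be supplied by the linear classifier.

```python
import numpy as np

def inverted_backprop_grad_W1(Z1, W1, g_Z1):
    """Pseudo-gradient of the activation weights for a single-hidden-layer TS network,
    in the spirit of Eq. (6): push dL_Z/dA_0 through the Moore-Penrose pseudoinverse of
    the contraction Jacobian dA_1/dA_0, then through dA_1/dW_1."""
    n1, n0 = W1.shape
    # Jacobian of the contraction A_1 = (Z1 * W1).sum(axis=1) w.r.t. vec(Z1):
    # dA_1[i]/dZ1[i, j] = W1[i, j], i.e. block-diagonal once Z1 is flattened row-major.
    J = np.zeros((n1, n1 * n0))
    for i in range(n1):
        J[i, i * n0:(i + 1) * n0] = W1[i]
    g_A1 = g_Z1.ravel() @ np.linalg.pinv(J)      # proxy for dL_Z/dA_1
    # dA_1[i]/dW1[i, j] = Z1[i, j], so the W1 pseudo-gradient rescales each row of Z1.
    return g_A1[:, None] * Z1

# Hypothetical usage: Z1 = shallow_ts_expand(x0, W1); g_Z1 from the readout loss; then
#   W1 -= learning_rate * inverted_backprop_grad_W1(Z1, W1, g_Z1)
```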
4.3 Indirect Representation Learning via Linear Rotation-Compression

Although the inverted backpropagation learning controls memory and time complexities better than the one-pass ridge regression, the exponential growth of a TS network's representation still severely constrains its potential for application in recent deep learning architectures, where network width and depth can easily exceed, e.g., a thousand. In addition, the success of recent deep learning architectures also heavily depends on the acceleration provided by highly optimized GPU-enabled libraries, where the operations of the previous learning schemes are mostly unsupported. To address these two concerns, we provide a standard backpropagation-compatible learning algorithm, where we no longer keep separate $X$ and $Z$ variables. Instead we define $X_l = W'_l\, \mathrm{vec}(X_{l-1} \boxtimes W_l)$, which directly flattens the expanded representation and linearly projects it against $W'_l \in \mathbb{R}^{n'_l \times n_l n_{l-1}}$. In this scheme, even though $W_l$ still lacks a non-zero gradient, the $W'_{l-1}$ of the previous layer can be learned using backpropagation to properly rotate $X_{l-1}$, such that it can be utilized by $W_l$ and the TS nonlinearity. Therefore, the representation learning here becomes indirect. To simultaneously control the representation size, one can easily let $n'_l < n_l n_{l-1}$ such that $W'_l$ becomes compressive. Interestingly, we find $n'_l = n_l$ often works surprisingly well, which suggests that linearly compressing the expanded TS representation back to the size of an SS representation can still retain its advantage; it is therefore used as the default. This scheme can also be combined with inverted backpropagation if learning $W_l$ is still desired.

To understand why linear compression does not remove the TS representation power, we note that it is not equivalent to the linear contraction operation $\boxdot$, where each tensor-valued unit is down-projected independently. Linear compression introduces extra interaction between tensor-valued units. Another way to view linear compression's role is through the kernel analysis of Sec. 3: adding a linear layer does not change the shape of a given TS kernel.

5 Experimental Results

Our experiments focus on comparing TS and SS networks with the goal of determining how the TS nonlinearities differ from their SS counterparts. SVMs using SS-ReLU and TS-ReLU kernels are implemented in Matlab based on libsvm-compact [22]. TS networks and all three learning algorithms in Sec. 4 are implemented in Python based on NumPy's ndarray data structure. Both implementations utilize multicore CPU acceleration. In addition, TS networks with only the linear rotation-compression learning are also implemented in Keras, which enjoys much faster GPU acceleration. We adopt three datasets, viz. MNIST, CIFAR10 and SVHN2, where we reserve the last 5,000 training images for validation. We also include SVHN2's extra training set (except for SVMs^8) in the training process, and zero-pad MNIST images such that all datasets have the same spatial resolution of 32×32.

^8 Due to the prohibitive kernel matrix size, as SVMs here can only be solved in the dual form.

Table 1: Error rate (%) and relative run time (×) comparison. Numbers in parentheses give the corresponding depths; for CNNs, convolutional and fully-connected depths are listed separately. SVM rows report a single error rate per dataset, and TS MLP RR is a one-pass method.

| Model | MNIST one-pass | MNIST asymptotic | CIFAR10 one-pass | CIFAR10 asymptotic | SVHN2 one-pass | SVHN2 asymptotic | Time |
|---|---|---|---|---|---|---|---|
| SS SVM | - | 1.40 (5) | - | 43.18 (7) | - | 21.60 (1) | 1.0 |
| TS SVM | - | 1.40 (3) | - | 43.60 (2) | - | 20.38 (1) | 2.1 |
| SS MLP | 16.34 (2) | 2.36 (3) | 66.41 (1) | 46.91 (2) | 30.24 (3) | 12.20 (3) | 1.0 |
| TS MLP RR | 2.99 (1) | - | 47.71 (1) | - | 27.11 (1) | - | 156.2 |
| TS MLP LRC | 3.33 (2) | 2.06 (2) | 55.69 (1) | 46.87 (2) | 20.42 (2) | 12.58 (3) | 11.7 |
| TS MLP IBP-LRC | 3.33 (1) | 2.33 (1) | 55.69 (1) | 45.86 (2) | 20.20 (2) | 12.63 (3) | 17.4 |
| SS CNN | 43.74 (3+1) | 1.08 (4+2) | 74.84 (3+3) | 26.73 (5+2) | 13.69 (7+1) | 4.96 (6+1) | 1.0 |
| TS CNN LRC | 3.85 (5+3) | 0.86 (6+2) | 54.40 (3+3) | 25.74 (8+3) | 9.13 (7+3) | 5.06 (6+3) | 2.0 |

RR = One-Pass Ridge Regression, LRC = Linear Rotation-Compression, IBP = Inverted Backpropagation.
Figure 4: Comparison of SS CNN and TS CNN LRC models. (Left) Each dot's coordinates indicate the differences in one-pass and asymptotic error rates between one pair of SS CNN and TS CNN LRC models sharing the same hyperparameters. The first quadrant shows where the TS CNN LRC is better in both errors. (Right) Validation error rate vs. training time (in seconds) on CIFAR10 for the shallower (L = 3+1), intermediate (L = 6+2) and deeper (L = 9+3) SS and TS models.

For SVMs, we grid search for both kernels with depth from 1 to 10, C from 1 to 1,000, and PCA dimension reduction of the images to 32, 64, 128, 256, or no reduction. For SS and TS networks with fully-connected (i.e. MLP) architectures, we grid search for depth from 1 to 3 and width (including PCA of the input) from 32 to 256 based on our Python implementation. For SS and TS networks with convolutional (i.e. CNN) architectures, we adopt VGG-style [15] convolutional layers with 3 standard SS convolution-max pooling blocks,^9 where each block can have up to three 3×3 convolutions, plus 1 to 3 fully-connected SS or TS layers of fixed width 256. CNN experiments are based on our Keras implementation. For all MLPs and CNNs, we universally use SGD with learning rate $10^{-3}$, momentum 0.9, L2 weight decay $10^{-3}$ and batch size 128 to reduce the grid search complexity by focusing on architectural hyperparameters. All networks are trained for 100 epochs on MNIST and CIFAR10, and 20 epochs on SVHN2, without data augmentation. The source code and scripts for reproducing our experiments are available at https://github.com/coxlab/tsnet.

Table 1 summarizes our experimental results, including both one-pass (i.e. first-epoch) and asymptotic (i.e. all-epoch) error rates and the corresponding depths (for CNNs, convolutional and fully-connected layers are listed separately). The TS nonlinearities perform better in almost all categories, confirming our theoretical insights in Sec. 2.3: the cross-layer analysis (as evidenced by their low error rates after only one epoch of training), the error-correcting analysis (on MNIST and CIFAR10, for instance, the one-pass error rates of TS MLP RR using fixed random activation are close to the asymptotic error rates of TS MLP LRC and IBP-LRC with trained activation), and the fine-grained analysis (the TS networks in general achieve better asymptotic error rates than their SS counterparts).

^9 This decision is mainly to accelerate the experimental process, since TS convolution runs much slower, but we also observe that TS nonlinearities in lower layers are not always helpful. See later for more discussion.

Figure 5: Visualization of filters learned by backpropagation (SS MLP) and inverted backpropagation (TS MLP IBP) on (Top) MNIST, (Middle) CIFAR10 and (Bottom) SVHN2.

To further demonstrate how using TS nonlinearities affects the distribution of performance across different architectures (here, mainly depth), we plot the performance gains (viz. one-pass and asymptotic error rates) introduced by using the TS nonlinearities on all CNN variants in Fig. 4. The fact that most dots are in the first quadrant (and none in the third quadrant) suggests the TS nonlinearities are predominantly beneficial. Also, to ease the concern that the TS networks' higher complexity may simply consume their advantage in actual run time, we provide examples of learning progress (i.e. validation error rate) over run time in Fig. 4.
The results suggest that even our unoptimized TS network implementation can still provide sizable gains in learning speed. Finally, to verify the effectiveness of inverted backpropagation in learning useful activation filters even without the actual gradient, we train single-hidden-layer SS and TS MLPs with 16 hidden units each (without using PCA dimension reduction of the input) and visualize the learned filters in Fig. 5. The results suggest inverted backpropagation functions equally well.

6 Discussion

Why do TS networks learn quickly? In general, the TS network sidesteps the vanishing gradient problem as it skips the long chain of linear contractions against the analysis weights (i.e. the auxiliary pathway in Fig. 3). Its linear readout has direct access to the full input vector, which is switched to different parts of the highly expressive expanded representation. This directly accelerates learning. Also, a well-flowing gradient confers benefits beyond the TS layers; e.g. SS layers placed before TS layers also learn faster, since the TS layers self-organize rapidly, permitting useful error signals to flow to the lower layers faster.^10 Lastly, when using the inverted backpropagation or linear rotation-compression learning, although $\{W_l\}$ or $\{W'_l\}$ do not learn as fast as $W_Z$, and may still be quite random in the first few epochs, the error-correcting nature of $W_Z$ can still compensate and maintain learning progress.

^10 This is a crucial aspect of gradient descent dynamics in layered structures, which behave like a chain: the weakest link must change first [23, 24].

Challenges toward deeper TS networks. As shown in Fig. 2, the equivalent kernels of deeper TS networks can be extremely sharp and discriminative, which unavoidably hurts invariant recognition of dissimilar examples. This may explain why we find having TS nonlinearities in only higher (instead of all) layers works better, since the lower SS layers can form invariant representations for the higher TS layers to classify. To remedy this, we may need to consider other types of regularization for $W_Z$ (instead of L2) or other smoothing techniques [25, 26].

Future work. Our main future direction is to improve the TS network's scalability, which may require more parallelism (e.g. multi-GPU processing) and more customization (e.g. GPU kernels utilizing the sparsity of TS representations), with preferably more memory storage/bandwidth (e.g. GPUs using 3D-stacked memory). With improved scalability, we also plan to further verify the TS nonlinearity's efficiency in state-of-the-art architectures [27, 9, 18], which are still computationally prohibitive with our current implementation.

Acknowledgments

We would like to thank James Fitzgerald, Mien Brabeeba Wang, Scott Linderman, and Yu Hu for fruitful discussions. We also thank the anonymous reviewers for their valuable comments. This work was supported by NSF (IIS 1409097), IARPA (contract D16PC00002), and the Swartz Foundation.

References

[1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 2015.
[2] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, 2015.
[3] R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. Douglas, and S. Seung, Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit, Nature, 2000.
[4] V. Nair and G. Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, in ICML, 2010.
[5] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies, in A Field Guide to Dynamical Recurrent Networks, 2001.
[6] A. Courville, J. Bergstra, and Y. Bengio, A Spike and Slab Restricted Boltzmann Machine, in AISTATS, 2011.
[7] K. Konda, R. Memisevic, and D. Krueger, Zero-bias autoencoders and the benefits of co-adapting features, in ICLR, 2015.
[8] R. Srivastava, K. Greff, and J. Schmidhuber, Training Very Deep Networks, in NIPS, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, Identity Mappings in Deep Residual Networks, in ECCV, 2016.
[10] M. Janzamin, H. Sedghi, and A. Anandkumar, Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, arXiv, 2015.
[11] L. Deng and D. Yu, Deep Convex Net: A Scalable Architecture for Speech Pattern Classification, in Interspeech, 2011.
[12] B. Amos and Z. Kolter, Input-Convex Deep Networks, in ICLR Workshop, 2015.
[13] Ö. Aslan, X. Zhang, and D. Schuurmans, Convex Deep Learning via Normalized Kernels, in NIPS, 2014.
[14] S. Wang, A. Mohamed, R. Caruana, J. Bilmes, M. Philipose, M. Richardson, K. Geras, G. Urban, and O. Aslan, Analysis of Deep Neural Networks with the Extended Data Jacobian Matrix, in ICML, 2016.
[15] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in ICLR, 2015.
[16] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout Networks, in ICML, 2013.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going Deeper with Convolutions, in CVPR, 2015.
[18] G. Huang, Z. Liu, and K. Weinberger, Densely Connected Convolutional Networks, arXiv, 2016.
[19] S. Sonoda and N. Murata, Neural network with unbounded activation functions is universal approximator, Applied and Computational Harmonic Analysis, 2015.
[20] Y. Cho and L. Saul, Large-Margin Classification in Infinite Neural Networks, Neural Computation, 2010.
[21] D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani, Avoiding pathologies in very deep networks, in AISTATS, 2014.
[22] J. Andén and S. Mallat, Deep Scattering Spectrum, IEEE T-SP, 2014.
[23] A. Saxe, J. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, in ICLR, 2014.
[24] A. Saxe, A deep learning theory of perceptual learning dynamics, in COSYNE, 2015.
[25] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii, Distributional Smoothing with Virtual Adversarial Training, in ICLR, 2016.
[26] Q. Bai, S. Rosenberg, Z. Wu, and S. Sclaroff, Differential Geometric Regularization for Supervised Learning of Classifiers, in ICML, 2016.
[27] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for Simplicity: The All Convolutional Net, in ICLR Workshop, 2015.