# Expressive Power of Recurrent Neural Networks

Published as a conference paper at ICLR 2018

Valentin Khrulkov, Skolkovo Institute of Science and Technology, valentin.khrulkov@skolkovotech.ru
Alexander Novikov, National Research University Higher School of Economics, Institute of Numerical Mathematics RAS, novikov@bayesgroup.ru
Ivan Oseledets, Skolkovo Institute of Science and Technology, Institute of Numerical Mathematics RAS, i.oseledets@skoltech.ru

ABSTRACT

Deep neural networks are surprisingly efficient at solving practical tasks, but the theory behind this phenomenon is only starting to catch up with the practice. Numerous works show that depth is the key to this efficiency. A certain class of deep convolutional networks, namely those that correspond to the Hierarchical Tucker (HT) tensor decomposition, has been proven to have exponentially higher expressive power than shallow networks, i.e. a shallow network of exponential width is required to realize the same score function as computed by the deep architecture. In this paper, we prove the expressive power theorem (an exponential lower bound on the width of the equivalent shallow network) for a class of recurrent neural networks, namely those that correspond to the Tensor Train (TT) decomposition. This means that even processing an image patch by patch with an RNN can be exponentially more efficient than a (shallow) convolutional network with one hidden layer. Using theoretical results on the relation between the tensor decompositions, we compare the expressive powers of the HT- and TT-Networks. We also implement the recurrent TT-Networks and provide numerical evidence of their expressivity.

1 INTRODUCTION

Deep neural networks solve many practical problems both in computer vision via Convolutional Neural Networks (CNNs) (LeCun et al. (1995); Szegedy et al. (2015); He et al. (2016)) and in audio and text processing via Recurrent Neural Networks (RNNs) (Graves et al. (2013); Mikolov et al. (2011); Gers et al. (1999)). However, although many works focus on expanding the theoretical explanation of the success of neural networks (Martens & Medabalimi (2014); Delalleau & Bengio (2011); Cohen et al. (2016)), the full theory is yet to be developed.

One line of work focuses on expressive power, i.e. proving that some architectures are more expressive than others. Cohen et al. (2016) showed the connection between the Hierarchical Tucker (HT) tensor decomposition and CNNs, and used this connection to prove that deep CNNs are exponentially more expressive than their shallow counterparts. However, no such result exists for Recurrent Neural Networks.

The contributions of this paper are three-fold.

1. We show the connection between recurrent neural networks and the Tensor Train decomposition (see Sec. 4);
2. We formulate and prove the expressive power theorem for the Tensor Train decomposition (see Sec. 5), which in the language of RNNs can be interpreted as follows: to (exactly) emulate a recurrent neural network, a shallow (non-recurrent) architecture of exponentially larger width is required;
3. Combining the obtained and known results, we compare the expressive power of recurrent (TT), convolutional (HT), and shallow (CP) networks with each other (see Table 2).

Figure 1: Recurrent-type neural architecture that corresponds to the Tensor Train decomposition. Gray circles are bilinear maps (for details see Section 4).
2 DEEP LEARNING AND TENSOR NETWORKS

In this section, we review the known connections between tensor decompositions and deep learning and then show the new connection between the Tensor Train decomposition and recurrent neural networks.

Suppose that we have a classification problem and a dataset of pairs $\{(X^{(b)}, y^{(b)})\}_{b=1}^N$. Let us assume that each object $X^{(b)}$ is represented as a sequence of vectors

$$X^{(b)} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d), \quad \mathbf{x}_k \in \mathbb{R}^n, \qquad (1)$$

which is often the case. To find this kind of representation for images, several approaches are possible. The approach that we follow is to split an image into patches of small size, possibly overlapping, and arrange the vectorized patches in a certain order. An example of this procedure is presented in Fig. 2.

Figure 2: Representation of an image in the form of Eq. (1). A window of size $7 \times 7$ moves across the image of size $28 \times 28$, extracting image patches, which are then vectorized and arranged into a matrix of size $49 \times 16$.

We use lower-dimensional representations of $\{\mathbf{x}_k\}_{k=1}^d$. For this we introduce a collection of parameter-dependent feature maps $\{f_{\theta_\ell} : \mathbb{R}^n \to \mathbb{R}\}_{\ell=1}^m$, which are organized into a representation map $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$. A typical choice for such a map is

$$f_\theta(\mathbf{x}) = \sigma(A\mathbf{x} + \mathbf{b}),$$

that is, an affine map followed by some nonlinear activation $\sigma$. In the image case, if $X$ was constructed using the procedure described above, the map $f_\theta$ resembles traditional convolutional maps: each image patch is projected by an affine map with parameters shared across all the patches, followed by a pointwise activation function.

Score functions considered in Cohen et al. (2016) can be written in the form

$$l_y(X) = \langle \mathcal{W}_y, \Phi(X) \rangle, \qquad (2)$$

where $\Phi(X)$ is a feature tensor, defined as

$$\Phi(X)_{i_1 i_2 \ldots i_d} = f_{\theta_{i_1}}(\mathbf{x}_1) f_{\theta_{i_2}}(\mathbf{x}_2) \cdots f_{\theta_{i_d}}(\mathbf{x}_d), \qquad (3)$$

and $\mathcal{W}_y \in \mathbb{R}^{m \times m \times \cdots \times m}$ is a trainable weight tensor. The inner product in Eq. (2) is just the total sum of the entry-wise product of $\Phi(X)$ and $\mathcal{W}_y$. It is also shown that the hypothesis space of the form Eq. (2) has the universal representation property as $m \to \infty$. Similar score functions were considered in Novikov et al. (2016); Stoudenmire & Schwab (2016).

Storing the full tensor $\mathcal{W}_y$ requires an exponential amount of memory, and to reduce the number of degrees of freedom one can use a tensor decomposition. Various decompositions lead to specific network architectures, and in this context the expressive power of such a network is effectively measured by the ranks of the decomposition, which determine the complexity and the total number of degrees of freedom. For the Hierarchical Tucker (HT) decomposition, Cohen et al. (2016) proved the expressive power property, i.e. that for almost any tensor $\mathcal{W}_y$ its HT-rank is exponentially smaller than its CP-rank. We analyze Tensor Train Networks (TT-Networks), which correspond to a recurrent-type architecture. We prove that these networks also have exponentially larger representation power than shallow networks (which correspond to the CP-decomposition).

3 TENSOR FORMATS REMINDER

In this section we briefly review all the necessary definitions. By a $d$-dimensional tensor $\mathcal{X}$ we simply understand a multidimensional array:

$$\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}.$$

To work with tensors it is convenient to use their matricizations, which are defined as follows. Let us choose some subset of axes $s = \{i_1, i_2, \ldots, i_{m_s}\}$ of $\mathcal{X}$, and denote its complement by $t = \{j_1, j_2, \ldots, j_{d - m_s}\}$; e.g. for a 4-dimensional tensor, $s$ could be $\{1, 3\}$ and $t$ is then $\{2, 4\}$. The matricization of $\mathcal{X}$ specified by $(s, t)$ is a matrix

$$\mathcal{X}^{(s,t)} \in \mathbb{R}^{n_{i_1} n_{i_2} \cdots n_{i_{m_s}} \times n_{j_1} n_{j_2} \cdots n_{j_{d - m_s}}},$$

obtained simply by transposing and reshaping the tensor $\mathcal{X}$ into a matrix, which in practice, e.g. in Python, is performed using the numpy.reshape function.
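For concreteness, a matricization of this kind can be computed in a couple of lines of numpy; the sketch below (the axis split $s = \{1, 3\}$, $t = \{2, 4\}$ is taken from the example above, the mode sizes and random entries are arbitrary placeholders) transposes the chosen axes to the front and reshapes.

```python
import numpy as np

# A random 4-dimensional tensor with mode sizes n1, n2, n3, n4.
X = np.random.rand(2, 3, 4, 5)

# Matricization specified by s = {1, 3} and t = {2, 4} (1-based axes, as in the text):
# move the axes in s to the front, then reshape into a matrix of size (n1*n3) x (n2*n4).
s, t = (0, 2), (1, 3)                      # 0-based equivalents of {1, 3} and {2, 4}
X_st = np.transpose(X, s + t).reshape(
    X.shape[s[0]] * X.shape[s[1]],         # rows:    n1 * n3 = 8
    X.shape[t[0]] * X.shape[t[1]],         # columns: n2 * n4 = 15
)
print(X_st.shape)                          # (8, 15)
```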
Let us now introduce the tensor decompositions we will use later.

3.1 CANONICAL

The canonical decomposition, also known as CANDECOMP/PARAFAC or CP-decomposition for short (Harshman (1970); Carroll & Chang (1970)), is defined as follows:

$$\mathcal{X}_{i_1 i_2 \ldots i_d} = \sum_{\alpha=1}^{r} v_{1,\alpha}^{i_1} v_{2,\alpha}^{i_2} \cdots v_{d,\alpha}^{i_d}, \quad \mathbf{v}_{k,\alpha} \in \mathbb{R}^{n_k}. \qquad (4)$$

The minimal $r$ such that this decomposition exists is called the canonical rank or CP-rank of $\mathcal{X}$. We will use the notation $\operatorname{rank}_{CP} \mathcal{X} = r$. When $\operatorname{rank}_{CP} \mathcal{X} = 1$ it can be written simply as

$$\mathcal{X}_{i_1 i_2 \ldots i_d} = v_1^{i_1} v_2^{i_2} \cdots v_d^{i_d},$$

which means that the modes of $\mathcal{X}$ are perfectly separated from each other. Note that storing all entries of a tensor $\mathcal{X}$ requires $O(n^d)$ memory, while its canonical decomposition takes only $O(dnr)$. However, the problems of determining the exact CP-rank of a tensor and finding its canonical decomposition are NP-hard, and the problem of approximating a tensor by a tensor of lower CP-rank is ill-posed.

3.2 TENSOR TRAIN

A tensor $\mathcal{X}$ is said to be represented in the Tensor Train (TT) format (Oseledets (2011)) if each element of $\mathcal{X}$ can be computed as follows:

$$\mathcal{X}_{i_1 i_2 \ldots i_d} = \sum_{\alpha_1=1}^{r_1} \sum_{\alpha_2=1}^{r_2} \cdots \sum_{\alpha_{d-1}=1}^{r_{d-1}} G_1^{i_1 \alpha_1} G_2^{\alpha_1 i_2 \alpha_2} \cdots G_d^{\alpha_{d-1} i_d}, \qquad (5)$$

where the tensors $G_k \in \mathbb{R}^{r_{k-1} \times n_k \times r_k}$ ($r_0 = r_d = 1$ by definition) are the so-called TT-cores. The element-wise minimal ranks $\mathbf{r} = (r_1, \ldots, r_{d-1})$ such that decomposition (5) exists are called the TT-ranks, $\operatorname{rank}_{TT} \mathcal{X} = \mathbf{r}$. Note that for fixed values of $i_1, i_2, \ldots, i_d$, the right-hand side of Eq. (5) is just a product of matrices

$$G_1[1, i_1, :] \, G_2[:, i_2, :] \cdots G_d[:, i_d, 1].$$

Storing $\mathcal{X}$ in the TT-format requires $O(dnr^2)$ memory and thus also achieves significant compression of the data. Given some tensor $\mathcal{X}$, the algorithm for finding its TT-decomposition is constructive and is based on a sequence of Singular Value Decompositions (SVDs), which makes it more numerically stable than the CP-format. We also note that when all the TT-ranks are equal to each other, $\operatorname{rank}_{TT} \mathcal{X} = (r, r, \ldots, r)$, we will sometimes write for simplicity $\operatorname{rank}_{TT} \mathcal{X} = r$.
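To make Eq. (5) concrete, here is a minimal numpy sketch (the mode sizes, rank, and random cores are placeholders) that evaluates a single entry of a TT-tensor as the product of matrices $G_1[1, i_1, :] \, G_2[:, i_2, :] \cdots G_d[:, i_d, 1]$ and checks it against the explicitly assembled tensor.

```python
import numpy as np

d, n, r = 4, 3, 2                          # number of modes, mode size, TT-rank
ranks = [1] + [r] * (d - 1) + [1]          # r_0 = r_d = 1

# Random TT-cores G_k of shape (r_{k-1}, n_k, r_k).
cores = [np.random.rand(ranks[k], n, ranks[k + 1]) for k in range(d)]

def tt_element(cores, index):
    """Compute X[i_1, ..., i_d] as in Eq. (5): a product of slices of the cores."""
    result = np.eye(1)                     # 1x1 matrix, since r_0 = 1
    for G, i in zip(cores, index):
        result = result @ G[:, i, :]       # multiply by the matrix G_k[:, i_k, :]
    return result[0, 0]

# Reconstruct the full tensor by contracting neighbouring cores (summing over
# alpha_1, ..., alpha_{d-1}) and compare with the element-wise formula.
full = cores[0]
for G in cores[1:]:
    full = np.tensordot(full, G, axes=1)   # contract the last rank axis with the next core
full = full.reshape((n,) * d)

idx = (1, 0, 2, 1)
print(np.isclose(full[idx], tt_element(cores, idx)))   # True
```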
3.3 HIERARCHICAL TUCKER

A further generalization of the TT-format leads to the so-called Hierarchical Tucker (HT) format. The definition of the HT-format is a bit technical and requires introducing the dimension tree (Grasedyck, 2010, Definition 3.1). In the next section we give an informal introduction to the HT-format; for more details we refer the reader to Grasedyck (2010); Grasedyck & Hackbusch (2011); Hackbusch (2012).

4 ARCHITECTURES BASED ON TENSOR DECOMPOSITIONS

Figure 3: Nodes performing a multilinear map of their inputs: (a) bilinear unit, (b) multilinear unit. A d-linear unit is specified by a (d + 1)-dimensional core G.

To construct the tensorial networks we introduce bilinear and multilinear units, which perform a bilinear (multilinear) map of their inputs (see Fig. 3 for an illustration). Suppose that $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ and $G \in \mathbb{R}^{n \times m \times k}$. Then a bilinear unit $G$ performs a bilinear map $G : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^k$, defined by the formula

$$G(\mathbf{x}, \mathbf{y}) = \mathbf{z}, \quad z_k = \sum_{i,j} G_{ijk} x_i y_j. \qquad (6)$$

Similarly, for $\mathbf{x}_1 \in \mathbb{R}^{n_1}, \ldots, \mathbf{x}_d \in \mathbb{R}^{n_d}$, a multilinear unit $G \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d \times k}$ defines a multilinear map $G : \prod_{\ell=1}^{d} \mathbb{R}^{n_\ell} \to \mathbb{R}^{k}$ by the formula

$$G(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d) = \mathbf{z}, \quad z_j = \sum_{i_1, i_2, \ldots, i_d} G_{i_1 i_2 \ldots i_d j} \, x_1^{i_1} x_2^{i_2} \cdots x_d^{i_d}. \qquad (7)$$

In the rest of this section, we describe how to compute the score functions $l_y(X)$ (see Eq. (2)) for each class label $y$, which can then be fed into a loss function (such as cross-entropy). The architecture we propose to implement the score functions is illustrated in Fig. 1. For a vector $\mathbf{r} = (r_1, r_2, \ldots, r_{d-1})$ of positive integers (the rank hyperparameter) we define bilinear units $G_k \in \mathbb{R}^{r_{k-1} \times m \times r_k}$, with $r_0 = r_d = 1$. Note that because $r_0 = 1$, the first unit $G_1$ is in fact just a linear map, and because $r_d = 1$ the output of the network is just a number. At step $k \geq 2$, the representation $f_\theta(\mathbf{x}_k)$ and the output of the unit $G_{k-1}$ (a vector of size $r_{k-1}$) are fed into the unit $G_k$. Thus we obtain a recurrent-type neural network with multiplicative connections and without non-linearities.

To draw a connection with the Tensor Train decomposition we make the following observation. For each class label $y$, let us construct the tensor $\mathcal{W}_y$ using the definition of the TT-decomposition (Eq. (5)), taking the units $\{G_k\}_{k=1}^d$ used for constructing $l_y(X)$ as its TT-cores. Using the definition of the feature tensor (Eq. (3)) we find that the score functions computed by the network from Fig. 1 are given by the formula

$$l_y(X) = \sum_{i_1, i_2, \ldots, i_d} (\mathcal{W}_y)_{i_1 i_2 \ldots i_d} \, \Phi(X)_{i_1 i_2 \ldots i_d}, \qquad (8)$$

which is verified using Eq. (5) and Eq. (3). Thus, we can conclude that the network presented in Fig. 1 realizes the TT-decomposition of the weight tensor. We also note that the size of the output of the bilinear unit $G_k$ in the TT-Network is equal to $r_k$, which means that the TT-ranks correspond to the width of the network.

Let us now consider other tensor decompositions of the weight tensors $\mathcal{W}_y$, construct the corresponding network architectures, and compare their properties with the original TT-Network.

Figure 4: Examples of networks corresponding to various tensor decompositions: (a) CP-Network, (b) HT-Network.

A network corresponding to the CP-decomposition is visualized in Fig. 4a. Each multilinear unit $G_\alpha$ is given by a summand in Eq. (4), namely

$$G_\alpha^{i_1 i_2 \ldots i_d} = v_{1,\alpha}^{i_1} v_{2,\alpha}^{i_2} \cdots v_{d,\alpha}^{i_d}, \quad \alpha \in \{1, \ldots, r\}.$$

Note that the output of each $G_\alpha$ in this case is just a number, and in total there are $\operatorname{rank}_{CP} \mathcal{W}_y$ multilinear units. Their outputs are then summed up by the $\Sigma$ node. As before, the rank of the decomposition corresponds to the width of the network. However, in this case the network is shallow, meaning that there is only one hidden layer.

In Fig. 4b a network of another kind is presented. The tensor decomposition which underlies it is the Hierarchical Tucker decomposition, and hence we call it the HT-Network. It is constructed using a binary tree, where each node other than a leaf corresponds to a bilinear unit, and the leaves correspond to linear units. Inputs are fed into the leaves, and this data is passed along the tree to the root, which outputs a number. Ranks in this case are just the sizes of the outputs of the intermediate units. We will denote them by $\operatorname{rank}_{HT} \mathcal{X}$. These are the networks considered in Cohen et al. (2016), where the expressive power of such networks was analyzed and it was argued that they resemble traditional CNNs. In general, the Hierarchical Tucker decomposition may be constructed using an arbitrary tree, but not much theory is known in the general case.
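The recurrent computation described in this section can be sketched in a few lines of numpy. The snippet below is illustrative only (random bilinear units, a placeholder ReLU feature map, and toy sizes, not the implementation used in the experiments later in the paper); it propagates the hidden vector through the bilinear units and checks that the result agrees with the explicit contraction of $\mathcal{W}_y$ with $\Phi(X)$ from Eq. (8).

```python
import numpy as np

d, n, m, r = 4, 5, 3, 2                     # sequence length, input dim, feature dim, TT-rank
ranks = [1] + [r] * (d - 1) + [1]

# Placeholder feature map f_theta(x) = sigma(Ax + b) with a ReLU nonlinearity.
A, b = np.random.randn(m, n), np.random.randn(m)
def f_theta(x):
    return np.maximum(A @ x + b, 0.0)

# Bilinear units G_k of shape (r_{k-1}, m, r_k); they double as TT-cores of W_y.
G = [np.random.randn(ranks[k], m, ranks[k + 1]) for k in range(d)]
X = [np.random.randn(n) for _ in range(d)]  # input sequence x_1, ..., x_d

# Recurrent pass: z_k = G_k(z_{k-1}, f_theta(x_k)), starting from the trivial z_0 = (1,).
z = np.ones(1)
for k in range(d):
    z = np.einsum('a,amb,m->b', z, G[k], f_theta(X[k]))
score_recurrent = z[0]

# The same number via Eq. (8): contract the full TT-tensor W_y with the feature tensor Phi(X).
W = G[0]
for core in G[1:]:
    W = np.tensordot(W, core, axes=1)
W = W.reshape((m,) * d)
Phi = f_theta(X[0])
for x in X[1:]:
    Phi = np.multiply.outer(Phi, f_theta(x))
print(np.isclose(score_recurrent, np.sum(W * Phi)))    # True
```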
Our main theoretical results are related to a comparison of the expressive power of these kinds of networks. Namely, the question that we ask is as follows. Suppose that we are given a TT-Network. How complex would a CP- or HT-Network realizing the same score function be? A natural measure of complexity in this case is the rank of the corresponding tensor decomposition. To make transitioning between the tensor decomposition and deep learning vocabularies easier, we introduce the following table.

Table 1: Correspondence between the languages of Tensor Analysis and Deep Learning.

| Tensor Decompositions | Deep Learning |
| --- | --- |
| CP-decomposition | shallow network |
| TT-decomposition | RNN |
| HT-decomposition | CNN |
| rank of the decomposition | width of the network |

5 THEORETICAL ANALYSIS

In this section we prove the expressive power theorem for the Tensor Train decomposition, that is, we prove that for a random $d$-dimensional tensor in the TT-format with ranks $\mathbf{r}$ and mode sizes $\mathbf{n}$, with probability 1 this tensor will have exponentially large CP-rank. Note that the reverse result cannot hold true, since TT-ranks cannot be larger than CP-ranks: $\operatorname{rank}_{TT} \mathcal{X} \leq \operatorname{rank}_{CP} \mathcal{X}$. It is known that the problem of determining the exact CP-rank of a tensor is NP-hard. To bound the CP-rank of a tensor the following lemma is useful.

Lemma 1. Let $\mathcal{X}$ be a tensor with $\operatorname{rank}_{CP} \mathcal{X} = r$. Then for any matricization $\mathcal{X}^{(s,t)}$ we have $\operatorname{rank} \mathcal{X}^{(s,t)} \leq r$, where the ordinary matrix rank is assumed.

Proof. The proof is based on the following observation. Let

$$\mathcal{A}_{i_1 i_2 \ldots i_d} = v_1^{i_1} v_2^{i_2} \cdots v_d^{i_d}$$

be a CP-rank 1 tensor. Note that for any $(s, t)$ we have $\operatorname{rank} \mathcal{A}^{(s,t)} = 1$, because $\mathcal{A}^{(s,t)}$ can be written as $\mathbf{u}\mathbf{w}^{T}$ for some $\mathbf{u}$ and $\mathbf{w}$. The statement of the lemma then follows from the facts that matricization is a linear operation, and that for matrices $\operatorname{rank}(A + B) \leq \operatorname{rank} A + \operatorname{rank} B$.

We use this lemma to provide a lower bound on the CP-rank in the theorem formulated below. For example, suppose that we found some matricization of a tensor $\mathcal{X}$ which has matrix rank $r$. Then, by using the lemma, we can estimate that $\operatorname{rank}_{CP} \mathcal{X} \geq r$.
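As a quick numerical illustration of Lemma 1 (a sketch with arbitrary sizes and random CP factors; it is not part of the proof), one can build a tensor of CP-rank at most $r$ directly from Eq. (4) and observe that the matricization used above has matrix rank at most $r$.

```python
import numpy as np

d, n, r = 4, 6, 3                           # modes, mode size, number of CP terms

# Build X = sum_{alpha=1}^{r} v_{1,alpha} o v_{2,alpha} o ... o v_{d,alpha},
# so rank_CP(X) <= r by construction (Eq. (4)).
V = [np.random.randn(n, r) for _ in range(d)]
X = np.zeros((n,) * d)
for alpha in range(r):
    term = V[0][:, alpha]
    for k in range(1, d):
        term = np.multiply.outer(term, V[k][:, alpha])
    X += term

# Check the matricization from the text, s = {1, 3}, t = {2, 4}:
X_st = np.transpose(X, (0, 2, 1, 3)).reshape(n * n, n * n)
print(np.linalg.matrix_rank(X_st) <= r)     # True, in agreement with Lemma 1
```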
Let us denote $\mathbf{n} = (n_1, n_2, \ldots, n_d)$. The set of all tensors $\mathcal{X}$ with mode sizes $\mathbf{n}$ representable in the TT-format with $\operatorname{rank}_{TT} \mathcal{X} \leq \mathbf{r}$, for some vector of positive integers $\mathbf{r}$ (the inequality is understood entry-wise), forms an irreducible algebraic variety (Shafarevich & Hirsch (1994)), which we denote by $\mathcal{M}_{\mathbf{r}}$. This means that $\mathcal{M}_{\mathbf{r}}$ is defined by a set of polynomial equations in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, and that it cannot be written as a union (not necessarily disjoint) of two proper non-empty algebraic subsets. An example where the latter property does not hold would be the union of the axes $x = 0$ and $y = 0$ in $\mathbb{R}^2$, which is an algebraic set defined by the equation $xy = 0$. The main fact that we use about irreducible algebraic varieties is that any proper algebraic subset of them necessarily has measure 0 (Ilyashenko & Yakovenko (2008)).

For simplicity let us assume that the number of modes $d$ is even, that all mode sizes are equal to $n$, and that we consider $\mathcal{M}_{\mathbf{r}}$ with $\mathbf{r} = (r, r, \ldots, r)$, so for any $\mathcal{X} \in \mathcal{M}_{\mathbf{r}}$ we have $\operatorname{rank}_{TT} \mathcal{X} \leq (r, r, \ldots, r)$, entry-wise. As the main result we prove the following theorem.

Theorem 1. Suppose that $d = 2k$ is even. Define the following set:

$$B = \{\mathcal{X} \in \mathcal{M}_{\mathbf{r}} : \operatorname{rank}_{CP} \mathcal{X} < q^{d/2}\},$$

where $q = \min\{n, r\}$. Then $\mu(B) = 0$, where $\mu$ is the standard Lebesgue measure on $\mathcal{M}_{\mathbf{r}}$.

Proof. Our proof is based on applying Lemma 1 to a particular matricization of $\mathcal{X}$. Namely, we would like to show that for $s = \{1, 3, \ldots, d-1\}$, $t = \{2, 4, \ldots, d\}$ the set

$$B^{(s,t)} = \{\mathcal{X} \in \mathcal{M}_{\mathbf{r}} : \operatorname{rank} \mathcal{X}^{(s,t)} \leq q^{d/2} - 1\}$$

has measure 0. Indeed, by Lemma 1 we have $B \subseteq B^{(s,t)}$, so if $\mu(B^{(s,t)}) = 0$ then $\mu(B) = 0$ as well.

Note that $B^{(s,t)}$ is an algebraic subset of $\mathcal{M}_{\mathbf{r}}$ given by the conditions that the determinants of all $q^{d/2} \times q^{d/2}$ submatrices of $\mathcal{X}^{(s,t)}$ are equal to 0. Thus, to show that $\mu(B^{(s,t)}) = 0$ we need to find at least one $\mathcal{X}$ such that $\operatorname{rank} \mathcal{X}^{(s,t)} \geq q^{d/2}$. This follows from the fact that because $B^{(s,t)}$ is an algebraic subset of the irreducible algebraic variety $\mathcal{M}_{\mathbf{r}}$, it is either equal to $\mathcal{M}_{\mathbf{r}}$ or has measure 0, as was explained before. One way to construct such a tensor is as follows. Let us define the following TT-cores:

$$
\begin{aligned}
G_1^{i_1 \alpha_1} &= \delta_{i_1 \alpha_1}, & G_1 &\in \mathbb{R}^{1 \times n \times r}, & \\
G_k^{\alpha_{k-1} i_k \alpha_k} &= \delta_{i_k \alpha_{k-1}}, & G_k &\in \mathbb{R}^{r \times n \times 1}, & k = 2, 4, 6, \ldots, d-2, \\
G_k^{\alpha_{k-1} i_k \alpha_k} &= \delta_{i_k \alpha_k}, & G_k &\in \mathbb{R}^{1 \times n \times r}, & k = 3, 5, 7, \ldots, d-1, \\
G_d^{\alpha_{d-1} i_d} &= \delta_{i_d \alpha_{d-1}}, & G_d &\in \mathbb{R}^{r \times n \times 1}, &
\end{aligned}
\qquad (9)
$$

where $\delta_{i\alpha}$ is the Kronecker delta symbol: $\delta_{i\alpha} = 1$ if $i = \alpha$, and $\delta_{i\alpha} = 0$ if $i \neq \alpha$. The TT-ranks of the tensor $\mathcal{X}$ defined by the TT-cores (9) are equal to $\operatorname{rank}_{TT} \mathcal{X} = (r, 1, r, 1, \ldots, 1, r)$. Let us consider the matricization $\mathcal{X}^{(s,t)}$ of this tensor, with rows indexed by $(i_1, i_3, \ldots, i_{d-1})$ and columns by $(i_2, i_4, \ldots, i_d)$. The following identity holds for any values of the indices such that $i_k = 1, \ldots, q$, $k = 1, \ldots, d$:

$$\mathcal{X}^{(s,t)}_{(i_1, i_3, \ldots, i_{d-1}), (i_2, i_4, \ldots, i_d)} = \sum_{\alpha_1, \ldots, \alpha_{d-1}} G_1^{i_1 \alpha_1} \cdots G_d^{\alpha_{d-1} i_d} = \sum_{\alpha_1, \ldots, \alpha_{d-1}} \delta_{i_1 \alpha_1} \delta_{i_2 \alpha_1} \delta_{i_3 \alpha_3} \cdots \delta_{i_d \alpha_{d-1}} = \delta_{i_1 i_2} \delta_{i_3 i_4} \cdots \delta_{i_{d-1} i_d}. \qquad (10)$$

The last equality holds because $\sum_{\alpha_k=1}^{r} \delta_{i_k \alpha_k} \delta_{i_{k+1} \alpha_k} = \delta_{i_k i_{k+1}}$ for any $i_k = 1, \ldots, q$. We obtain that

$$\mathcal{X}^{(s,t)}_{(i_1, i_3, \ldots, i_{d-1}), (i_2, i_4, \ldots, i_d)} = \delta_{i_1 i_2} \delta_{i_3 i_4} \cdots \delta_{i_{d-1} i_d} = I_{(i_1, i_3, \ldots, i_{d-1}), (i_2, i_4, \ldots, i_d)}, \qquad (11)$$

where $I$ is the identity matrix of size $q^{d/2} \times q^{d/2}$ with $q = \min\{n, r\}$. To summarize, we found an example of a tensor $\mathcal{X}$ such that $\operatorname{rank}_{TT} \mathcal{X} \leq \mathbf{r}$ and the matricization $\mathcal{X}^{(s,t)}$ has a submatrix equal to the identity matrix of size $q^{d/2} \times q^{d/2}$, and hence $\operatorname{rank} \mathcal{X}^{(s,t)} \geq q^{d/2}$. This means that the canonical rank satisfies $\operatorname{rank}_{CP} \mathcal{X} \geq q^{d/2}$, which concludes the proof.

In other words, we have proved that for all TT-Networks besides a negligible set, the equivalent CP-Network will have exponentially large width. To compare the expressive powers of the HT- and TT-Networks we use the following theorem (Grasedyck, 2010, Section 5.3.2).

Theorem 2. For any tensor $\mathcal{X}$ the following estimates hold:

- If $\operatorname{rank}_{TT} \mathcal{X} \leq r$, then $\operatorname{rank}_{HT} \mathcal{X} \leq r^2$.
- If $\operatorname{rank}_{HT} \mathcal{X} \leq r$, then $\operatorname{rank}_{TT} \mathcal{X} \leq r^{\log_2(d)/2}$.

It is also known that these bounds are sharp (see Buczyńska et al. (2015)). Thus, we can summarize all the results in Table 2.

Table 2: Comparison of the expressive power of various networks. Given a network of width $r$ of the type specified in a column, each row gives the upper bound on the width of the equivalent network of the type specified in that row (we assume that the number of feature maps $m$ is greater than the width of the network $r$).

|  | TT-Network | HT-Network | CP-Network |
| --- | --- | --- | --- |
| TT-Network | $r$ | $r^{\log_2(d)/2}$ | $r$ |
| HT-Network | $r^2$ | $r$ | $r$ |
| CP-Network | $r^{d/2}$ | $r^{d/2}$ | $r$ |

**Example that requires exponential width in a shallow network.** The particular example used to prove Theorem 1 is not important per se, since the theorem states that TT is exponentially more expressive than CP for almost any tensor (for a set of tensors of measure one). However, to illustrate how the theorem translates into neural networks, consider the following task: given $d$ input vectors with $n$ elements each, compute the following measure of similarity between $\mathbf{x}_1, \ldots, \mathbf{x}_{d/2}$ and $\mathbf{x}_{d/2+1}, \ldots, \mathbf{x}_d$:

$$l(X) = (\mathbf{x}_1^{T} \mathbf{x}_{d/2+1}) \cdots (\mathbf{x}_{d/2}^{T} \mathbf{x}_d). \qquad (12)$$

We argue that it can be done with a TT-Network of width $n$ by using the TT-tensor $\mathcal{X}$ defined in the proof of Theorem 1 and feeding the input vectors in the following order: $\mathbf{x}_1, \mathbf{x}_{d/2+1}, \ldots, \mathbf{x}_{d/2}, \mathbf{x}_d$. The CP-Network representing the same function will have $n^{d/2}$ terms (and hence width $n^{d/2}$) and will correspond to expanding the brackets in the expression (12).
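A small numpy sketch of this example (the sizes and the interleaved feeding order below are chosen for illustration, consistent with the construction above): it builds the TT-cores of Eq. (9) for $d = 6$, checks that the matricization with rows indexed by the odd modes is the identity, and verifies that the resulting TT-Network of width $n$ computes the similarity measure of Eq. (12).

```python
import numpy as np

d, n = 6, 3
r = n                                       # width n, so q = min(n, r) = n

def delta_cores(d, n, r):
    """TT-cores from Eq. (9): Kronecker-delta tensors of alternating shapes."""
    cores = [np.zeros((1, n, r))]
    cores[0][0] = np.eye(n, r)              # G_1[0, i, a] = delta_{ia}
    for k in range(2, d):
        if k % 2 == 0:                      # k = 2, 4, ..., d-2: shape (r, n, 1)
            G = np.zeros((r, n, 1))
            G[:, :, 0] = np.eye(r, n)       # delta_{i, alpha_{k-1}}
        else:                               # k = 3, 5, ..., d-1: shape (1, n, r)
            G = np.zeros((1, n, r))
            G[0] = np.eye(n, r)             # delta_{i, alpha_k}
        cores.append(G)
    G_d = np.zeros((r, n, 1))
    G_d[:, :, 0] = np.eye(r, n)             # delta_{i_d, alpha_{d-1}}
    cores.append(G_d)
    return cores

cores = delta_cores(d, n, r)

# Assemble the full tensor and check that the (odd, even) matricization is the identity.
X = cores[0]
for G in cores[1:]:
    X = np.tensordot(X, G, axes=1)
X = X.reshape((n,) * d)
X_mat = np.transpose(X, (0, 2, 4, 1, 3, 5)).reshape(n ** 3, n ** 3)
print(np.allclose(X_mat, np.eye(n ** 3)))   # True: rank is q^{d/2}

# A TT-Network of width n on the interleaved inputs x_1, x_4, x_2, x_5, x_3, x_6
# (here d/2 = 3) reproduces l(X) = (x_1^T x_4)(x_2^T x_5)(x_3^T x_6) from Eq. (12).
xs = [np.random.randn(n) for _ in range(d)]
interleaved = [xs[0], xs[3], xs[1], xs[4], xs[2], xs[5]]
z = np.ones(1)
for G, v in zip(cores, interleaved):
    z = np.einsum('a,aib,i->b', z, G, v)
target = (xs[0] @ xs[3]) * (xs[1] @ xs[4]) * (xs[2] @ xs[5])
print(np.isclose(z[0], target))             # True
```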
**The case of equal TT-cores.** In analogy to traditional RNNs, we can consider a special class of Tensor Trains with the property that all the intermediate TT-cores are equal to each other, $G_2 = G_3 = \cdots = G_{d-1}$, which allows for processing sequences of varying length. We hypothesize that for this class exactly the same result as in Theorem 1 holds, i.e. if we denote the variety of Tensor Trains with equal TT-cores by $\mathcal{M}^{eq}_{\mathbf{r}}$, we believe that the following hypothesis holds true.

Hypothesis 1. Theorem 1 is also valid if $\mathcal{M}_{\mathbf{r}}$ is replaced by $\mathcal{M}^{eq}_{\mathbf{r}}$.

To prove it we can follow the same route as in the proof of Theorem 1. While we leave finding an analytical example of a tensor with the desired property of rank maximality to future work, we have verified numerically that randomly generated tensors $\mathcal{X}$ from $\mathcal{M}^{eq}_{\mathbf{r}}$ with $d = 6$, $n$ ranging from 2 to 10, and $r$ ranging from 2 to 20 (we have checked 1000 examples for each possible combination) indeed satisfy $\operatorname{rank}_{CP} \mathcal{X} \geq q^{d/2}$.

6 EXPERIMENTS

In this section, we experimentally check whether, as suggested by Theorem 1, the CP-Networks indeed require exponentially larger width compared to the TT-Networks to fit a dataset to the same level of accuracy. This is not clear from the theorem, since for natural data the functions that fit this data may lie in the negligible set where the ranks of the TT- and CP-Networks are related via a polynomial function (in contrast to the exponential relationship for all functions outside this set). Other possible reasons why the theory may be disconnected from practice are optimization issues (although a certain low-rank tensor exists, we may fail to find it with SGD) and the existence of the feature maps, which were not taken into account in the theory.

To train the TT- and CP-Networks, we implemented them in TensorFlow (Abadi et al. (2015)) and used the Adam optimizer with batch size 32 and learning rates sweeping across the values {4e-3, 2e-3, 1e-3, 5e-4}. Since we are focused on assessing the expressivity of the format (in contrast to its sensitivity to hyperparameters), we always choose the best performing run according to the training loss.

For the first experiment, we generate two-dimensional datasets with the Scikit-learn tools moons and circles (Pedregosa et al. (2011)) and for each training example feed the two features as two patches into the TT-Network (see Fig. 5). This example shows that the TT-Networks can implement nontrivial decision boundaries.

Figure 5: Decision boundaries of the TT-Network on toy 2-D datasets.

For the next experiments, we use the computer vision datasets MNIST (LeCun et al. (1990)) and CIFAR-10 (Krizhevsky & Hinton (2009)). MNIST is a collection of 70000 handwritten digits; CIFAR-10 is a dataset of 60000 natural images which are to be classified into 10 classes such as bird or cat. We feed raw pixel data into the TT- and CP-Networks (which extract patches and apply a trainable feature map to them, see Section 2). In our experiments we choose the patch size to be $8 \times 8$, the feature maps to be affine maps followed by the ReLU activation, and we set the number of such feature maps to 4.
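As a rough illustration of this preprocessing step (a sketch only: the non-overlapping stride, the random weights, and the array shapes are assumptions, not the authors' released code), the following converts a CIFAR-10-sized image into the sequence of patch features that is fed into the TT- and CP-Networks.

```python
import numpy as np

def image_to_patch_features(image, A, b, patch=8):
    """Split an image (H, W, C) into non-overlapping patch x patch blocks,
    vectorize each block, and apply the feature map f_theta(x) = ReLU(Ax + b)."""
    H, W, C = image.shape
    feats = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            x = image[i:i + patch, j:j + patch, :].reshape(-1)   # vectorized patch
            feats.append(np.maximum(A @ x + b, 0.0))             # shared affine map + ReLU
    return np.stack(feats)                                       # shape (d, m)

m, patch, C = 4, 8, 3                       # 4 feature maps, 8x8 patches, RGB channels
A = 0.01 * np.random.randn(m, patch * patch * C)
b = np.zeros(m)
image = np.random.rand(32, 32, 3)           # stand-in for a CIFAR-10 image
features = image_to_patch_features(image, A, b, patch)
print(features.shape)                       # (16, 4): d = 16 patches, m = 4 features each
```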
For MNIST, both the TT- and CP-Networks show reasonable performance (1.0 train accuracy, 0.95 test accuracy without regularizers, and 0.98 test accuracy with dropout 0.8 applied to each patch) even with ranks less than 5, which may indicate that the dataset is too simple to draw any conclusions, but it serves as a sanity check.

We report the training accuracy for CIFAR-10 in Fig. 6. Note that we did not use regularizers of any sort for this experiment, since we wanted to compare the expressive power of the networks (the best test accuracy we achieved this way on CIFAR-10 is 0.45 for the TT-Network and 0.2 for the CP-Network). In practice, the expressive power of the TT-Network is only polynomially better than that of the CP-Network (Fig. 6), probably because of the reasons discussed above.

Figure 6: Train accuracy on CIFAR-10 for the TT- and CP-Networks with respect to the rank of the decomposition and the total number of parameters (feature size 4 was used). Note that with rank increase the CP-Networks sometimes perform worse due to optimization issues.

7 RELATED WORK

A large body of work is devoted to analyzing the theoretical properties of neural networks (Cybenko (1989); Hornik et al. (1989); Shwartz-Ziv & Tishby (2017)). Recent studies focus on depth efficiency (Raghu et al. (2017); Montufar et al. (2014); Eldan & Shamir (2016); Sutskever et al. (2013)), in most cases providing worst-case guarantees such as bounds relating the widths of deep and shallow networks. Two works are especially relevant since they analyze depth efficiency from the viewpoint of tensor decompositions: the expressive power of the Hierarchical Tucker decomposition (Cohen et al. (2016)) and its generalization to handle activation functions such as ReLU (Cohen & Shashua (2016)). However, all of the works above focus on feedforward networks, while we tackle recurrent architectures. The only other work that tackles the expressivity of RNNs is the concurrent work that applies the TT-decomposition to explicitly model high-order interactions of the previous hidden states and analyzes the expressive power of the resulting architecture (Yu et al., 2017). This work, although very related to ours, analyzes a different class of recurrent models.

Models similar to the TT-Network were proposed in the literature, but were considered from the practical point of view, in contrast to the theoretical analysis provided in this paper. Novikov et al. (2016); Stoudenmire & Schwab (2016) proposed a model that implements Eq. (2), but with a predefined (not learnable) feature map $\Phi$. Wu et al. (2016) explored recurrent neural networks with multiplicative connections, which can be interpreted as TT-Networks with bilinear maps that are shared ($G_k = G$) and have a low-rank structure imposed on them.

8 CONCLUSION

In this paper, we explored the connection between recurrent neural networks and the Tensor Train decomposition and used it to prove the expressive power theorem, which states that a shallow network of exponentially large width is required to mimic a recurrent neural network. The downside of this approach is that it provides a worst-case analysis and does not take optimization issues into account. In future work, we would like to address the optimization issues by exploiting the Riemannian geometry properties of the set of TT-tensors of fixed rank, and to extend the analysis to networks with non-linearities inside the recurrent connections (as was done for CNNs in Cohen & Shashua (2016)).
ACKNOWLEDGEMENTS

This study was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Weronika Buczyńska, Jarosław Buczyński, and Mateusz Michałek. The Hackbusch conjecture on tensor formats. Journal de Mathématiques Pures et Appliquées, 104(4):749–761, 2015.

J. Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283–319, 1970.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pp. 955–963, 2016.

Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pp. 698–728, 2016.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940, 2016.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

Lars Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010.

Lars Grasedyck and Wolfgang Hackbusch. An introduction to hierarchical (H-) rank and TT-rank of tensors with examples. Computational Methods in Applied Mathematics, 11(3):291–304, 2011.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE, 2013.

Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer Science & Business Media, 2012.

Richard A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an explanatory multimodal factor analysis. 1970.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Yu. S. Ilyashenko and Sergei Yakovenko. Lectures on analytic differential equations, volume 86. American Mathematical Soc., 2008.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404, 1990.

Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528–5531. IEEE, 2011.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 2924–2932, 2014.

Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. arXiv preprint arXiv:1605.03795, 2016.

Ivan V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2847–2854, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Igor Rostislavovich Shafarevich and Kurt Augustus Hirsch. Basic algebraic geometry, volume 2. Springer, 1994.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Edwin Stoudenmire and David J. Schwab. Supervised learning with tensor networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 4799–4807. Curran Associates, Inc., 2016.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pp. 1139–1147, 2013.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R. Salakhutdinov. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 2856–2864, 2016.

Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.