# Neural Networks and Rational Functions

Matus Telgarsky

*University of Illinois, Urbana-Champaign; work completed while visiting the Simons Institute. Correspondence to: your friend. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

**Abstract.** Neural networks and rational functions efficiently approximate each other. In more detail, it is shown here that for any ReLU network, there exists a rational function of degree $O(\mathrm{poly}\log(1/\epsilon))$ which is $\epsilon$-close, and similarly for any rational function there exists a ReLU network of size $O(\mathrm{poly}\log(1/\epsilon))$ which is $\epsilon$-close. By contrast, polynomials need degree $\Omega(\mathrm{poly}(1/\epsilon))$ to approximate even a single ReLU. When converting a ReLU network to a rational function as above, the hidden constants depend exponentially on the number of layers, which is shown to be tight; in other words, a compositional representation can be beneficial even for rational functions.

## 1. Overview

Significant effort has been invested in characterizing the functions that can be efficiently approximated by neural networks. The goal of the present work is to characterize neural networks more finely by finding a class of functions which is not only well-approximated by neural networks, but also well-approximates neural networks.

The function class investigated here is the class of rational functions: functions represented as the ratio of two polynomials, where the denominator is a strictly positive polynomial. For simplicity, the neural networks are taken to always use the ReLU activation $\sigma_r(x) := \max\{0, x\}$; for a review of neural networks and their terminology, the reader is directed to Section 1.4. For the sake of brevity, a network with ReLU activations is simply called a ReLU network.

*Figure 1. Rational, polynomial, and ReLU network fit to "spike", a function which is $1/x$ along $[1/4, 1]$ and $0$ elsewhere.*

### 1.1. Main results

The main theorem here states that ReLU networks and rational functions approximate each other well in the sense that $\epsilon$-approximating one class with the other requires a representation whose size is polynomial in $\ln(1/\epsilon)$, rather than being polynomial in $1/\epsilon$.

**Theorem 1.1.**

1. Let $\epsilon \in (0, 1]$ and nonnegative integer $k$ be given. Let $p : [0, 1]^d \to [-1, +1]$ and $q : [0, 1]^d \to [2^{-k}, 1]$ be polynomials of degree $r$, each with $s$ monomials. Then there exists a function $f : [0, 1]^d \to \mathbb{R}$, representable as a ReLU network of size (number of nodes)
$$O\left( k^7 \ln(1/\epsilon)^3 + \min\left\{ srk \ln(sr/\epsilon),\ sdk^2 \ln(dsr/\epsilon)^2 \right\} \right),$$
such that
$$\sup_{x \in [0,1]^d} \left| f(x) - \frac{p(x)}{q(x)} \right| \le \epsilon.$$

2. Let $\epsilon \in (0, 1]$ be given. Consider a ReLU network $f : [-1, +1]^d \to \mathbb{R}$ with at most $m$ nodes in each of at most $k$ layers, where each node computes $z \mapsto \sigma_r(a^\top z + b)$, and where the pair $(a, b)$ (possibly distinct across nodes) satisfies $\|a\|_1 + |b| \le 1$. Then there exists a rational function $g : [-1, +1]^d \to \mathbb{R}$ with degree (maximum degree of numerator and denominator)
$$O\left( \ln(k/\epsilon)^k m^k \right)$$
such that
$$\sup_{x \in [-1,+1]^d} |f(x) - g(x)| \le \epsilon.$$

Perhaps the main wrinkle is the appearance of $m^k$ when approximating neural networks by rational functions. The following theorem shows that this dependence is tight.

**Theorem 1.2.** Let any integer $k \ge 3$ be given.
There exists a function $f : \mathbb{R} \to \mathbb{R}$, computed by a ReLU network with $2k$ layers, each with $2$ nodes, such that any rational function $g : \mathbb{R} \to \mathbb{R}$ with at most $2^{k-2}$ total terms in the numerator and denominator must satisfy
$$\int_{[0,1]} |f(x) - g(x)|\, dx \ge \frac{1}{64}.$$

Note that this statement implies the desired difficulty of approximation, since a gap in the above integral ($L_1$) distance implies a gap in the earlier uniform distance ($L_\infty$), and furthermore an $r$-degree rational function necessarily has at most $2r + 2$ total terms in its numerator and denominator.

As a final piece of the story, note that the conversion between rational functions and ReLU networks is more seamless if instead one converts to rational networks, meaning neural networks where each activation function is a rational function.

**Lemma 1.3.** Let a ReLU network $f : [-1, +1]^d \to \mathbb{R}$ be given as in Theorem 1.1, meaning $f$ has at most $l$ layers and each node computes $z \mapsto \sigma_r(a^\top z + b)$, where the pair $(a, b)$ (possibly distinct across nodes) satisfies $\|a\|_1 + |b| \le 1$. Then there exists a rational function $R$ of degree $O(\ln(l/\epsilon)^2)$ so that replacing each $\sigma_r$ in $f$ with $R$ yields a function $g : [-1, +1]^d \to \mathbb{R}$ with
$$\sup_{x \in [-1,+1]^d} |f(x) - g(x)| \le \epsilon.$$

Combining Theorem 1.2 and Lemma 1.3 yields an intriguing corollary.

**Corollary 1.4.** For every $k \ge 3$, there exists a function $f : \mathbb{R} \to \mathbb{R}$, computed by a rational network with $O(k)$ layers and $O(k)$ total nodes, each node invoking a rational activation of degree $O(k)$, such that every rational function $g : \mathbb{R} \to \mathbb{R}$ with fewer than $2^{k-2}$ total terms in the numerator and denominator satisfies
$$\int_{[0,1]} |f(x) - g(x)|\, dx \ge \frac{1}{128}.$$

The hard-to-approximate function $f$ is a rational network which has a description of size $O(k^2)$. Despite this, attempting to approximate it with a rational function of the usual form requires a description of size $\Omega(2^k)$. Said another way: even for rational functions, there is a benefit to a neural network representation!

*Figure 2. Polynomial and rational fit to the threshold function.*

### 1.2. Auxiliary results

The first thing to stress is that Theorem 1.1 is impossible with polynomials: namely, while it is true that ReLU networks can efficiently approximate polynomials (Yarotsky, 2016; Safran & Shamir, 2016; Liang & Srikant, 2017), polynomials on the other hand require degree $\Omega(\mathrm{poly}(1/\epsilon))$, rather than $O(\mathrm{poly}(\ln(1/\epsilon)))$, to approximate a single ReLU, or equivalently the absolute value function (Petrushev & Popov, 1987, Chapter 4, page 73).

Another point of interest is the depth needed when converting a rational function to a ReLU network. Theorem 1.1 is impossible if the depth is $o(\ln(1/\epsilon))$: specifically, it is impossible to approximate the degree 1 rational function $x \mapsto 1/x$ with size $O(\ln(1/\epsilon))$ but depth $o(\ln(1/\epsilon))$.

**Proposition 1.5.** Set $f(x) := 1/x$, the reciprocal map. For any $\epsilon > 0$ and ReLU network $g : \mathbb{R} \to \mathbb{R}$ with $l$ layers and $m < (27648\,\epsilon)^{-1/(2l)}/2$ nodes,
$$\int_{[1/2,3/4]} |f(x) - g(x)|\, dx > \epsilon.$$

Lastly, the implementation of division in a ReLU network requires a few steps, arguably the most interesting being a continuous "switch statement", which computes reciprocals differently based on the magnitude of the input. The ability to compute switch statements appears to be a fairly foundational operation available to neural networks and rational functions (Petrushev & Popov, 1987, Theorem 5.2), but is not available to polynomials (since otherwise they could approximate the ReLU).
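To make the first point above concrete, the following is a small numerical sketch (not from the paper; a least-squares fit stands in for the optimal polynomial) of how slowly polynomial approximations of the absolute value improve with degree, consistent with the quoted $\Omega(\mathrm{poly}(1/\epsilon))$ lower bound.

```python
# A small numerical sketch (not from the paper): least-squares polynomial fits
# to |x| on [-1, +1] improve only polynomially in the degree, consistent with
# the Omega(poly(1/eps)) lower bound cited from Petrushev & Popov (1987).
import numpy as np
from numpy.polynomial import Chebyshev

xs = np.linspace(-1.0, 1.0, 4001)   # dense grid standing in for [-1, +1]
target = np.abs(xs)                  # |x|; note relu(x) = (x + |x|) / 2

for degree in (3, 7, 15, 31, 63):
    fit = Chebyshev.fit(xs, target, degree)   # well-conditioned least-squares fit
    err = np.max(np.abs(fit(xs) - target))
    print(f"degree {degree:3d}: max error ~ {err:.4f}")

# The error decays only like 1/degree, so reaching accuracy eps needs degree
# on the order of 1/eps -- in contrast with the O(ln(1/eps)^2) degree achieved
# by the rational (Newman) approximation recalled in Section 2.1.
```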
### 1.3. Related work

The results of the present work follow a long line of work on the representation power of neural networks and related functions. The ability of ReLU networks to fit continuous functions was no doubt proved many times, but it appears the earliest reference is to Lebesgue (Newman, 1964, page 1), though of course results of this type are usually given much more contemporary attribution (Cybenko, 1989). More recently, it has been shown that certain function classes only admit succinct representations with many layers (Telgarsky, 2015). This has been followed by proofs showing the possibility for a depth 3 function to require exponentially many nodes when rewritten with 2 layers (Eldan & Shamir, 2016). There are also a variety of other results giving the ability of ReLU networks to approximate various function classes (Cohen et al., 2016; Poggio et al., 2017).

Most recently, a variety of works pointed out that neural networks can approximate polynomials, and thus smooth functions, essentially by Taylor's theorem (Yarotsky, 2016; Safran & Shamir, 2016; Liang & Srikant, 2017). This somewhat motivates the present work, since polynomials cannot in turn approximate neural networks with a dependence $O(\mathrm{poly}\log(1/\epsilon))$: they require degree $\Omega(1/\epsilon)$ even for a single ReLU.

Rational functions are extensively studied in the classical approximation theory literature (Lorentz et al., 1996; Petrushev & Popov, 1987). This literature draws close connections between rational functions and splines (piecewise polynomial functions), a connection which has been used in the machine learning literature to draw further connections to neural networks (Williamson & Bartlett, 1991). It is in this approximation theory literature that one can find the following astonishing fact: not only is it possible to approximate the absolute value function (and thus the ReLU) over $[-1, +1]$ to accuracy $\epsilon > 0$ with a rational function of degree $O(\ln(1/\epsilon)^2)$ (Newman, 1964), but moreover the optimal rate is known (Petrushev & Popov, 1987; Zolotarev, 1877)! These results form the basis of those results here which show that rational functions can approximate ReLU networks. (Approximation theory results also provide other functions, and types of neural networks, which rational functions can approximate well, but the present work will stick to the ReLU for simplicity.)

An ICML reviewer revealed prior work which was embarrassingly overlooked by the author: it has been known since decades ago (Beame et al., 1986) that neural networks using threshold nonlinearities (i.e., the map $x \mapsto \mathbb{1}[x \ge 0]$) can approximate division, and moreover the proof is similar to the proof of part 1 of Theorem 1.1! Moreover, other work on threshold networks invoked Newman polynomials to prove lower bounds for linear threshold networks (Paturi & Saks, 1994). Together this suggests not only that the connections between rational functions and neural networks are tight (and somewhat known/unsurprising), but also that threshold networks and ReLU networks have perhaps more similarities than is suggested by the differing VC dimension bounds, approximation results, and algorithmic results (Goel et al., 2017).

*Figure 3. Newman polynomials of degree 5, 9, 13.*

### 1.4. Further notation

Here is a brief description of the sorts of neural networks used in this work.
Neural networks represent computation as a directed graph, where nodes consume the outputs of their parents, apply a computation to them, and pass the resulting value onward. In the present work, nodes take their parents' outputs $z$ and compute $\sigma_r(a^\top z + b)$, where $a$ is a vector, $b$ is a scalar, and $\sigma_r(x) := \max\{0, x\}$; another popular choice of nonlinearity is the sigmoid $x \mapsto (1 + \exp(-x))^{-1}$. The graphs in the present work are acyclic and connected, with a single node lacking children designated as the univariate output, but the literature contains many variations on all of these choices.

As stated previously, a rational function $f : \mathbb{R}^d \to \mathbb{R}$ is a ratio of two polynomials. Following conventions in the approximation theory literature (Lorentz et al., 1996), the denominator polynomial will always be strictly positive. The degree of a rational function is the maximum of the degrees of its numerator and denominator.

## 2. Approximating ReLU networks with rational functions

This section will develop the proofs of part 2 of Theorem 1.1, Theorem 1.2, Lemma 1.3, and Corollary 1.4.

### 2.1. Newman polynomials

The starting point is a seminal result in the theory of rational functions (Zolotarev, 1877; Newman, 1964): there exists a rational function of degree $O(\ln(1/\epsilon)^2)$ which can approximate the absolute value function along $[-1, +1]$ to accuracy $\epsilon > 0$. This in turn gives a way to approximate the ReLU, since
$$\sigma_r(x) = \max\{0, x\} = \frac{x + |x|}{2}. \tag{2.1}$$

The construction here uses the Newman polynomials (Newman, 1964): given an integer $r$, define
$$N_r(x) := \prod_{i=1}^{r} \left( x + \exp(-i/\sqrt{r}) \right).$$
The Newman polynomials $N_5$, $N_9$, and $N_{13}$ are depicted in Figure 3. Typical polynomials in approximation theory, for instance the Chebyshev polynomials, have very active oscillations; in comparison, the Newman polynomials look a little funny, lying close to 0 over $[-1, 0]$, and quickly increasing monotonically over $[0, 1]$.

The seminal result of Newman (1964) is that, for $x \in [-1, +1]$,
$$\left| \, |x| - x\, \frac{N_r(x) - N_r(-x)}{N_r(x) + N_r(-x)} \, \right| \le 3 \exp(-\sqrt{r})/2.$$
Thanks to this bound and eq. (2.1), it follows that the ReLU can be approximated to accuracy $\epsilon > 0$ by rational functions of degree $O(\ln(1/\epsilon)^2)$. (Some basics on Newman polynomials, as needed in the present work, can be found in Appendix A.1.)

### 2.2. Proof of Lemma 1.3

Now that a single ReLU can be easily converted to a rational function, the next task is to replace every ReLU in a ReLU network with a rational function, and compute the approximation error. This is precisely the statement of Lemma 1.3.

The proof of Lemma 1.3 is an induction on layers, with full details relegated to the appendix. The key computation, however, is as follows. Let $R(x)$ denote a rational approximation to $\sigma_r$. Fix a layer $i + 1$, let $H(x)$ denote the multi-valued mapping computed by layer $i$, and let $H_R(x)$ denote the mapping obtained by replacing each $\sigma_r$ in $H$ with $R$. Fix any node in layer $i + 1$, and let $x \mapsto \sigma_r(a^\top H(x) + b)$ denote its output as a function of the input. Then
$$\left| \sigma_r(a^\top H(x) + b) - R(a^\top H_R(x) + b) \right| \le \underbrace{\left| \sigma_r(a^\top H(x) + b) - \sigma_r(a^\top H_R(x) + b) \right|}_{\text{first term}} + \underbrace{\left| \sigma_r(a^\top H_R(x) + b) - R(a^\top H_R(x) + b) \right|}_{\text{second term}}.$$
For the first term, note, since $\sigma_r$ is 1-Lipschitz and by Hölder's inequality, that
$$\left| a^\top (H(x) - H_R(x)) \right| \le \|a\|_1\, \|H(x) - H_R(x)\|_\infty,$$
meaning this term has been reduced to the inductive hypothesis since $\|a\|_1 \le 1$. For the second term, if $a^\top H_R(x) + b$ can be shown to lie in $[-1, +1]$ (which is another easy induction), then the second term is just the error between $R$ and $\sigma_r$ on the same input.

*Figure 4. Polynomial and rational fit to $\Delta$.*
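The following is a small numerical sketch (not from the paper, though it follows Section 2.1 directly): it builds $N_r$, forms the rational approximation to $|x|$, converts it to a rational ReLU via eq. (2.1), and reports the observed uniform error; the printed value of $3\exp(-\sqrt{r})$ is only a reference point.

```python
# A minimal numerical sketch of Section 2.1; the printed 3*exp(-sqrt(r)) is a
# reference value, not a claim about the exact constant in the bound.
import numpy as np

def newman(x, r):
    """Newman polynomial N_r(x) = prod_{i=1}^r (x + exp(-i/sqrt(r)))."""
    x = np.asarray(x, dtype=float)
    return np.prod([x + np.exp(-i / np.sqrt(r)) for i in range(1, r + 1)], axis=0)

def rational_abs(x, r):
    """Rational approximation to |x| on [-1, +1]: x (N_r(x) - N_r(-x)) / (N_r(x) + N_r(-x))."""
    num, rev = newman(x, r), newman(-x, r)
    return x * (num - rev) / (num + rev)

def rational_relu(x, r):
    """Rational approximation to the ReLU via eq. (2.1): (x + |x|) / 2."""
    return (x + rational_abs(x, r)) / 2.0

xs = np.linspace(-1.0, 1.0, 10001)
for r in (5, 9, 13, 25, 49):
    abs_err = np.max(np.abs(rational_abs(xs, r) - np.abs(xs)))
    relu_err = np.max(np.abs(rational_relu(xs, r) - np.maximum(xs, 0.0)))
    print(f"r = {r:2d}: |x| error ~ {abs_err:.2e}, ReLU error ~ {relu_err:.2e}, "
          f"reference 3*exp(-sqrt(r)) = {3 * np.exp(-np.sqrt(r)):.2e}")
```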
### 2.3. Proof of part 2 of Theorem 1.1

It is now easy to find a rational function that approximates a neural network, and to then bound its size. The first step, via Lemma 1.3, is to replace each $\sigma_r$ with a rational function $R$ of low degree (this last bit using Newman polynomials). The second step is to inductively collapse the network into a single rational function. The reason for the dependence on the number of nodes $m$ is that, unlike polynomials, summing rational functions involves an increase in degree:
$$\frac{p_1(x)}{q_1(x)} + \frac{p_2(x)}{q_2(x)} = \frac{p_1(x) q_2(x) + p_2(x) q_1(x)}{q_1(x) q_2(x)}.$$

### 2.4. Proof of Theorem 1.2

The final interesting bit is to show that the dependence on $m^l$ in part 2 of Theorem 1.1 (where $m$ is the number of nodes and $l$ is the number of layers) is tight. Recall the triangle function
$$\Delta(x) := \begin{cases} 2x & x \in [0, 1/2], \\ 2(1 - x) & x \in (1/2, 1], \\ 0 & \text{otherwise}. \end{cases}$$
The $k$-fold composition $\Delta^k$ is a piecewise affine function with $2^{k-1}$ regularly spaced peaks (Telgarsky, 2015). This function was demonstrated to be inapproximable by shallow networks of subexponential size, and now it can be shown to be a hard case for rational approximation as well.

Consider the horizontal line through $y = 1/2$. The function $\Delta^k$ will cross this line $2^k$ times. Now consider a rational function $f(x) = p(x)/q(x)$. The set of points where $f(x) = 1/2$ corresponds to points where $2p(x) - q(x) = 0$. A poor estimate for the number of zeros is simply the degree of $2p - q$; however, since $f$ is univariate, a stronger tool becomes available: by Descartes' rule of signs, the number of zeros of $f - 1/2$ is upper bounded by the number of terms in $2p - q$.

## 3. Approximating rational functions with ReLU networks

This section will develop the proof of part 1 of Theorem 1.1, as well as the tightness result in Proposition 1.5.

### 3.1. Proving part 1 of Theorem 1.1

To establish part 1 of Theorem 1.1, the first step is to approximate polynomials with ReLU networks, and the second is to then approximate the division operation.

The representation of polynomials will be based upon constructions due to Yarotsky (2016). The starting point is the following approximation of the squaring function.

**Lemma 3.1** (Yarotsky, 2016). Let any $\epsilon > 0$ be given. There exists $f : [0, 1] \to [0, 1]$, represented as a ReLU network with $O(\ln(1/\epsilon))$ nodes and layers, such that $\sup_{x \in [0,1]} |f(x) - x^2| \le \epsilon$ and $f(0) = 0$.

Yarotsky's proof is beautiful and deserves mention. The approximation of $x^2$ is the function $f_k$, defined as
$$f_k(x) := x - \sum_{i=1}^{k} \frac{\Delta^i(x)}{4^i},$$
where $\Delta$ is the triangle map from Section 2. For every $k$, $f_k$ is a convex, piecewise-affine interpolation between points along the graph of $x^2$; going from $k$ to $k + 1$ does not adjust any of these interpolation points, but adds a new set of $O(2^k)$ interpolation points.

Once squaring is in place, multiplication comes via the polarization identity $xy = ((x + y)^2 - x^2 - y^2)/2$.

**Lemma 3.2** (Yarotsky, 2016). Let any $\epsilon > 0$ and $B \ge 1$ be given. There exists $g : [0, B]^2 \to [0, B^2]$, represented by a ReLU network with $O(\ln(B/\epsilon))$ nodes and layers, with
$$\sup_{x, y \in [0, B]} |g(x, y) - xy| \le \epsilon$$
and $g(x, y) = 0$ if $x = 0$ or $y = 0$.

Next, it follows that ReLU networks can efficiently approximate exponentiation, thanks to repeated squaring.

**Lemma 3.3.** Let $\epsilon \in (0, 1]$ and positive integer $y$ be given. There exists $h : [0, 1] \to [0, 1]$, represented by a ReLU network with $O(\ln(y/\epsilon)^2)$ nodes and layers, with
$$\sup_{x \in [0, 1]} |h(x) - x^y| \le \epsilon.$$

*Figure 5. Polynomial and rational fit to $\sigma_r$.*
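The following is a small numerical sketch (not from the paper) of two ingredients above: the triangle map $\Delta$, written here in one standard way with two ReLUs per step, and Yarotsky's partial sums $f_k$, whose uniform error against $x^2$ is checked on a grid; the error values are only an illustration of the rate.

```python
# A minimal sketch (not from the paper) of the triangle map of Section 2.4 and
# the squaring approximation of Lemma 3.1; the two-ReLU form of Delta below is
# one standard choice, valid on [0, 1], which is all that compositions need.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def triangle(x):
    # 2*relu(x) - 4*relu(x - 1/2) agrees with Delta on [0, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def triangle_iter(x, i):
    """The i-fold composition Delta^i."""
    for _ in range(i):
        x = triangle(x)
    return x

def yarotsky_square(x, k):
    """f_k(x) = x - sum_{i=1}^k Delta^i(x) / 4^i, a piecewise-affine approximation of x^2."""
    out = np.array(x, dtype=float)
    for i in range(1, k + 1):
        out = out - triangle_iter(x, i) / 4.0**i
    return out

xs = np.linspace(0.0, 1.0, 1001)
crossings = int(np.sum(np.diff(np.sign(triangle_iter(xs, 3) - 0.5)) != 0))
print("Delta^3 crosses y = 1/2", crossings, "times (expected 2^3 = 8)")
for k in (1, 2, 4, 8):
    err = np.max(np.abs(yarotsky_square(xs, k) - xs**2))
    print(f"k = {k}: max |f_k(x) - x^2| ~ {err:.2e}  (shrinks by about 4x per extra term)")
```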
With multiplication and exponentiation in place, a representation result for polynomials follows.

**Lemma 3.4.** Let $\epsilon \in (0, 1]$ be given. Let $p : [0, 1]^d \to [-1, +1]$ denote a polynomial with $s$ monomials, each with degree $r$ and scalar coefficient within $[-1, +1]$. Then there exists a function $q : [0, 1]^d \to [-1, +1]$, computed by a network of size
$$O\left( \min\left\{ sr \ln(sr/\epsilon),\ sd \ln(dsr/\epsilon)^2 \right\} \right),$$
which satisfies $\sup_{x \in [0,1]^d} |p(x) - q(x)| \le \epsilon$.

The remainder of the proof now focuses on the division operation. Since multiplication has been handled, it suffices to compute a single reciprocal.

**Lemma 3.5.** Let $\epsilon \in (0, 1]$ and nonnegative integer $k$ be given. There exists a ReLU network $q : [2^{-k}, 1] \to [1, 2^k]$, of depth $O(k^2 \ln(1/\epsilon)^2)$ and size $O(k^4 \ln(1/\epsilon)^3)$, such that
$$\sup_{x \in [2^{-k}, 1]} \left| q(x) - \frac{1}{x} \right| \le \epsilon.$$

This proof relies on two tricks. The first is to observe, for $x \in (0, 1]$, that
$$\frac{1}{x} = \frac{1}{1 - (1 - x)} = \sum_{i \ge 0} (1 - x)^i.$$
Thanks to the earlier development of exponentiation, truncating this summation gives an expression easily approximated by a neural network, as follows.

**Lemma 3.6.** Let $0 < a \le b$ and $\epsilon > 0$ be given. Then there exists a ReLU network $q : \mathbb{R} \to \mathbb{R}$ with $O(\ln(1/(a\epsilon))^2)$ layers and $O((b/a) \ln(1/(a\epsilon))^3)$ nodes satisfying
$$\sup_{x \in [a, b]} \left| q(x) - \frac{1}{x} \right| \le \epsilon.$$

Unfortunately, Lemma 3.6 differs from the desired statement, Lemma 3.5: inverting inputs lying within $[2^{-k}, 1]$ requires $O(2^k \ln(1/\epsilon)^2)$ nodes rather than $O(k^4 \ln(1/\epsilon)^3)$!

To obtain a good estimate with only $O(\ln(1/\epsilon))$ terms of the summation, it is necessary for the input $x$ to be bounded below by a positive constant (not depending on $k$). This leads to the second trick (which was also used by Beame et al. (1986)!). Consider, for a positive constant $c > 0$, the expression
$$\frac{1}{x} = \frac{c}{1 - (1 - cx)} = c \sum_{i \ge 0} (1 - cx)^i.$$
If $x$ is small, choosing a larger $c$ will cause this summation to converge more quickly. Thus, to compute $1/x$ accurately over a wide range of inputs, the solution here is to multiplex approximations of the truncated sum for many choices of $c$. In order to rely on the value of only one of them, it is possible to encode a large switch-style statement in a neural network. Notably, rational functions can also represent switch statements (Petrushev & Popov, 1987, Theorem 5.2), however polynomials cannot (otherwise they could approximate the ReLU more efficiently, seeing as it is a switch statement between 0 (a degree 0 polynomial) and $x$ (a degree 1 polynomial)).

**Lemma 3.7.** Let $\epsilon > 0$, $B \ge 1$, reals $a_0 \le a_1 \le \cdots \le a_n \le a_{n+1}$, and a function $f : [a_0, a_{n+1}] \to \mathbb{R}$ be given. Moreover, suppose that for each $i \in \{1, \ldots, n\}$ there exists a ReLU network $g_i : \mathbb{R} \to \mathbb{R}$ of size $m_i$ and depth $k_i$ with $g_i \in [0, B]$ along $[a_{i-1}, a_{i+1}]$ and
$$\sup_{x \in [a_{i-1}, a_{i+1}]} |g_i(x) - f(x)| \le \epsilon.$$
Then there exists a function $g : \mathbb{R} \to \mathbb{R}$, computed by a ReLU network of size $O(n \ln(B/\epsilon) + \sum_i m_i)$ and depth $O(\ln(B/\epsilon) + \max_i k_i)$, satisfying
$$\sup_{x \in [a_1, a_n]} |g(x) - f(x)| \le 3\epsilon.$$

### 3.2. Proof of Proposition 1.5

It remains to show that shallow networks have a hard time approximating the reciprocal map $x \mapsto 1/x$. This proof uses the same scheme as various proofs in (Telgarsky, 2016), which was also followed in more recent works (Yarotsky, 2016; Safran & Shamir, 2016): the idea is to first upper bound the number of affine pieces in ReLU networks of a certain size, and then to point out that each linear segment must make substantial error on a curved function, namely $1/x$. The proof is fairly brute force, and thus relegated to the appendices.

*Figure 6. Polynomial and rational fit to $\Delta^3$.*
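Before moving to the figures, here is a small numerical sketch (not from the paper) of the two reciprocal tricks from Section 3.1: the input is bucketed by magnitude (the role of the switch statement, performed exactly here rather than with the ReLU or rational switch of Lemma 3.7), a scale $c = 2^j$ is chosen so that $cx \in [1/2, 1]$, and the geometric series is truncated after $O(k + \ln(1/\epsilon))$ terms.

```python
# A minimal sketch (not from the paper) of the reciprocal construction in
# Section 3.1; the exact magnitude switch below stands in for the continuous
# ReLU / rational switch statement of Lemma 3.7.
import math

def approx_reciprocal(x, k, eps):
    """Approximate 1/x for x in [2**-k, 1] via c * sum_{i < n} (1 - c*x)**i."""
    assert 2.0**(-k) <= x <= 1.0
    # Switch: pick j with x in [2**-(j+1), 2**-j]; then c = 2**j gives c*x in [1/2, 1].
    j = min(k, max(0, math.floor(-math.log2(x))))
    c = 2.0**j
    # With |1 - c*x| <= 1/2, truncating after n terms leaves error at most c * 2**(1 - n),
    # so n on the order of j + ln(1/eps) terms suffices.
    n = max(1, math.ceil(j + 1 + math.log2(1.0 / eps)))
    return c * sum((1.0 - c * x)**i for i in range(n))

k, eps = 10, 1e-6
for x in (1.0, 0.3, 2.0**-5, 1.7 * 2.0**-9, 2.0**-10):
    approx = approx_reciprocal(x, k, eps)
    print(f"x = {x:.6g}: approx = {approx:.8g}, true = {1.0/x:.8g}, |err| = {abs(approx - 1.0/x):.2e}")
```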
## 4. Summary of figures

Throughout this work, a number of figures were presented to show not only the astonishing approximation properties of rational functions, but also the higher fidelity approximation achieved by both ReLU networks and rational functions as compared with polynomials. Of course, this is only a qualitative demonstration, but it still lends some intuition.

In all these demonstrations, rational functions and polynomials have degree 9 unless otherwise marked. ReLU networks have two hidden layers, each with 3 nodes. This is not exactly apples to apples (e.g., the rational function has twice as many parameters as the polynomial), but it is still reasonable, as most of the approximation literature fixes polynomial and rational degrees in comparisons.

Figure 1 shows the ability of all three classes to approximate a truncated reciprocal. Both rational functions and ReLU networks have the ability to form switch statements that let them approximate different functions on different intervals with low complexity (Petrushev & Popov, 1987, Theorem 5.2). Polynomials lack this ability; they cannot even approximate the ReLU well, despite its being a low-degree polynomial on each of two intervals.

Figure 2 shows that rational functions can fit the threshold function eerily well; the particular rational function used here is based on using Newman polynomials to approximate $(1 + |x|/x)/2$ (Newman, 1964).

Figure 3 shows the Newman polynomials $N_5$, $N_9$, $N_{13}$. As discussed in the text, they are unlike orthogonal polynomials, and they are used in all rational function approximations except Figure 1, which used a least squares fit.

Figure 4 shows that rational functions (via the Newman polynomials) fit $\Delta$ very well, whereas polynomials have trouble. These errors degrade sharply after recursing, namely when approximating $\Delta^3$ as in Figure 6.

Figure 5 shows how polynomials and rational functions fit the ReLU, where the ReLU representation, based on Newman polynomials, is the one used in the proofs here. Despite the apparent slow convergence of polynomials in this regime, the polynomial fit is still quite respectable.

## 5. Open problems

There are many next steps for this and related results.

1. Can rational functions, or some other approximating class, be used to more tightly bound the generalization properties of neural networks? Notably, the VC dimension analysis of sigmoid networks uses a conversion to polynomials (Anthony & Bartlett, 1999).

2. Can rational functions, or some other approximating class, be used to design algorithms for training neural networks? It does not seem easy to design reasonable algorithms for minimization over rational functions; if this is fundamental, and moreover in contrast with neural networks, it suggests an algorithmic benefit of neural networks.

3. Can rational functions, or some other approximating class, give a sufficiently refined complexity estimate of neural networks which can then be turned into a regularization scheme for neural networks?

## Acknowledgements

The author thanks Adam Klivans and Suvrit Sra for stimulating conversations. Adam Klivans and the author both thank Almare Gelato Italiano, in downtown Berkeley, for necessitating further stimulating conversations, but now on the topic of health and exercise. Lastly, the author thanks the University of Illinois, Urbana-Champaign, and the Simons Institute in Berkeley, for financial support during this work.

## References

Anthony, Martin and Bartlett, Peter L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Beame, Paul, Cook, Stephen A., and Hoover, H. James. Log depth circuits for division and related problems.
SIAM Journal on Computing, 15(4):994–1003, 1986.

Cohen, Nadav, Sharir, Or, and Shashua, Amnon. On the expressive power of deep learning: A tensor analysis. In COLT, 2016.

Cybenko, George. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Eldan, Ronen and Shamir, Ohad. The power of depth for feedforward neural networks. In COLT, 2016.

Goel, Surbhi, Kanade, Varun, Klivans, Adam, and Thaler, Justin. Reliably learning the ReLU in polynomial time. In COLT, 2017.

Liang, Shiyu and Srikant, R. Why deep neural networks for function approximation? In ICLR, 2017.

Lorentz, G. G., Golitschek, Manfred von, and Makovoz, Yuly. Constructive Approximation: Advanced Problems. Springer, 1996.

Newman, D. J. Rational approximation to |x|. Michigan Math. J., 11(1):11–14, 1964.

Paturi, Ramamohan and Saks, Michael E. Approximating threshold circuits by rational functions. Inf. Comput., 112(2):257–272, 1994.

Petrushev, Penco Petrov and Popov, Vasil A. Rational Approximation of Real Functions. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987.

Poggio, Tomaso, Mhaskar, Hrushikesh, Rosasco, Lorenzo, Miranda, Brando, and Liao, Qianli. Why and when can deep but not shallow networks avoid the curse of dimensionality: a review. 2017. arXiv:1611.00740 [cs.LG].

Safran, Itay and Shamir, Ohad. Depth separation in ReLU networks for approximating smooth non-linear functions. 2016. arXiv:1610.09887 [cs.LG].

Telgarsky, Matus. Representation benefits of deep feedforward networks. 2015. arXiv:1509.08101v2 [cs.LG].

Telgarsky, Matus. Benefits of depth in neural networks. In COLT, 2016.

Williamson, Robert C. and Bartlett, Peter L. Splines, rational functions and neural networks. In NIPS, 1991.

Yarotsky, Dmitry. Error bounds for approximations with deep ReLU networks. 2016. arXiv:1610.01145 [cs.LG].

Zolotarev, E. I. Application of elliptic functions to the problem of the functions of the least and most deviation from zero. Transactions Russian Acad. Sci., pp. 221, 1877.