# Rational neural networks

Nicolas Boullé, Mathematical Institute, University of Oxford, Oxford, OX2 6GG, UK (boulle@maths.ox.ac.uk)
Yuji Nakatsukasa, Mathematical Institute, University of Oxford, Oxford, OX2 6GG, UK (nakatsukasa@maths.ox.ac.uk)
Alex Townsend, Department of Mathematics, Cornell University, Ithaca, NY 14853, USA (townsend@cornell.edu)

**Abstract.** We consider neural networks with rational activation functions. The choice of the nonlinear activation function in deep learning architectures is crucial and heavily impacts the performance of a neural network. We establish optimal bounds in terms of network complexity and prove that rational neural networks approximate smooth functions more efficiently than ReLU networks with exponentially smaller depth. The flexibility and smoothness of rational activation functions make them an attractive alternative to ReLU, as we demonstrate with numerical experiments.

## 1 Introduction

Deep learning has become an important topic across many domains of science due to its recent success in image recognition, speech recognition, and drug discovery [23, 28, 29, 32]. Deep learning techniques are based on neural networks, which contain a certain number of layers that perform several mathematical transformations on the input. A nonlinear transformation of the input determines the output of each layer in the neural network:

$$x \mapsto \sigma(Wx + b),$$

where W is the weight matrix, b is a bias vector, and σ is a nonlinear function called the activation function (also called an activation unit). The computational cost of training a neural network depends on the total number of nodes (size) and the number of layers (depth). A key question in designing deep learning architectures is the choice of the activation function to reduce the number of trainable parameters of the network while keeping the same approximation power [17].

While smooth activation functions such as the sigmoid, logistic, or hyperbolic tangent are widely used, they suffer from the vanishing gradient problem [5] because their derivatives vanish for large inputs. Neural networks based on polynomial activation functions are an alternative [9, 11, 19, 20, 33, 52], but can be numerically unstable due to large gradients for large inputs [5]. Moreover, polynomials do not approximate non-smooth functions efficiently [51], which can lead to optimization issues in classification problems. A popular choice of activation function is the Rectified Linear Unit (ReLU), defined as ReLU(x) = max(x, 0) [26, 39]. It has numerous advantages, such as being fast to evaluate and zero for many inputs [16]. Many theoretical studies characterize and understand the expressivity of shallow and deep ReLU neural networks from the perspective of approximation theory [13, 31, 36, 49, 53].

ReLU networks also suffer from drawbacks, which are most evident during training. The main disadvantage is that the gradient of ReLU is zero for negative real numbers; therefore, its derivative is zero whenever the activation function is saturated [34]. To tackle these issues, several adaptations of ReLU have been proposed, such as the Leaky ReLU [34], Exponential Linear Unit (ELU) [10], Parametric Rectified Linear Unit (PReLU) [22], and Scaled Exponential Linear Unit (SELU) [27]. These modifications outperform ReLU in image classification applications, and some of these activation functions have trainable parameters, which are learned by gradient descent at the same time as the weights and biases of the network.
To obtain significant benefits for image classification and partial differential equation (PDE) solvers, one can perform an exhaustive search over trainable activation functions constructed from standard units [25, 48]. However, most of the exotic activation functions in the literature are motivated by empirical results and are not supported by theoretical statements on their potentially improved approximation power over ReLU.

In this work, we study rational neural networks, which are neural networks with activation functions that are trainable rational functions. In Section 3, we provide theoretical statements quantifying the advantages of rational neural networks over ReLU networks. In particular, we remark that a composition of low-degree rational functions has good approximation power but a relatively small number of trainable parameters. Therefore, we show that rational neural networks require fewer nodes and exponentially smaller depth than ReLU networks to approximate smooth functions to within a certain accuracy. This improved approximation power has practical consequences for large neural networks, given that a deep neural network is computationally expensive to train due to expensive gradient evaluations and slower convergence. The experiments conducted in Section 4 demonstrate the potential applications of these rational networks for solving PDEs and for Generative Adversarial Networks (GANs); all code and hyper-parameters are publicly available at [6]. The practical implementation of rational networks is straightforward in the TensorFlow framework and consists of replacing the activation functions by trainable rational functions. Finally, we highlight the main benefits of rational networks: the fast approximation of functions, the trainability of the activation parameters, and the smoothness of the activation function.

## 2 Rational neural networks

We consider neural networks whose activation functions consist of rational functions with trainable coefficients $a_i$ and $b_j$, i.e., functions of the form

$$F(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{i=0}^{r_P} a_i x^i}{\sum_{j=0}^{r_Q} b_j x^j}, \qquad a_{r_P} \neq 0, \quad b_{r_Q} \neq 0, \tag{1}$$

where $r_P$ and $r_Q$ are the polynomial degrees of the numerator and denominator, respectively. We say that F(x) is of type $(r_P, r_Q)$ and degree $\max(r_P, r_Q)$.

The use of rational functions in deep learning is motivated by the theoretical work of Telgarsky, who proved error bounds on the approximation of ReLU neural networks by high-degree rational functions and vice versa [50]. On the practical side, neural networks based on rational activation functions were considered by Molina et al. [37], who defined a safe Padé Activation Unit (PAU) as

$$F(x) = \frac{\sum_{i=0}^{r_P} a_i x^i}{1 + \bigl|\sum_{j=1}^{r_Q} b_j x^j\bigr|}.$$

The denominator is selected so that F(x) does not have poles located on the real axis. PAU networks can learn new activation functions and are competitive with state-of-the-art neural networks for image classification. However, this choice results in a non-smooth activation function and makes the gradient expensive to evaluate during training. In a closely related work, Chen et al. [8] propose high-degree rational activation functions in a neural network, which have benefits in terms of approximation power. However, this choice can significantly increase the number of parameters in the network, causing the training stage to be computationally expensive.
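For illustration, Equation (1) can be evaluated directly from its coefficient vectors. The following is a minimal NumPy sketch that we add here (it is not the paper's code); the coefficient values are arbitrary placeholders chosen so that the denominator has no real roots.

```python
import numpy as np

def rational(x, a, b):
    """Evaluate F(x) = P(x)/Q(x) as in Equation (1).

    a: numerator coefficients (a_0, ..., a_{r_P}), so r_P = len(a) - 1.
    b: denominator coefficients (b_0, ..., b_{r_Q}), so r_Q = len(b) - 1.
    """
    # np.polyval expects coefficients ordered from highest to lowest degree.
    P = np.polyval(a[::-1], x)
    Q = np.polyval(b[::-1], x)
    return P / Q

# Placeholder coefficients for a rational function of type (3, 2).
a = np.array([0.0, 1.0, 0.5, 0.25])   # a_0, a_1, a_2, a_3
b = np.array([1.0, 0.0, 0.5])         # b_0, b_1, b_2 (Q(x) = 1 + 0.5 x^2 > 0)
x = np.linspace(-1.0, 1.0, 5)
print(rational(x, a, b))
```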
In this paper, we use low-degree rational functions as activation functions, which are then composed together by the neural network to build high-degree rational functions. In this way, we can leverage the approximation power of high-degree rational functions without making training expensive. We highlight the approximation power of rational networks and provide optimal error bounds to demonstrate that rational neural networks theoretically outperform ReLU networks. Motivated by our theoretical results, we consider rational activation functions of type (3, 2), i.e., $r_P = 3$ and $r_Q = 2$. This type appears naturally in the theoretical analysis due to the composition property of Zolotarev sign functions (see Section 3.1): the degree of the overall rational function represented by the rational neural network is a whopping $3^{\#\text{layers}}$, while the number of trainable parameters only grows linearly with respect to the depth of the network. Moreover, a superdiagonal type (3, 2) allows the rational activation function to behave like a nonconstant linear function at ±∞, unlike a diagonal type, e.g., (2, 2), or the ReLU function. A low-degree activation function keeps the number of trainable parameters small, while the implicit composition in a neural network gives us the approximation power of high-degree rationals. This choice is also motivated empirically, and we do not claim that the type (3, 2) is the best choice for all situations, as the best configuration may depend on the application (see Figure 3 of the Supplementary Material). Our experiments on the approximation of smooth functions and GANs suggest that rational neural networks are an attractive alternative to ReLU networks (see Section 4). We observe that a good initialization, motivated by the theory of rational functions, prevents rational neural networks from having arbitrarily large values.

## 3 Theoretical results on rational neural networks

Here, we demonstrate the theoretical benefit of using neural networks based on rational activation functions due to their superiority over ReLU in approximating functions. We derive optimal bounds in terms of the total number of trainable parameters (also called size) needed by rational networks to approximate ReLU networks as well as functions in the Sobolev space $W^{n,\infty}([0,1]^d)$. Throughout this paper, we take ϵ to be a small parameter with 0 < ϵ < 1. We show that an ϵ-approximation on the domain $[-1,1]^d$ of a ReLU network by a rational neural network must have the following size (indicated in brackets):

$$\text{Rational}\;[\Omega(\log(\log(1/\epsilon)))] \;\leq\; \text{ReLU} \;\leq\; \text{Rational}\;[\mathcal{O}(\log(\log(1/\epsilon)))], \tag{2}$$

where the constants only depend on the size and depth of the ReLU network. Here, the upper bound means that all ReLU networks can be approximated to within ϵ by a rational network of size O(log(log(1/ϵ))). The lower bound means that there is a ReLU network that cannot be ϵ-approximated by a rational network of size less than C log(log(1/ϵ)), for some constant C > 0. In comparison, the size needed by a ReLU network to approximate a rational neural network to within a tolerance of ϵ is given by the following inequalities:

$$\text{ReLU}\;[\Omega(\log(1/\epsilon))] \;\leq\; \text{Rational} \;\leq\; \text{ReLU}\;[\mathcal{O}(\log(1/\epsilon)^3)], \tag{3}$$

where the constants only depend on the size and depth of the rational neural network.
This means that all rational networks can be approximated to within ϵ by a ReLU network of size O(log(1/ϵ)^3), while there is a rational network that cannot be ϵ-approximated by a ReLU network of size less than Ω(log(1/ϵ)). A comparison between (2) and (3) suggests that rational networks could be more resourceful than ReLU networks. A key difference between rational networks and neural networks with polynomial activation functions is that polynomials perform poorly on non-smooth functions such as ReLU, with an algebraic convergence of O(1/degree) [51] rather than the (root-)exponential convergence obtained with rationals (see Figure 1 (left)).

### 3.1 Approximation of ReLU networks by rational neural networks

Telgarsky showed that neural networks and rational functions can approximate each other, in the sense that there exists a rational function of degree O(polylog(1/ϵ)) that is ϵ-close to a ReLU network [50], where ϵ > 0 is a small number (a polylogarithmic function in x is any polynomial in log(x) and is denoted by polylog(x)). To prove this statement, Telgarsky used a rational function constructed with Newman polynomials [40] to obtain a rational approximation to the ReLU function that converges with square-root exponential accuracy. That is, Telgarsky needed a rational function of degree Ω(log(1/ϵ)^2) to achieve a tolerance of ϵ. A degree r rational function can be represented with 2(r + 1) coefficients, i.e., $a_0, \ldots, a_r$ and $b_0, \ldots, b_r$ in Equation (1). Therefore, the rational approximation to a ReLU network constructed by Telgarsky requires at least Ω(polylog(1/ϵ)) parameters. In contrast, for any rational function, Telgarsky showed that there exists a ReLU network of size O(polylog(1/ϵ)) that is an ϵ-approximation on $[0, 1]^d$.

Our key observation is that by composing low-degree rational functions together, we can approximate a ReLU network much more efficiently in terms of the size (rather than the degree) of the rational network. Our theoretical work is based on a family of rationals called Zolotarev sign functions, which are the best rational approximations on $[-1, -\ell] \cup [\ell, 1]$, with $0 < \ell < 1$, to the sign function [3, 43], defined as

$$\mathrm{sign}(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0. \end{cases}$$

A composition of $k \geq 1$ Zolotarev sign functions of type (3, 2) has type $(3^k, 3^k - 1)$ but can be represented with 7k parameters instead of $2 \cdot 3^k + 1$. This property enables the construction of a rational approximation to ReLU using compositions of low-degree Zolotarev sign functions with O(log(log(1/ϵ))) parameters in Lemma 1.

**Lemma 1** Let 0 < ϵ < 1. There exists a rational network $R : [-1, 1] \to [-1, 1]$ of size O(log(log(1/ϵ))) such that

$$\|R - \mathrm{ReLU}\|_\infty := \max_{x \in [-1,1]} |R(x) - \mathrm{ReLU}(x)| \leq \epsilon.$$

Moreover, no rational network of size smaller than Ω(log(log(1/ϵ))) can achieve this.

The proof of Lemma 1 (see Supplementary Material) shows that the given bound is optimal, in the sense that a rational network requires at least Ω(log(log(1/ϵ))) parameters to approximate the ReLU function on [−1, 1] to within the tolerance ϵ > 0. The convergence of the Zolotarev sign functions to the ReLU function is much faster, with respect to the number of parameters, than that of the rationals constructed with Newman polynomials (see Figure 1 (left)).
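The parameter count behind this composition property is elementary to verify. The short Python sketch below is an illustration we add here (it is not from the paper): it compares the 7k coefficients stored by a composition of k type (3, 2) rationals with the $2 \cdot 3^k + 1$ coefficients that the expanded rational of type $(3^k, 3^k - 1)$ would require.

```python
def composed_parameters(k):
    # k type-(3, 2) rational layers: 4 numerator + 3 denominator coefficients each.
    return 7 * k

def expanded_parameters(k):
    # The composition has type (3**k, 3**k - 1): (3**k + 1) + 3**k coefficients.
    return 2 * 3**k + 1

for k in range(1, 6):
    print(k, composed_parameters(k), expanded_parameters(k))
```

Already for k = 5 the composed representation stores 35 coefficients, while the expanded rational of degree 243 would need 487.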
Figure 1: Left: Approximation error $\|\mathrm{ReLU} - r_N\|_\infty$ of the Newman (blue), Zolotarev sign function (red), and best polynomial approximation [42] of degree N − 1 (green) approximants $r_N$ to ReLU, with respect to the number of parameters required to represent $r_N$. Right: Best rational function of type (3, 2) (red) that approximates the ReLU function (blue). We use this to initialize the rational activation functions when training a rational neural network.

The converse of Lemma 1, which is a consequence of a theorem proved by Telgarsky [50, Theorem 1.1], shows that any rational function can be approximated by a ReLU network of size at most O(log(1/ϵ)^3).

**Lemma 2** Let 0 < ϵ < 1. If $R : [-1, 1] \to [-1, 1]$ is a rational function, then there exists a ReLU network $f : [-1, 1] \to [-1, 1]$ of size O(log(1/ϵ)^3) such that $\|R - f\|_\infty \leq \epsilon$.

To put the improved approximation power of rational neural networks over ReLU networks (O(log(log(1/ϵ))) versus O(log(1/ϵ)^3)) into perspective, it is known that a ReLU network that approximates x², which is rational, to within ϵ on [−1, 1] must be of size at least Ω(log(1/ϵ)) [31, Theorem 11].

We can now state our main theorem based on Lemmas 1 and 2. Theorem 3 provides bounds on the approximation power of ReLU networks by rational neural networks and vice versa. We regard Theorem 3 as an analogue of [50, Theorem 1.1] for our Zolotarev sign functions, where we count the number of trainable parameters instead of the degree of the rational functions. In particular, our rational networks have high degrees but can be represented with few parameters due to compositions, making training more computationally efficient. While Telgarsky required a rational function with $\mathcal{O}(k^M \log(M/\epsilon)^M)$ parameters to approximate a ReLU network with fewer than k nodes in each of its M layers to within a tolerance of ϵ, we construct a rational network that only has size O(kM log(log(M/ϵ))).

**Theorem 3** Let 0 < ϵ < 1 and let $\|\cdot\|_1$ denote the vector 1-norm. The following two statements hold:

1. Let $R : [-1, 1]^d \to [-1, 1]$ be a rational network with M layers and at most k nodes per layer, where each node computes $x \mapsto r(a^\top x + b)$ and r is a rational function with Lipschitz constant L (a, b, and r are possibly distinct across nodes). Suppose further that $\|a\|_1 + |b| \leq 1$ and $r : [-1, 1] \to [-1, 1]$. Then, there exists a ReLU network $f : [-1, 1]^d \to [-1, 1]$ of size $\mathcal{O}\!\left(kM \log(M L^M/\epsilon)^3\right)$ such that $\max_{x \in [-1,1]^d} |R(x) - f(x)| \leq \epsilon$.

2. Let $f : [-1, 1]^d \to [-1, 1]$ be a ReLU network with M layers and at most k nodes per layer, where each node computes $x \mapsto \mathrm{ReLU}(a^\top x + b)$ and the pair (a, b) (possibly distinct across nodes) satisfies $\|a\|_1 + |b| \leq 1$. Then, there exists a rational network $R : [-1, 1]^d \to [-1, 1]$ of size $\mathcal{O}(kM \log(\log(M/\epsilon)))$ such that $\max_{x \in [-1,1]^d} |f(x) - R(x)| \leq \epsilon$.

Theorem 3 highlights the improved approximation power of rational neural networks over ReLU networks: ReLU networks of size O(polylog(1/ϵ)) are required to approximate rational networks, while rational networks of size only O(log(log(1/ϵ))) are sufficient to approximate ReLU networks.

### 3.2 Approximation of functions by rational networks

A popular question is the required size and depth of deep neural networks to approximate smooth functions [31, 38, 53]. In this section, we consider the approximation theory of rational networks.
In particular, we consider the approximation of functions in the Sobolev space $W^{n,\infty}([0,1]^d)$, where $n \geq 1$ is the regularity of the functions and $d \geq 1$. The norm of a function $f \in W^{n,\infty}([0,1]^d)$ is defined as

$$\|f\|_{W^{n,\infty}([0,1]^d)} = \max_{|\mathbf{n}| \leq n} \; \operatorname*{ess\,sup}_{x \in [0,1]^d} |D^{\mathbf{n}} f(x)|,$$

where $\mathbf{n}$ is the multi-index $\mathbf{n} = (n_1, \ldots, n_d) \in \{0, \ldots, n\}^d$, and $D^{\mathbf{n}} f$ is the corresponding weak derivative of f. In this section, we consider the approximation of functions from the set

$$\mathcal{F}_{d,n} := \bigl\{ f \in W^{n,\infty}([0,1]^d) : \|f\|_{W^{n,\infty}([0,1]^d)} \leq 1 \bigr\}.$$

By the Sobolev embedding theorem [7], this space contains the functions in $C^{n-1}([0,1]^d)$, which is the class of functions whose first n − 1 derivatives are Lipschitz continuous. Yarotsky derived upper bounds on the size of neural networks with piecewise linear activation functions needed to approximate functions in $\mathcal{F}_{d,n}$ [53, Theorem 1]. In particular, he constructed an ϵ-approximation to functions in $\mathcal{F}_{d,n}$ with a ReLU network of size at most $\mathcal{O}(\epsilon^{-d/n}\log(1/\epsilon))$ and depth smaller than O(log(1/ϵ)). The term $\epsilon^{-d/n}$ is introduced by a local Taylor approximation, while the log(1/ϵ) term is the size of the ReLU network needed to approximate the monomials, i.e., $x^j$ for $j \geq 0$, in the Taylor series expansion. We now present an analogue of Yarotsky's theorem for rational neural networks.

**Theorem 4** Let $d \geq 1$, $n \geq 1$, 0 < ϵ < 1, and $f \in \mathcal{F}_{d,n}$. There exists a rational neural network R of size $\mathcal{O}(\epsilon^{-d/n}\log(\log(1/\epsilon)))$ and maximum depth O(log(log(1/ϵ))) such that $\|f - R\|_\infty \leq \epsilon$.

The proof of Theorem 4 consists of approximating f by a local Taylor expansion. One needs to approximate the piecewise linear functions and monomials arising in the Taylor expansion by rational networks using Lemma 1 and Proposition 6 (see Supplementary Material). The main distinction between Yarotsky's argument and the proof of Theorem 4 is that monomials can be represented by rational neural networks with a size that does not depend on the accuracy ϵ, whereas ReLU networks require O(log(1/ϵ)) parameters. Conversely, while ReLU neural networks can represent piecewise linear functions exactly with a constant number of parameters, rational networks approximate them with a size of at most O(log(log(1/ϵ))) (see Lemma 1). That is, rational neural networks approximate piecewise linear functions much faster than ReLU networks approximate polynomials. This allows the existence of a rational network approximation to f with exponentially smaller depth (O(log(log(1/ϵ)))) than the ReLU networks constructed by Yarotsky.

A theorem proved by DeVore et al. [13] gives a lower bound of $\Omega(\epsilon^{-d/n})$ on the number of parameters needed by a neural network to express any function in $\mathcal{F}_{d,n}$ with an error ϵ, under the assumption that the weights are chosen continuously. Comparing $\mathcal{O}(\epsilon^{-d/n}\log(\log(1/\epsilon)))$ and $\mathcal{O}(\epsilon^{-d/n}\log(1/\epsilon))$, we find that rational neural networks require exponentially fewer nodes than ReLU networks, with respect to the optimal bound of $\Omega(\epsilon^{-d/n})$, to approximate functions in $\mathcal{F}_{d,n}$.

## 4 Experiments using rational neural networks

In this section, we consider neural networks with trainable rational activation functions of type (3, 2). We select the type (3, 2) based on empirical performance; roughly, a low-degree (but higher than 1) rational function is ideal for generating high-degree rational functions by composition with a small number of parameters. The rational activation units can be easily implemented in the open-source TensorFlow library [2] by using the polyval and divide commands for function evaluations.
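Below is a minimal sketch, not the authors' implementation (their code is available at [6]), of how a type (3, 2) rational activation could be written as a Keras layer with `tf.math.polyval` and `tf.math.divide`. The initial coefficient values here are arbitrary placeholders; the paper instead initializes with the best (3, 2) approximation to ReLU (Figure 1 (right) and Table 1 of the Supplementary Material).

```python
import tensorflow as tf

class RationalActivation(tf.keras.layers.Layer):
    """Trainable rational activation of type (3, 2): F(x) = P(x)/Q(x)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Placeholder initial coefficients, ordered from highest degree to
        # lowest as expected by tf.math.polyval. The paper initializes these
        # with the minimax (3, 2) approximation to ReLU instead.
        self.p_init = [0.1, 0.5, 0.5, 0.1]   # numerator a_3, a_2, a_1, a_0
        self.q_init = [0.1, 0.0, 1.0]        # denominator b_2, b_1, b_0 (no real roots)

    def build(self, input_shape):
        self.p = self.add_weight(
            name="numerator", shape=(4,), trainable=True,
            initializer=tf.constant_initializer(self.p_init))
        self.q = self.add_weight(
            name="denominator", shape=(3,), trainable=True,
            initializer=tf.constant_initializer(self.q_init))

    def call(self, x):
        num = tf.math.polyval(tf.unstack(self.p), x)
        den = tf.math.polyval(tf.unstack(self.q), x)
        return tf.math.divide(num, den)
```

In a model, `RationalActivation()` would simply take the place of a ReLU unit; following the paper, one instance (7 trainable coefficients) can be shared per layer rather than per node.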
The coefficients of the numerators and denominators of the rational activation functions are trainable parameters, determined at the same time as the weights and biases of the neural network by backpropagation and a gradient descent optimization algorithm. One crucial question is the initialization of the coefficients of the rational activation functions [8, 37]. A badly initialized rational function might contain poles on the real axis, leading to exploding values, or converge to a local minimum in the optimization process. Our experiments, supported by the empirical results of Molina et al. [37], show that initializing each rational function with the best rational approximation to the ReLU function (as described in Lemma 1) produces good performance. The underlying idea is to initialize rational networks near a network with ReLU activation functions, which are widely used for deep learning. The adaptivity of the rational functions then allows for further improvements during the training phase. The initial rational function used in our experiments is shown in Figure 1 (right). The coefficients of this function are obtained by using the minimax command, available in the Chebfun software [14, 15] for numerically computing rational approximations (see Table 1 in the Supplementary Material). In the following experiments, we use a single rational activation function of type (3, 2) at each layer, instead of different functions at each node, to reduce the number of trainable parameters and the computational training expense.

### 4.1 Approximation of functions

Raissi, Perdikaris, and Karniadakis [45, 46] introduce a framework called deep hidden physics models for discovering nonlinear partial differential equations (PDEs) from observations. This technique requires solving the following interpolation problem: given observation data $(u_i)_{1 \leq i \leq N}$ at the spatio-temporal points $(x_i, t_i)_{1 \leq i \leq N}$, find a neural network $\mathcal{N}$ (called the identification network) that minimizes the loss function

$$\sum_{i=1}^{N} |\mathcal{N}(x_i, t_i) - u_i|^2. \tag{4}$$

This technique has successfully discovered hidden models in fluid mechanics [47], solid mechanics [21], and nonlinear partial differential equations such as the Korteweg–de Vries (KdV) equation [46]. Raissi et al. use an identification network consisting of 4 layers and 50 nodes per layer to interpolate samples from a solution to the KdV equation. Moreover, they observe that networks based on smooth activation functions, such as the hyperbolic tangent (tanh(x)) or the sinusoid (sin(x)), outperform ReLU neural networks [45, 46]. However, the performance of these smooth activation functions highly depends on the application, and they might not be adapted to approximate non-smooth or highly oscillatory solutions. Recently, Jagtap, Kawaguchi, and Karniadakis [25] proposed and analyzed different adaptive activation functions to approximate smooth and discontinuous functions with physics-informed neural networks. More specifically, they use adaptive versions of classical activation functions such as the sigmoid, hyperbolic tangent, ReLU, and Leaky ReLU. The choice of these trainable activation functions introduces another parameter in the design of the neural network architecture, which may not be ideal for a black-box data-driven PDE solver.
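For concreteness, the interpolation problem (4) can be set up as in the minimal sketch below, which we add for illustration (it is not the authors' code). It builds a fully connected identification network with 4 hidden layers of 50 nodes, each followed by the `RationalActivation` layer sketched earlier, and minimizes the mean squared error, which equals the loss (4) up to a 1/N factor. The arrays `x`, `t`, and `u` stand in for the observation data and are random placeholders here.

```python
import numpy as np
import tensorflow as tf

def identification_network():
    """Fully connected network (x, t) -> u with rational activations."""
    inputs = tf.keras.Input(shape=(2,))       # spatio-temporal point (x_i, t_i)
    h = inputs
    for _ in range(4):                        # 4 hidden layers of 50 nodes
        h = tf.keras.layers.Dense(50)(h)
        h = RationalActivation()(h)           # shared type (3, 2) activation per layer
    outputs = tf.keras.layers.Dense(1)(h)     # predicted u(x_i, t_i)
    return tf.keras.Model(inputs, outputs)

# Placeholder observation data (x_i, t_i, u_i), e.g. samples of a KdV solution.
x, t, u = np.random.rand(1000), np.random.rand(1000), np.random.rand(1000)

model = identification_network()
model.compile(optimizer="adam", loss="mse")   # MSE corresponds to the loss (4)
model.fit(np.stack([x, t], axis=1), u, epochs=10, batch_size=128, verbose=0)
```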
Figure 2: Solution to the KdV equation used as training data (left), and validation loss of ReLU (blue), sinusoid (green), rational (red), and polynomial (purple) neural networks with respect to the number of optimization steps (right).

We illustrate that rational neural networks can address the issues mentioned above due to their adaptivity and approximation power (see Section 3). Similarly to Raissi [45], we use a solution u to the KdV equation,

$$u_t = -u u_x - u_{xxx}, \qquad u(x, 0) = -\sin(\pi x/20),$$

as training data for the identification network (see the left panel of Figure 2). We train and compare four neural networks, which contain ReLU, sinusoid, rational, and polynomial activation functions, respectively (details of the parameters used for this experiment are available in the Supplementary Material). The mean squared error (MSE) of the neural networks on the validation set throughout the training phase is reported in the right panel of Figure 2. We observe that the rational neural network outperforms the sinusoid network, despite having the same asymptotic convergence rate. The network with polynomial activation functions (chosen to be of degree 3 in this example) is harder to train than the rational network, as shown by its non-smooth validation loss (see the right panel of Figure 2). We highlight that rational neural networks are never much bigger, in terms of trainable parameters, than ReLU networks, since the increase is only linear with respect to the number of layers. Here, the ReLU network has 8000 parameters (consisting of weights and biases), while the rational network has 8000 + 7 × #layers = 8035 parameters. The ReLU, sinusoid, rational, and polynomial networks achieve the following mean squared errors after 10⁴ epochs:

$$\mathrm{MSE}(u_{\mathrm{ReLU}}) = 1.9 \times 10^{-4}, \quad \mathrm{MSE}(u_{\mathrm{sin}}) = 3.3 \times 10^{-6}, \quad \mathrm{MSE}(u_{\mathrm{rat}}) = 1.2 \times 10^{-7}, \quad \mathrm{MSE}(u_{\mathrm{poly}}) = 3.6 \times 10^{-5}.$$

The absolute approximation errors between the different neural networks and the exact solution to the KdV equation are illustrated in Figure 2 of the Supplementary Material. The rational neural network is approximately five times more accurate than the sinusoid network used by Raissi and twenty times more accurate than the ReLU network. Moreover, the approximation errors made by the ReLU network are not uniformly distributed in space and time and are located in specific regions, indicating that a network with non-smooth activation functions is not appropriate for resolving smooth solutions to PDEs.

### 4.2 Generative adversarial networks

Generative adversarial networks (GANs) are used to generate fake examples from an existing dataset [18]. They usually consist of two networks: a generator to produce fake samples and a discriminator to evaluate the samples of the generator against the training dataset. Radford et al. [44] describe deep convolutional generative adversarial networks (DCGANs) to build good image representations using convolutional architectures. They evaluate their model on the MNIST and ImageNet image datasets [12, 30]. This section highlights the simplicity of using rational activation functions in existing neural network architectures by training an Auxiliary Classifier GAN (ACGAN) [41] on the MNIST dataset. In particular, the neural network, denoted by ReLU network in this section, consists of convolutional generator and discriminator networks with ReLU and Leaky ReLU [34] activation units, respectively, and is used as a reference GAN (we use the TensorFlow implementation available at [1] and provide extended details and results of the experiment in the Supplementary Material).
As in the experiment described in Section 4.1, we replace the activation units of the generator and discriminator networks by a rational function with trainable coefficients (see Figure 1). We initialize the activation functions in the training phase with the best rational function that approximates the ReLU function on [−1, 1].

Figure 3: Digits generated by a ReLU (top) and rational (bottom) auxiliary classifier generative adversarial network. The right panel contains samples from the first five classes of the MNIST dataset for comparison.

We show images of digits from the first five classes generated by the ReLU and rational GANs at different epochs of the training in Figure 3 (the samples are generated randomly and are not manually selected). We observe that the rational network can generate realistic images with a broader range of features than the ReLU network, as illustrated by the presence of bold digits at epoch 20 in the bottom panel of Figure 3. However, the digit ones generated by the rational network are identical, suggesting that the rational GAN suffers from mode collapse. It should be noted that generative adversarial networks are notoriously tricky to train [17]. The hyper-parameters of the reference model are intensively tuned for piecewise linear activation functions (as shown by the use of Leaky ReLU in the discriminator network). Moreover, many stabilization methods have been proposed to resolve the mode collapse and non-convergence issues in training, such as Wasserstein GANs [4], Unrolled Generative Adversarial Networks [35], and batch normalization [24]. These techniques could be explored and combined with rational networks to address the mode collapse observed in this experiment.

## 5 Conclusions

We have investigated rational neural networks, which are neural networks with smooth trainable activation functions based on rational functions. Theoretical statements demonstrate the improved approximation power of rational networks in comparison with ReLU networks. In practice, it seems beneficial to select the activation functions as very low-degree rationals, which keeps training computationally efficient. We emphasize that rational networks, together with their trainable activation functions, are simple to implement in existing deep learning frameworks such as TensorFlow. There are many future research directions exploring the potential applications of rational networks in fields such as image classification, time series forecasting, and generative adversarial networks. These applications already employ nonstandard activation functions to overcome various drawbacks of ReLU. Another exciting and promising field is the numerical solution and data-driven discovery of partial differential equations with deep learning. We believe that popular techniques such as physics-informed neural networks [46] could benefit from rational neural networks to improve the robustness and performance of PDE solvers, both from a theoretical and a practical viewpoint.

## Broader Impact

Neural networks have applications in diverse fields such as facial recognition, credit-card fraud detection, speech recognition, and medical diagnosis.
There is a growing understanding of the approximation power of neural networks, which is adding theoretical justification to their use in societal applications. We are particularly interested in the future applicability of rational neural networks to the discovery and solution of partial differential equations (PDEs). Neural networks, and in particular rational neural networks, have the potential to revolutionize fields where PDE models derived from mechanistic principles are lacking.

## Acknowledgments and Disclosure of Funding

The authors thank the National Institute of Informatics (Japan) for funding a research visit, during which this project was initiated. We thank Gilbert Strang for making us aware of Telgarsky's paper [50]. We also thank Matthew Colbrook and Nick Trefethen for their suggestions on the paper. This work is supported by the EPSRC Centre For Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) in collaboration with Simula Research Laboratory. The work of the third author is supported by the National Science Foundation grant no. 1818757.

## References

[1] Keras GitHub repository, mnist_acgan example. https://github.com/keras-team/keras, 2019.

[2] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Conference on Operating Systems Design and Implementation, pages 265–283, 2016.

[3] Naum I. Achieser. Theory of Approximation. Courier Corporation, 2013.

[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 214–223, 2017.

[5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw., 5(2):157–166, 1994.

[6] Nicolas Boullé, Yuji Nakatsukasa, and Alex Townsend. GitHub repository. https://github.com/NBoulle/RationalNets/, 2020.

[7] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media, 2010.

[8] Zhiqian Chen, Feng Chen, Rongjie Lai, Xuchao Zhang, and Chang-Tien Lu. Rational neural networks for approximating graph convolution operator on jump discontinuities. In IEEE International Conference on Data Mining (ICDM), pages 59–68, 2018.

[9] Xi Cheng, Bohdan Khomtchouk, Norman Matloff, and Pete Mohanty. Polynomial regression as an alternative to neural nets. arXiv preprint arXiv:1806.06850, 2018.

[10] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[11] Joseph Daws Jr. and Clayton G. Webster. A polynomial-based approach for architectural design and learning with deep neural networks. arXiv preprint arXiv:1905.10457, 2019.

[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[13] Ronald A. DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. Manuscripta Math., 63(4):469–478, 1989.

[14] Tobin A. Driscoll, Nicholas Hale, and Lloyd N. Trefethen. Chebfun Guide, 2014.
[15] Silviu-Ioan Filip, Yuji Nakatsukasa, Lloyd N. Trefethen, and Bernhard Beckermann. Rational minimax approximation via adaptive barycentric representations. SIAM J. Sci. Comput., 40(4):A2427–A2455, 2018.

[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.

[19] Mohit Goyal, Rajan Goyal, and Brejesh Lall. Learning activation functions: A new paradigm of understanding neural networks. arXiv preprint arXiv:1906.09529, 2019.

[20] Stefano Guarnieri, Francesco Piazza, and Aurelio Uncini. Multilayer feedforward networks with adaptive spline activation function. IEEE Trans. Neural Netw., 10(3):672–683, 1999.

[21] Ehsan Haghighat, Maziar Raissi, Adrian Moure, Hector Gomez, and Ruben Juanes. A deep learning framework for solution and discovery in solid mechanics: Linear elasticity. arXiv preprint arXiv:2003.02751, 2020.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.

[23] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012.

[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456, 2015.

[25] Ameya D. Jagtap, Kenji Kawaguchi, and George E. Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. J. Comput. Phys., 404:109136, 2020.

[26] Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2146–2153. IEEE, 2009.

[27] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 971–980, 2017.

[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[30] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[31] Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.

[32] Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model., 55(2):263–274, 2015.
[33] Liying Ma and Khashayar Khorasani. Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Trans. Neural Netw., 16(4):821–833, 2005.

[34] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 30, page 3, 2013.

[35] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.

[36] Hrushikesh N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput., 8(1):164–177, 1996.

[37] Alejandro Molina, Patrick Schramowski, and Kristian Kersting. Padé activation units: End-to-end learning of flexible activation functions in deep networks. In International Conference on Learning Representations (ICLR), 2019.

[38] Hadrien Montanelli, Haizhao Yang, and Qiang Du. Deep ReLU networks overcome the curse of dimensionality for bandlimited functions. arXiv preprint arXiv:1903.00735, 2019.

[39] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.

[40] Donald J. Newman. Rational approximation to |x|. Mich. Math. J., 11(1):11–14, 1964.

[41] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 2642–2651, 2017.

[42] Ricardo Pachón and Lloyd N. Trefethen. Barycentric-Remez algorithms for best polynomial approximation in the chebfun system. BIT, 49(4):721, 2009.

[43] Penco P. Petrushev and Vasil A. Popov. Rational Approximation of Real Functions, volume 28. Cambridge University Press, 2011.

[44] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[45] Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. J. Mach. Learn. Res., 19(1):932–955, 2018.

[46] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys., 378:686–707, 2019.

[47] Maziar Raissi, Alireza Yazdani, and George E. Karniadakis. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030, 2020.

[48] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[49] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[50] Matus Telgarsky. Neural networks and rational functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 3387–3393, 2017.

[51] Lloyd N. Trefethen. Approximation Theory and Approximation Practice. SIAM, 2013.

[52] Lorenzo Vecci, Francesco Piazza, and Aurelio Uncini. Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Netw., 11(2):259–270, 1998.

[53] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017.