Elementary Superexpressive Activations

Dmitry Yarotsky¹

¹Skolkovo Institute of Science and Technology, Moscow, Russia. Correspondence to: Dmitry Yarotsky.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We call a finite family of activation functions superexpressive if any multivariate continuous function can be approximated by a neural network that uses these activations and has a fixed architecture only depending on the number of input variables (i.e., to achieve any accuracy we only need to adjust the weights, without increasing the number of neurons). Previously, it was known that superexpressive activations exist, but their form was quite complex. We give examples of very simple superexpressive families: for example, we prove that the family {sin, arcsin} is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.

1. Introduction

In the study of approximation by neural networks, an interesting fact is the existence of activation functions that allow one to approximate any continuous function on a given compact domain with arbitrary accuracy by a network with a finite, fixed architecture independent of the function and the accuracy (i.e., merely by adjusting the network weights, without increasing the number of neurons). We will refer to this property as superexpressiveness. The existence of superexpressive activations can be seen as a consequence of a result of (Maiorov & Pinkus, 1999).

Theorem 1 (Maiorov & Pinkus, 1999). There exists an activation function σ which is real analytic, strictly increasing, sigmoidal (i.e., $\lim_{x\to-\infty}\sigma(x)=0$ and $\lim_{x\to+\infty}\sigma(x)=1$), and such that any f ∈ C([0,1]^d) can be uniformly approximated with any accuracy by expressions

$$\sum_{i=1}^{6d+3} d_i\,\sigma\Big(\sum_{j=1}^{3d} c_{ij}\,\sigma\Big(\sum_{k=1}^{d} w_{ijk} x_k + \theta_{ij}\Big) + \gamma_i\Big)$$

with some parameters d_i, c_{ij}, w_{ijk}, θ_{ij}, γ_i.

The proof of this theorem includes two essential steps. In the first step, the result is proved for univariate functions (i.e., for d = 1). The key idea here is to use the separability of the space C([0,1]) and construct a (quite complicated) activation by joining all the functions from some dense countable subset. In the second step, one reduces the multivariate case to the univariate one by using the Kolmogorov Superposition Theorem (KST).
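To make the meaning of a "fixed architecture" concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) evaluates the parametric form appearing in Theorem 1 for an arbitrary callable activation; the function name, the array layout and the use of tanh as a stand-in are our assumptions, since the activation whose existence Theorem 1 asserts is not elementary.

```python
import numpy as np

def maiorov_pinkus_form(x, sigma, W, theta, C, gamma, d_out):
    """Evaluate sum_i d_i * sigma( sum_j c_ij * sigma( sum_k w_ijk x_k + theta_ij ) + gamma_i ).

    x      : (d,) input point
    sigma  : callable activation (placeholder for the non-elementary activation of Theorem 1)
    W      : (6d+3, 3d, d) inner weights w_ijk
    theta  : (6d+3, 3d)    inner biases  theta_ij
    C      : (6d+3, 3d)    outer weights c_ij
    gamma  : (6d+3,)       outer biases  gamma_i
    d_out  : (6d+3,)       output weights d_i
    """
    hidden1 = sigma(np.einsum("ijk,k->ij", W, x) + theta)   # shape (6d+3, 3d)
    hidden2 = sigma(np.sum(C * hidden1, axis=1) + gamma)    # shape (6d+3,)
    return float(np.dot(d_out, hidden2))

# Example with d = 2 and sigma = tanh standing in for the (unknown) activation:
# only the numerical values of the parameters would change to improve the approximation.
d = 2
rng = np.random.default_rng(0)
params = dict(W=rng.normal(size=(6 * d + 3, 3 * d, d)), theta=rng.normal(size=(6 * d + 3, 3 * d)),
              C=rng.normal(size=(6 * d + 3, 3 * d)), gamma=rng.normal(size=(6 * d + 3,)),
              d_out=rng.normal(size=(6 * d + 3,)))
print(maiorov_pinkus_form(np.array([0.3, 0.7]), np.tanh, **params))
```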
Though the superexpressive activations constructed in the proof of Theorem 1 have the nice properties of analyticity, monotonicity and boundedness, they are nevertheless quite complex and non-elementary: at least, they are not known to be representable in terms of finitely many elementary functions. See the papers (Ismailov, 2014; Guliyev & Ismailov, 2016; 2018a;b) for refinements and algorithmic aspects of such and similar activations, as well as the papers (Kůrková, 1991; 1992; Igelnik & Parikh, 2003; Montanelli & Yang, 2020; Schmidt-Hieber, 2020) for further connections between the KST and neural networks.

There is, however, another line of research in which some weaker forms of superexpressiveness have been recently established for elementary (or otherwise simple) activations. The weaker form means that the network must grow to achieve higher accuracy, but this growth is much slower than the power laws expected from abstract approximation theory under standard regularity assumptions (DeVore et al., 1989). In particular, results of (Yarotsky & Zhevnerchuk, 2019) imply that a deep network having both sin and ReLU activations can approximate Lipschitz functions with error $O(e^{-cW^{1/2}})$, where c > 0 is a constant and W is the number of weights. Results of (Shen et al., 2020b) (see also (Shen et al., 2020a)) imply that a three-layer network using the floor function ⌊x⌋, the exponential 2^x and the step function $1_{x\ge 0}$ as activations can approximate Lipschitz functions with an exponentially small error $O(e^{-cW})$.

In the present paper, we show that there are activations superexpressive in the initially mentioned strong sense and yet constructed using simple elementary functions; see Section 2. For example, we prove that there are fixed-size networks with the activations sin and arcsin that can approximate any continuous function with any accuracy. On the other hand, we show in Section 3 that most practically used activations (not involving periodic functions) are not superexpressive.

Figure 1. An example of network architecture with 3 input neurons, 1 output neuron and 7 hidden neurons.

2. Elementary superexpressive families

Throughout the paper, we consider standard feedforward neural networks. The architecture of a network is defined by a directed acyclic graph connecting the neurons (see Fig. 1). A network implementing a scalar d-variable function has d input neurons, one output neuron and a number of hidden neurons. A hidden neuron computes the value $\sigma(\sum_{i=1}^{n} w_i z_i + h)$, where w_i and h are the weights associated with this neuron, z_i are the values incoming from other hidden or input neurons, and σ is an activation function. We will generally allow different hidden neurons to have different activation functions. The output neuron computes the value $\sum_{i=1}^{n} w_i z_i + h$ without an activation function.

Some of our activations (in particular, arcsin) are naturally defined only on a subset of R. In this case we ensure that the inputs of these activations always belong to this subset.

Throughout the paper, we consider approximations of functions f ∈ C([0,1]^d) in the uniform norm. We generally denote vectors by boldface letters; the components of a vector x are denoted x1, x2, . . .

We give now the key definition of the paper.

Definition 1. We call a finite family A of univariate activation functions superexpressive for dimension d if there exists a fixed d-input network architecture, with each hidden neuron equipped with some fixed activation from the family A, such that any function f ∈ C([0,1]^d) can be approximated on [0,1]^d with any accuracy in the uniform norm by such a network, by adjusting the network weights. We call a family A simply superexpressive if it is superexpressive for all d = 1, 2, . . . We refer to the respective architectures as superexpressive for A.

Recall that the Kolmogorov Superposition Theorem (KST) (Kolmogorov, 1957) shows that any multivariate continuous function can be expressed via additions and univariate continuous functions. The following version of this theorem is taken from (Maiorov & Pinkus, 1999).

Theorem 2 (KST). There exist d constants λ_j > 0, j = 1, …, d, with $\sum_{j=1}^{d}\lambda_j \le 1$, and 2d + 1 continuous strictly increasing functions χ_i, i = 1, …, 2d + 1, which map [0,1] to itself, such that every f ∈ C([0,1]^d) can be represented in the form

$$f(x_1,\dots,x_d) = \sum_{i=1}^{2d+1} g\Big(\sum_{j=1}^{d} \lambda_j \chi_i(x_j)\Big)$$

for some g ∈ C([0,1]) depending on f.

An immediate corollary of this theorem is a reduction of multivariate superexpressiveness to the univariate one.

Corollary 1. If a family A is superexpressive for dimension d = 1, then it is superexpressive for all d. Moreover, the number of neurons and connections in the respective superexpressive architectures scales as O(d²).

The proof follows simply by approximating the functions χ_i and g in the KST by univariate superexpressive networks.
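The following schematic sketch (our own illustration, not from the paper) shows how Corollary 1 assembles the d-variate approximant from univariate pieces; the callables `g` and `chis` and the numerical values of `lambdas` are placeholders, standing in for the univariate functions that would themselves be realized by fixed-size univariate superexpressive sub-networks.

```python
import numpy as np

def kst_network(x, g, chis, lambdas):
    """Assemble f(x) ~= sum_{i=1}^{2d+1} g( sum_{j=1}^{d} lambda_j * chi_i(x_j) ),
    the form used in Corollary 1. Each univariate function g, chi_i would be
    approximated by a fixed-size univariate superexpressive network."""
    inner = [sum(lam * chi(xj) for lam, xj in zip(lambdas, x)) for chi in chis]
    return sum(g(t) for t in inner)

# Toy run with d = 2 and placeholder univariate functions (not the actual
# lambda_j, chi_i of Theorem 2, which are fixed but not given in closed form).
d = 2
lambdas = [0.5, 0.25]
chis = [lambda t, a=a: (t + a) / (1 + a) for a in np.linspace(0, 1, 2 * d + 1)]
g = np.tanh
print(kst_network([0.3, 0.9], g, chis, lambdas))
```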
Our main result establishes the existence of simple superexpressive families constructed from finitely many elementary functions. The full list of properties of the activations that we use is relatively cumbersome, so we find it more convenient to just prove the result for a few particular examples rather than attempt to state it in a general form.

Theorem 3. Each of the following families of activation functions is superexpressive:

$$A_1 = \{\sigma_1, \lfloor\cdot\rfloor\}, \qquad A_2 = \{\sin, \arcsin\}, \qquad A_3 = \{\sigma_3\},$$

where σ1 is any function that is real analytic and non-polynomial in some interval (α, β) ⊂ R, and σ3 is an explicit piecewise elementary function: on [−1, 1] it is an antiderivative of (1/π) arcsin x + 3/2, and outside [−1, 1] it is extended by elementary expressions (on (1, ∞), expressions involving sin x and rational factors) so that σ3 is C¹(R), bounded, and strictly monotone increasing.

The family A1 is a generalization of the family for which (Shen et al., 2020b) proved a weaker superexpressiveness property. The function σ3 is given as an example of an explicit superexpressive activation that is smooth and sigmoidal (see Fig. 2).

Figure 2. The function σ3 from the statement of Theorem 3.

Proof of Theorem 3. We consider the families A1, A2, A3 one by one.

Proof for A1. Given a function f ∈ C([0,1]^d), we will construct the approximation f̃ as a function piecewise constant on a partition of the cube [0,1]^d into a grid of smaller cubes. Following the paper (Shen et al., 2020b), we specify these cubes by mapping them to integers with the help of the function ⌊·⌋. Specifically, take some M ∈ N and let

$$g_M(x_1,\dots,x_d) = 1 + \sum_{k=1}^{d} (M+1)^{k-1} \lfloor M x_k \rfloor. \qquad (1)$$

The function g_M is integer-valued and constant on the cubes $I_{M,\mathbf{m}} = [\tfrac{m_1}{M}, \tfrac{m_1+1}{M}) \times \dots \times [\tfrac{m_d}{M}, \tfrac{m_d+1}{M})$ indexed by integer multi-indices m = (m_1, …, m_d) ∈ Z^d. The cube [0,1]^d overlaps with (M+1)^d such cubes I_{M,m}, namely those with 0 ≤ m_k ≤ M. Each of these cubes is mapped by g_M to a unique integer in the range [1, (M+1)^d].

Consider the periodic function

$$\varphi(x) = x - \lfloor x \rfloor, \qquad \varphi: \mathbb{R} \to [0,1). \qquad (2)$$

We will now seek our approximation in the form

$$\tilde f(\mathbf{x}) = u(g_M(\mathbf{x})), \qquad u(y) = (B-A)\,\varphi\big(s\,\sigma_1\big(\tfrac{\alpha+\beta}{2} + w y\big)\big) + A, \qquad (3)$$

where

$$A = \min_{\mathbf{x}\in[0,1]^d} f(\mathbf{x}), \qquad B = \max_{\mathbf{x}\in[0,1]^d} f(\mathbf{x}), \qquad (4)$$

(α+β)/2 is the center of the interval (α, β) where σ1 is analytic and non-polynomial, and s and w are some weights to be chosen shortly. Clearly, the computation defined by Eqs. (1), (2), (3) is representable by a neural network of a fixed size only depending on d (as O(d)) and using activations from A1.

Let N = (M+1)^d. Using the uniform continuity of f and choosing M large so that the size of each cube I_{M,m} is arbitrarily small, we see that superexpressiveness will be established if we show that for any N, any ϵ > 0, and any y ∈ [A, B]^N there exist some weights s and w such that

$$|u(n) - y_n| < \epsilon \quad \text{for all } n = 1, \dots, N. \qquad (5)$$

Recall that a set of numbers a_1, …, a_N is called rationally independent if these numbers are linearly independent over the field Q (i.e., no equality $\sum_{n=1}^{N} \lambda_n a_n = 0$ with rational coefficients λ_n can hold unless all λ_n = 0). Our strategy will be:

1. to choose the weight w so as to make the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with n = 1, …, N rationally independent;

2. to use the density of the irrational winding on the torus to find s ensuring condition (5).
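Before carrying out these two steps, here is a small NumPy sketch (our own illustration) of the cube-indexing function g_M of Eq. (1) and of the approximant f̃ = u ∘ g_M of Eq. (3); taking σ1 = sin on (α, β) = (−1, 1), with center 0, is our assumption for concreteness, and the weights s, w are left as free parameters to be chosen in steps 1-2.

```python
import numpy as np

def g_M(x, M):
    """Eq. (1): index of the grid cube containing x, an integer in [1, (M+1)**d]."""
    x = np.asarray(x, dtype=float)
    d = x.size
    return 1 + int(np.sum((M + 1) ** np.arange(d) * np.floor(M * x)))

def phi(t):
    """Eq. (2): the periodic function phi(t) = t - floor(t), with values in [0, 1)."""
    return t - np.floor(t)

def f_tilde(x, M, s, w, A, B, sigma1=np.sin, center=0.0):
    """Eq. (3): the piecewise-constant approximant u(g_M(x)).
    sigma1 = sin and center = (alpha + beta)/2 = 0 are illustrative choices."""
    y = g_M(x, M)
    return (B - A) * phi(s * sigma1(center + w * y)) + A

# The approximant is constant on each grid cube; the weights s, w still have to be
# tuned following steps 1-2 of the strategy above.
print(f_tilde([0.31, 0.62], M=10, s=2.7, w=0.1, A=0.0, B=1.0))
```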
For step 1, we state the following lemma.

Lemma 1. Let σ be a real analytic function in an interval (α, β) with β > α. Suppose that there is N such that for all w with sufficiently small absolute value the values $(\sigma(\tfrac{\alpha+\beta}{2} + wn))_{n=1}^{N}$ are not rationally independent. Then σ is a polynomial.

Proof. For fixed coefficients λ = (λ_1, …, λ_N), the function $\sigma_\lambda(w) = \sum_{n=1}^{N} \lambda_n \sigma(\tfrac{\alpha+\beta}{2} + nw)$ is real analytic for $w \in U_N = \big(-\tfrac{\beta-\alpha}{2N}, \tfrac{\beta-\alpha}{2N}\big)$. Since there are only countably many λ ∈ Q^N, we see that under the hypothesis of the lemma there is some λ such that σ_λ vanishes on an uncountable subset of U_N. Then, by analyticity, σ_λ ≡ 0 on U_N. Expanding this σ_λ into the Taylor series at w = 0, we get the identity $\sum_{n=1}^{N} \lambda_n n^m = 0$ for each m such that $\frac{d^m\sigma}{dx^m}(\tfrac{\alpha+\beta}{2}) \ne 0$. If there are infinitely many such m, then all λ_n = 0 (by letting m → ∞). It follows that if λ is nonzero, then there are only finitely many m's such that $\frac{d^m\sigma}{dx^m}(\tfrac{\alpha+\beta}{2}) \ne 0$, i.e., σ is a polynomial.

Applying Lemma 1 to σ = σ1, we see that for any N there is w such that the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with n = 1, …, N are rationally independent.

For step 2, we use the well-known fact that an irrational winding on the torus is dense:

Lemma 2. Let a_1, …, a_N be rationally independent real numbers. Then the set $Q_N = \{(\varphi(s a_1), \dots, \varphi(s a_N)) : s \in \mathbb{R}\}$ (where φ is defined in Eq. (2)) is dense in [0,1)^N.

For completeness, we provide a proof in Appendix A.
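The proof of Lemma 2 in Appendix A is constructive in principle, and for very small N a suitable s can even be located by brute force. The following sketch (our own illustration; the values a_n and the target point are hypothetical) scans s over a grid and picks the value for which (φ(s a_1), …, φ(s a_N)) is closest to a prescribed point of [0,1)^N in the wrap-around metric.

```python
import numpy as np

def find_s(a, target, s_max=3000.0, num=1_000_000):
    """Brute-force search for s such that (phi(s*a_n))_n is close to `target`.
    `a` should be (numerically) rationally independent for the search to succeed."""
    s_grid = np.linspace(0.0, s_max, num)
    points = np.mod(np.outer(s_grid, a), 1.0)          # rows are (phi(s*a_1), ..., phi(s*a_N))
    diff = np.abs(points - target)
    errs = np.linalg.norm(np.minimum(diff, 1.0 - diff), axis=1)   # torus distance
    i = int(np.argmin(errs))
    return s_grid[i], errs[i]

# Illustrative run: a_n = sin(w*n) for a generic w (rationally independent in practice),
# and an arbitrary target point in [0,1)^3.
a = np.sin(0.7 * np.arange(1, 4))
target = np.array([0.2, 0.5, 0.9])
s, err = find_s(a, target)
print(s, err)
```

As discussed in Section 4, such a search suffers badly from the curse of dimensionality in N, so it is only practical for toy sizes.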
Lemma 2 implies that for any y ∈ [A, B]^N, the point $\tfrac{\mathbf{y}-A}{B-A} \in [0,1]^N$ can be approximated by vectors $(\varphi(s a_n))_{n=1}^{N}$. This implies condition (5), thus finishing the proof for A1.

Proof for A2. We will only give a proof for d = 1; the claim then follows for all larger d by Corollary 1. Consider the piecewise linear periodic function

$$\theta(x) = \tfrac{1}{\pi}\arcsin(\sin \pi x)$$

and the related functions ν(x) = x + θ(x) and ψ(x) = 1 + ν(−θ(x) − 1/2) (see Fig. 3).

Figure 3. The functions θ, ν, ψ from the proof of Theorem 3.

We would like to extend the previous proof for A1 to the present case of A2, using the function ν as a substitute for ⌊·⌋, since ν is constant on the intervals [k − 1/2, k + 1/2] with odd integer k. However, in contrast to the function ⌊·⌋, the function ν is continuous and cannot map the whole segment [0,1] to a finite set of values, which was crucial in the proof for A1. For this reason, we use a partition of unity and represent the approximated function f ∈ C([0,1]) as a sum of four functions supported on a grid of disjoint small segments.

Specifically, let again M be a large integer determining the scale of our partition of unity. We define this partition by

$$\sum_{q=-1}^{2} \psi_q(x) = 1, \qquad \psi_q(x) = \psi\big(Mx - \tfrac{q}{2}\big), \qquad x \in \mathbb{R}, \qquad (6)$$

and the respective decomposition of the function f by

$$f = \sum_{q=-1}^{2} f_q, \qquad f_q = f\,\psi_q. \qquad (7)$$

For a fixed q, the function ψ_q, and hence also f_q, vanish outside of the union of N = M/2 + O(1) disjoint segments $J_{q,p} = [\tfrac{4p-2+q}{2M}, \tfrac{4p+q}{2M}]$, p = 1, …, N, overlapping with the segment [0,1]. Denote this union by J_q. We approximate each function f_q by a function f̃_q using an analog of the representation (1), (2), (3):

$$G_q(x) = \nu\big(Mx - \tfrac{q+1}{2}\big), \qquad (8)$$
$$v_q(x) = \big(2 \max_{x\in[0,1]} |f(x)|\big)\,\theta\big(s \sin(w\, G_q(x))\big), \qquad (9)$$
$$\tilde f_q(x) = v_q(x)\,\psi_q(x), \qquad (10)$$
$$\tilde f = \sum_{q=-1}^{2} \tilde f_q. \qquad (11)$$

The function G_q in Eq. (8) is constant and equal to 2p − 1 on each segment J_{q,p}. In particular, different segments J_{q,p} overlapping with the segment [0,1] are mapped by G_q to different integers in the interval [1, M+1]. The function v_q in Eq. (9) is the analog of the expression for f̃ given in Eq. (3). Like G_q, the function v_q is constant on each interval J_{q,p}.

By Lemma 1, the values $(\sin(wm))_{m=1}^{M+1}$ are rationally independent for a suitable w. We can then use again the density of the irrational winding on the torus (Lemma 2) to find s such that for each p the value $v_q|_{J_{q,p}}$ is arbitrarily close to the value of f at the center $x_{q,p} = \tfrac{4p-1+q}{2M}$ of the interval J_{q,p}. Indeed, θ is a continuous periodic (period 2) function with $\max_x \theta(x) = -\min_x \theta(x) = \tfrac{1}{2}$. For each p = 1, …, N, we can first find $z_{q,p} \in \mathbb{R}/(2\mathbb{Z})$ such that $\big(2\max_{x\in[0,1]}|f(x)|\big)\,\theta(z_{q,p}) = f(x_{q,p})$, and then, by Lemma 2, find s such that $s \sin(w(2p-1))$ is arbitrarily close to $z_{q,p}$ on the circle R/(2Z) for each p = 1, …, N.

As a result, we see that the function v_q can approximate the function f on the whole set J_q. As before, to achieve an arbitrarily small error, we need to first choose M large enough and then choose suitable w and s. (By the uniform continuity of f, one can use here the same w and s for all q ∈ {−1, 0, 1, 2}.) At the same time, it makes no difference how the function v_q behaves on the complementary set [0,1] \ J_q, since ψ_q vanishes on this set. It follows that f̃_q defined by Eq. (10) can approximate f_q defined by Eq. (7) with arbitrarily small error on the whole segment [0,1]. Then, the function f̃ given by Eq. (11) can approximate f uniformly on [0,1] with any accuracy.

The computation (8)-(11) is directly representable by a fixed-size neural network with activations {sin, arcsin}, except for the multiplication step (10). Multiplication, however, can be implemented with any accuracy by a fixed-size subnetwork:

Lemma 3 (Approximate multiplier). Suppose that an activation function σ has a point x_0 where the second derivative $\frac{d^2\sigma}{dx^2}(x_0)$ exists and is nonzero. Then there is a fixed two-input network architecture with this activation that allows one to implement the approximate multiplication of the inputs, (x, y) ↦ xy, with any accuracy uniformly on any bounded set of inputs x, y, by suitably adjusting the weights.

Proof. First note that we can implement the approximate squaring x ↦ x² with any accuracy using just a network with three neurons. Indeed, by the assumption on $\frac{d^2\sigma}{dx^2}$, for any C, ϵ > 0 we can choose δ such that

$$\Big|\Big(\frac{d^2\sigma}{dx^2}(x_0)\Big)^{-1}\frac{1}{\delta^2}\big(\sigma(x_0 + x\delta) + \sigma(x_0 - x\delta) - 2\sigma(x_0)\big) - x^2\Big| < \epsilon \quad \text{for all } |x| < C.$$

Then, using the polarization identity $xy = \tfrac{1}{4}\big((x+y)^2 - (x-y)^2\big)$, we see that the desired approximate multiplier can be implemented using a fixed 6-neuron architecture.

We can apply this lemma with σ = sin and any x_0 = π/2 + πk, k ∈ Z, thus completing the proof for A2.
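A minimal numerical sketch of the multiplier construction in Lemma 3 (our own illustration, not the paper's code): approximate squaring from a symmetric second difference of σ = sin around x0 = π/2, where sin'' ≠ 0, combined with the polarization identity.

```python
import numpy as np

X0 = np.pi / 2          # a point with sin''(x0) = -sin(x0) != 0
SIGMA_PP = -np.sin(X0)  # second derivative of sin at x0

def approx_square(x, delta=1e-3):
    """Approximate x**2 by a 3-neuron sin-subnetwork: a scaled symmetric second difference."""
    num = np.sin(X0 + delta * x) + np.sin(X0 - delta * x) - 2 * np.sin(X0)
    return num / (SIGMA_PP * delta ** 2)

def approx_multiply(x, y, delta=1e-3):
    """Approximate x*y via the polarization identity xy = ((x+y)**2 - (x-y)**2) / 4,
    i.e., two approximate-squaring subnetworks (6 neurons in total)."""
    return (approx_square(x + y, delta) - approx_square(x - y, delta)) / 4

print(approx_multiply(1.7, -2.3), 1.7 * -2.3)   # close for small delta
```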
Proof for A3. We reduce this case to the previous one, A2. First observe that we can approximate the function arcsin by a fixed-size σ3-network.

Lemma 4. A superexpressive family of continuous activations remains superexpressive if some activations are replaced by their antiderivatives.

Proof. The claim follows since any continuous activation σ can be approximated uniformly on compact sets by the expressions $\tfrac{1}{\delta}\big(\sigma^{(-1)}(x+\delta) - \sigma^{(-1)}(x)\big)$, where $\sigma^{(-1)} = \int \sigma$.

Our activation σ3 is the antiderivative of (1/π) arcsin x + 3/2 on the interval [−1, 1]. Observe next that on the interval [1, ∞) we can express the function sin x by multiplying σ3(x) by some polynomials in x and subtracting constants. By Lemma 3, these operations can be implemented with any accuracy by a fixed-size σ3-network. By the periodicity of sin, we can then approximate it on any bounded interval. We conclude that we can approximate any A2-network with any accuracy by a σ3-network that has the same size up to a constant factor. It is an elementary computation that σ3 is C¹(R), bounded and monotone increasing. This completes the proof of the theorem.

A few remarks are in order.

1. A critical element of the proof is having a piecewise linear periodic function in the network. If not directly available, such a function is generated through compositions, multiplications and differentiations.

2. In the proofs for A2, A3, we could handle the case d > 1 without invoking the KST, by a direct construction (as for A1) with a d-dimensional product partition of unity, but this would require the network size to be exponential in d instead of just being O(d²).

3. The approximation rates are not limited by the standard bounds (DeVore et al., 1989) because of the highly irregular choice of the parameter s.

3. Absence of superexpressiveness for standard activations

In this section we show that most practically used activation functions (those not involving sin x or cos x) are not superexpressive. This is an easy consequence of Khovanskii's bounds on the number of zeros of elementary functions (Khovanskii, 1991). We remark that these bounds have been used previously to bound the expressiveness of neural networks in terms of VC dimension (Karpinski & Macintyre, 1997) or Betti numbers of level sets (Bianchini & Scarselli, 2014).

First recall the standard definition of Pfaffian functions (see e.g. (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004)). A Pfaffian chain is a sequence f_1, …, f_l of real analytic functions defined on a common connected domain U ⊂ R^d and such that the equations

$$\frac{\partial f_i}{\partial x_j}(\mathbf{x}) = P_{ij}\big(\mathbf{x}, f_1(\mathbf{x}), \dots, f_i(\mathbf{x})\big), \qquad 1 \le i \le l, \; 1 \le j \le d,$$

hold in U for some polynomials P_{ij}. A Pfaffian function in the chain (f_1, …, f_l) is a function on U that can be expressed as a polynomial P in the variables (x, f_1(x), …, f_l(x)). The complexity of the Pfaffian function f is the triplet (l, α, β) consisting of the length l of the chain, the maximum degree α of the polynomials P_{ij}, and the degree β of the polynomial P.

The importance of Pfaffian functions stems from the fact that they include all elementary functions when considered on suitable domains. This is shown by first checking that the simplest elementary functions are Pfaffian, and then by checking that arithmetic operations and compositions of Pfaffian functions again produce Pfaffian functions. We refer again to (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004) for details.

Proposition 1.

1. (Elementary examples) The following functions are Pfaffian: polynomials on U = R^d, e^x on R, ln x on R_+, arcsin x on (−1, 1). The function sin x is Pfaffian on any bounded interval (A, B), with complexity depending on B − A, but sin x is not Pfaffian on R.

2. (Operations with Pfaffian functions) Sums and products of Pfaffian functions f, g with a common domain U are Pfaffian. If the domain of a Pfaffian function f includes the range of a Pfaffian function g, then the composition f ∘ g is Pfaffian on the domain of g. The complexity of the resulting functions f + g, fg, f ∘ g is determined by the complexities of the functions f, g.
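As a concrete instance of Proposition 1 (a worked example of ours, not taken from the paper): the standard sigmoid σ(x) = (1 + e^{−x})^{−1} is Pfaffian on U = R with a chain of length 1, since its derivative is a polynomial in σ itself,

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big) = P_{11}\big(x, \sigma(x)\big), \qquad P_{11}(x, y) = y - y^2,$$

so, taking f = σ itself as the Pfaffian function in this chain, σ has complexity (l, α, β) = (1, 2, 1); the analogous computations tanh' = 1 − tanh² and (e^x)' = e^x give length-1 chains for tanh and e^x.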
We state now the fundamental result on Pfaffian functions. We call a solution x ∈ R^d of a system f_1(x) = … = f_d(x) = 0 nondegenerate if the respective Jacobi matrix $\big(\frac{\partial f_i}{\partial x_j}(\mathbf{x})\big)$ is nondegenerate.

Theorem 4 (Khovanskii, 1991). Let f_1, …, f_d be Pfaffian d-variable functions with a common Pfaffian chain on a connected domain U. Then the number of nondegenerate solutions of the system f_1(x) = … = f_d(x) = 0 is bounded by a finite number only depending on the complexities of the functions f_1, …, f_d.

The idea of the proof is to use a generalized Rolle's lemma and bound the number of common zeros of the functions f_k by the number of common zeros of suitable polynomials (in a larger number of variables). The latter number can then be upper bounded using the classical Bézout theorem. It is possible to write the bound in Theorem 4 explicitly, but we will not need that for our purposes.

We will only use the univariate version of Theorem 4. In this case, it will also be easy to remove the inconvenient nondegeneracy condition in this theorem. (Note that this condition is essential in general: for example, if f_1(x_1, x_2) = f_2(x_1, x_2) = x_1, then the system f_1 = f_2 = 0 has infinitely many degenerate solutions.)

Proposition 2. Let f be a univariate Pfaffian function on an open interval I ⊂ R. Then either f ≡ 0 on I, or the number of zeros of f is bounded by a finite number only depending on the complexity of f.

Proof. Suppose f ≢ 0. Then, by real analyticity of f, any zero x_0 of f in I is isolated, and we can write f(x) = c(x − x_0)^k (1 + o(1)) as x → x_0, with some c ≠ 0 and k ∈ N. By Sard's theorem, there is a sequence ϵ_n → 0 such that the values ±ϵ_n are not critical values of f. The functions f ± ϵ_n are Pfaffian with the same complexity as f, and do not have degenerate zeros. For any zero x_0 of f, the two functions f ± ϵ_n will have in total at least two nondegenerate zeros in a vicinity of x_0, for any ϵ_n small enough. It follows that the total number of all nondegenerate zeros of the two functions f ± ϵ_n, for ϵ_n small enough, will be at least twice as large as the number of zeros of the function f (or can be made arbitrarily large if f has infinitely many zeros). Applying Theorem 4 to the functions f ± ϵ_n, we obtain the desired conclusion on the zeros of f.

Now we apply these results to standard activation functions.

Definition 2. We say that an activation function σ is piecewise Pfaffian if its domain of definition can be represented as a union of finitely many open intervals U_n and points x_k in R so that σ is Pfaffian on each U_n.

By the discussion above, this definition covers most practically used activations, such as tanh x, the standard sigmoid σ(x) = (1 + e^{−x})^{−1}, ReLU σ(x) = max(0, x), leaky ReLU σ(x) = max(ax, x), the binary step function (σ(x) = 0 for x < 0 and σ(x) = 1 for x ≥ 0), the Gaussian σ(x) = e^{−x²}, softplus σ(x) = ln(1 + e^x) (Glorot et al., 2011), ELU (σ(x) = a(e^x − 1) for x < 0 and σ(x) = x for x ≥ 0) (Clevert et al., 2015), etc.

Our main result in this section states that any finite collection of such activations is not superexpressive.

Theorem 5. Let A be a family of finitely many piecewise Pfaffian activation functions. Then A is not superexpressive.

Proof. Suppose that A is superexpressive; then there is a fixed one-input network architecture allowing us to approximate any univariate function f ∈ C([0,1]). Then for any N we can choose the network weights so that the function f̃ implemented by the network has at least N sign changes, in the sense that there are points 0 ≤ a_0 < … < a_N ≤ 1 such that $(-1)^n \tilde f(a_n) > 0$ for all n. Indeed, this follows simply by approximating the function f(x) = sin((N+1)πx) with an error less than 1. We will show, however, that this N cannot be arbitrarily large if the activations are from a finite piecewise Pfaffian family.
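The counting argument can be checked empirically in the simplest piecewise Pfaffian case; the toy sketch below is our own illustration, restricted to a one-hidden-layer ReLU architecture. With n hidden units the implemented function is piecewise linear with at most n + 1 pieces, so its number of sign changes on [0, 1] never exceeds n + 1, no matter how the weights are chosen, whereas sin((N+1)πx) has N sign changes.

```python
import numpy as np

def relu_net(x, w, b, c, h):
    """One-hidden-layer ReLU network: f(x) = sum_i c_i * relu(w_i * x + b_i) + h."""
    return np.maximum(0.0, np.outer(x, w) + b) @ c + h

def sign_changes(values):
    """Count sign changes in a sequence of sampled values (zeros are ignored)."""
    s = np.sign(values)
    s = s[s != 0]
    return int(np.sum(s[:-1] != s[1:]))

n = 8                                   # number of hidden ReLU units (fixed architecture)
x = np.linspace(0.0, 1.0, 10_001)
rng = np.random.default_rng(1)
max_changes = 0
for _ in range(300):                    # random weight draws for the same architecture
    w, b, c = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
    h = rng.normal()
    max_changes = max(max_changes, sign_changes(relu_net(x, w, b, c, h)))
print("max sign changes over random draws:", max_changes, "<= n + 1 =", n + 1)
print("sign changes of sin((N+1)*pi*x) for N = 20:", sign_changes(np.sin(21 * np.pi * x)))
```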
Lemma 5. If the activations belong to a finite piecewise Pfaffian family A, then any function f̃ implemented by the network is piecewise Pfaffian. Moreover, the number of the respective intervals U_n as well as the complexity of each restriction f̃|_{U_n} do not exceed some finite values only depending on the family A and the network architecture.

Proof. This can be proved by induction on the number of hidden neurons in the network. The base of induction corresponds to networks without hidden neurons; in this case the statement is trivial. Now we make the induction step. Given a network, choose some hidden neuron whose output is not used by other hidden neurons (i.e., choose a neuron in the "last hidden layer"). With respect to this neuron, we can decompose the network output as

$$\tilde f(x) = c\,\sigma_k\Big(\sum_{s=1}^{K} c_s \tilde f_s(x) + h\Big) + \sum_{s=1}^{K} c'_s \tilde f_s(x) + h'. \qquad (12)$$

Here, σ_k is the activation function residing at the chosen neuron, f̃_s are the signals going out of the other hidden and input neurons, and c, c_s, c'_s, h, h' are various weights. By the inductive hypothesis, all the functions f̃_s here are piecewise Pfaffian. Moreover, by taking intersections, the segment [0,1] can be divided into finitely many open intervals I_j separated by finitely many points x_l so that each of the functions f̃_s is Pfaffian on each interval I_j. The number of these intervals I_j and the complexities of the restrictions f̃_s|_{I_j} are bounded by some finite values depending only on the family A and the network architecture. We see also that the linear combination $F(x) = \sum_{s=1}^{K} c_s \tilde f_s(x) + h$ appearing in Eq. (12) is Pfaffian on each interval I_j.

Observe next that the composition σ_k ∘ F is piecewise Pfaffian on each interval I_j. Indeed, let $U_r^{(k)}$ and $x_r^{(k)}$ be the finitely many open intervals and points associated with the activation σ_k as a piecewise Pfaffian function. By Proposition 2, for each r, the pre-image $(F|_{I_j})^{-1}(x_r^{(k)})$ is either the whole interval I_j or its finite subset. In the first case, σ_k ∘ F is constant and thus trivially Pfaffian on I_j. In the second case, the interval I_j can be subdivided into sub-intervals I_{j,m} such that each image F(I_{j,m}) belongs to one of the intervals $U_r^{(k)}$, so that σ_k ∘ F is Pfaffian on each I_{j,m}. The number of these sub-intervals and the complexities of the restrictions are bounded by some finite numbers depending on the activation σ_k and the complexity of F|_{I_j}.

Returning to the representation (12), we see that f̃ is Pfaffian on each interval I_{j,m}; moreover, the total number of these intervals as well as the complexities of the restrictions f̃|_{I_{j,m}} are bounded by finite numbers determined by the family A and the architecture, thus proving the claim.

The lemma implies that, as N grows, some interval U_n on which f̃ is Pfaffian with bounded complexity must contain an arbitrarily large number of sign changes of f̃. This gives a contradiction with Proposition 2.

We remark that one can also give a proof of this theorem expressed in terms of the Vapnik-Chervonenkis dimension: first we note that if A is superexpressive then there must be a network architecture with an infinite VC-dimension (after thresholding), and then we note that if the activations in A are piecewise Pfaffian then the VC-dimension of any fixed architecture is finite.

4. Discussion

We have given examples of simple explicit activation functions that allow one to approximate arbitrary continuous functions using fixed-size networks (Theorem 3), and we have also shown that this cannot be achieved with the common practical activations (Theorem 5). We mention two interesting questions left open by our results.
First, our existence result (Theorem 3) is of course purely theoretical: though the network is small, a huge approximation complexity is hidden in the very special choice of the network weights. Nevertheless, assuming that we can perform computations with any precision, one can ask if it is possible to algorithmically find network weights providing a good approximation. The main difficulty here is to find a value s such that $(\varphi(s a_n))_{n=1}^{N}$ is close to the given N-dimensional point. Such a value exists by Lemma 2 on the density of the irrational winding, and the proof of the lemma is essentially constructive, so theoretically one can perform the necessary computation and find the desired s. However, the proof is based on the pigeonhole principle and is very prone to the curse of dimensionality (with the dimensionality here corresponding to the number N of fitted data points), making this computation practically unfeasible even for relatively small N.

Another open question is whether the function sin alone is superexpressive. This cannot be ruled out by the methods of Section 3, since sin has an infinite Pfaffian complexity on R. More generally, one can ask if there are individual superexpressive activations that are elementary and real analytic on the whole R. A repeated computation of antiderivatives using Lemma 4 allows us to construct a piecewise elementary superexpressive function of any finite smoothness, but not analytic on R.

A. Proof of Lemma 2

It is convenient to endow the cube [0,1)^N with the topology of the torus T^N = R^N/Z^N by gluing the endpoints of the interval [0,1]. Though the lemma is stated in terms of the original topology on [0,1)^N, it is clear that a subset is dense in the original topology if and only if it is dense in the topology of the torus. Accordingly, when considering the distance between two points b_1, b_2 ∈ [0,1)^N, it will be convenient to use the distance between the corresponding cosets, i.e.

$$\rho(\mathbf{b}_1, \mathbf{b}_2) = \min_{\mathbf{z}_1, \mathbf{z}_2 \in \mathbb{Z}^N} \big|\mathbf{b}_1 + \mathbf{z}_1 - (\mathbf{b}_2 + \mathbf{z}_2)\big|,$$

where |·| is the usual Euclidean norm. Note that this ρ is a shift-invariant metric on the torus.

The proof of the lemma is by induction on N. The base N = 1 is obvious (a single number a_1 is rationally independent iff a_1 ≠ 0). Let us make the induction step from N − 1 to N, with N ≥ 2. Given the rationally independent numbers a_1, …, a_N, first observe that none of them equals 0. Let s_0 = 1/a_N. Let φ(x) = x − ⌊x⌋ as in Eq. (2). If s = m s_0 with some integer m, then φ(m s_0 a_N) = 0, so that the points $\mathbf{b}_m = (\varphi(m s_0 a_1), \dots, \varphi(m s_0 a_N))$ lie in the (N−1)-dimensional face [0,1)^{N−1} × {0} of the full set [0,1)^N.

Observe that the points b_m are different for different integer m's. Indeed, if b_{m_1} = b_{m_2} for some integers m_1 ≠ m_2, then there are some integers p_1, …, p_N such that (m_1 − m_2) s_0 a_n = p_n for all n = 1, …, N. But then the numbers a_1, …, a_N are not rationally independent, since, e.g., (m_1 − m_2) a_1 = p_1/s_0 = p_1 a_N.

Since the points b_m are distinct, they form an infinite set in the face [0,1)^{N−1} × {0}. Then for any ϵ we can find a pair of different points b_{m_1} and b_{m_2} separated by a distance ρ(b_{m_1}, b_{m_2}) < ϵ. Note that the distance ρ(b_{m_1}, b_{m_2}) only depends on the difference m_2 − m_1, so we can assume that m_1 = 0: ρ(b_0, b_{m_2}) = ρ(0, b_{m_2}) < ϵ. By the definition of ρ, we can then find z ∈ Z^N such that for $\mathbf{b}'_{m_2} = \mathbf{b}_{m_2} - \mathbf{z}$ we have

$$|\mathbf{b}'_{m_2}| = \rho(\mathbf{0}, \mathbf{b}_{m_2}) < \epsilon. \qquad (13)$$

We can write b'_{m_2} in the form $\mathbf{b}'_{m_2} = (m_2 s_0 a_1 - p_1, \dots, m_2 s_0 a_{N-1} - p_{N-1}, 0)$ with some integers p_1, …, p_{N−1}.
Observe that the first N − 1 components $b'_{m_2,n}$ of $\mathbf{b}'_{m_2}$ are rationally independent. Indeed, if $\sum_{n=1}^{N-1} \lambda_n b'_{m_2,n} = 0$ with some rational λ_n, then, by expressing this identity in terms of the original values a_n, we get

$$\sum_{n=1}^{N-1} \lambda_n a_n - \frac{1}{m_2}\Big(\sum_{n=1}^{N-1} \lambda_n p_n\Big) a_N = 0,$$

so λ_n ≡ 0 by the rational independence of the a_n.

Consider now the set $Q'_{N-1} = \{(\varphi(t b'_{m_2,1}), \dots, \varphi(t b'_{m_2,N-1})) : t \in \mathbb{R}\}$. On the one hand, by the induction hypothesis, the set Q'_{N−1} is dense in [0,1)^{N−1}, because the numbers b'_{m_2,n} are rationally independent. On the other hand, observe that the points in Q'_{N−1} corresponding to integer t also belong to the set $Q_N = \{(\varphi(s a_1), \dots, \varphi(s a_N)) : s \in \mathbb{R}\}$: specifically, with the respective s = m_2 t s_0. It follows that for any b in the face [0,1)^{N−1} × {0} we can find a point b̃ of the set Q_N at a distance at most 2ϵ from b: first find a point b̂ ∈ Q'_{N−1} such that |b̂ − b| < ϵ, and then, if b̂ corresponds to some t = t_0 in Q'_{N−1}, take b̃ corresponding to t = ⌊t_0⌋. The distance ρ(b̂, b̃) < ϵ by Eq. (13) and because |t_0 − ⌊t_0⌋| < 1:

$$\rho(\hat{\mathbf{b}}, \tilde{\mathbf{b}}) \le \big|t_0 \mathbf{b}'_{m_2} - \lfloor t_0 \rfloor \mathbf{b}'_{m_2}\big| < |\mathbf{b}'_{m_2}| < \epsilon.$$

The above argument shows that the face [0,1)^{N−1} × {0} = {b ∈ [0,1)^N : b_N = 0} can be approximated by points of Q_N with s belonging to the set S = {m_2 t s_0}_{t∈Z}. Any other (N−1)-dimensional cross-section {b ∈ [0,1)^N : b_N = c} is then approximated by the points of Q_N with s ∈ S + c s_0: indeed, s = c s_0 gives us one point in this cross-section, and additional shifts by s ∈ S allow us to approximate any other point with the same b_N.

Acknowledgment

I thank Maksim Velikanov and the anonymous reviewers for useful feedback on the preliminary version of the paper. Research was supported by the Russian Science Foundation, grant 21-11-00373.

References

Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553-1565, 2014.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

DeVore, R. A., Howard, R., and Micchelli, C. Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469-478, 1989.

Gabrielov, A. and Vorobjov, N. Complexity of computations with Pfaffian and Noetherian functions. Normal Forms, Bifurcations and Finiteness Problems in Differential Equations, 137:211-250, 2004.

Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315-323, 2011.

Guliyev, N. J. and Ismailov, V. E. A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation, 28(7):1289-1304, 2016.

Guliyev, N. J. and Ismailov, V. E. Approximation capability of two hidden layer feedforward neural networks with fixed weights. Neurocomputing, 316:262-269, 2018a.

Guliyev, N. J. and Ismailov, V. E. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Networks, 98:296-304, 2018b.

Igelnik, B. and Parikh, N. Kolmogorov's spline network. IEEE Transactions on Neural Networks, 14(4):725-733, 2003.

Ismailov, V. E. On the approximation by neural networks with bounded number of neurons in hidden layers. Journal of Mathematical Analysis and Applications, 417(2):963-969, 2014.

Karpinski, M. and Macintyre, A. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169-176, 1997.
Khovanskii, A. G. Fewnomials. Vol. 88 of Translations of Mathematical Monographs. American Mathematical Society, 1991.

Kolmogorov, A. N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, volume 114, pp. 953-956. Russian Academy of Sciences, 1957.

Kůrková, V. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617-622, 1991.

Kůrková, V. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501-506, 1992.

Maiorov, V. and Pinkus, A. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1-3):81-91, 1999.

Montanelli, H. and Yang, H. Error bounds for deep ReLU networks using the Kolmogorov-Arnold superposition theorem. Neural Networks, 129:1-6, 2020.

Schmidt-Hieber, J. The Kolmogorov-Arnold representation theorem revisited. arXiv preprint arXiv:2007.15884, 2020.

Shen, Z., Yang, H., and Zhang, S. Deep network approximation with discrepancy being reciprocal of width to power of depth. arXiv preprint arXiv:2006.12231, 2020a.

Shen, Z., Yang, H., and Zhang, S. Neural network approximation: Three hidden layers are enough. arXiv preprint arXiv:2010.14075, 2020b.

Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. arXiv preprint arXiv:1906.09477, 2019.

Zell, T. Betti numbers of semi-Pfaffian sets. Journal of Pure and Applied Algebra, 139(1-3):323-338, 1999.