# structure_of_universal_formulas__cfb9ea9b.pdf Structure of universal formulas Dmitry Yarotsky Skoltech d.yarotsky@skoltech.ru By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition. 1 Introduction By universal formulas we broadly (informally) understand parameterized explicit analytic expressions that have a fixed complexity (as expressions) but nevertheless can approximate any continuous function on a compact set. An example of such formula, given by Boshernitzan [1], is y(x) = Z x+a bd 1 + d2 cos(bt) cos(et)dt + c, (1) where d > 0, a, b and c are parameters. [1] proves that, by varying the parameters, functions (1) can uniformly approximate any continuous function on any compact interval in R. Expression (1) involves integration. Laczkovich and Ruzsa [13] prove existence of universal formulas representable as elementary functions without integration (moreover, their formulas can uniformly approximate continuous functions on the whole R under a mild growth assumption). One can further restrict the types of operations and show that universal formulas can be realized by fixed-size classical neural networks with suitable activation functions [29]. By a classical neural network we mean a computational model consisting of units ( hidden neurons ), each computing a map z1, . . . , zn 7 σ(Pn k=1 wkzk + h), where z1, . . . , zn (signals) are outputs of some previous neurons, and wk and h are weights (parameters) of the neuron. In addition to hidden neurons, one distinguishes input neurons that just represent the scalar components of the network input, and output neuron(s) that represent the scalar component(s) of the output. The output neurons work as hidden neurons but without activations, i.e. as z1, . . . , zn 7 Pn k=1 wkzk + h. As defined, neural networks are fairly restrictive in terms of used operations: for example, they contain multiplications of signals z by constants (weights) w, but not multiplications between signals, z1z2. However, this and some other extra operations can be implemented with any accuracy using suitable combinations of neurons. The universal network in [29] uses a fixed architecture graph, activations sin and arcsin, 37th Conference on Neural Information Processing Systems (Neur IPS 2023). σ = sin σ = arcsin z1, z2 7 z1z2 = Figure 1: Left: A (slightly simplified) fixed-size universal neural network approximating any f C([0, 1]) from [29]. Most hidden neurons are standard neurons z1, . . . , zn 7 σ(Pn k=1 wkzk+h) with one of the activation functions σ (identity, sin or arcsin). Also, there are several multiplication neurons. The parameters of the model are all the weights wk, h in the standard neurons. Right: A multiplication neuron can be replaced by a group of standard neurons while preserving universality. and can approximate any f C([0, 1]) (see Figure 1). Another fixed-size universal network had been constructed earlier in [15], but the activation was not an elementary function in that work. The Kolmogorov(-Arnold) Superposition Theorem [10] shows that any function f C([0, 1]d) can be expressed in terms of a fixed number (only depending on d) of summations and univariate continuous functions. This implies that multivariate universal formulas can be easily constructed from univariate ones, by plugging the latter in the Kolmogorov ansatz. Accordingly, the question of multivariate universal formulas is not significantly more interesting or complicated than its counterpart for univariate formulas, and in this paper restrict our attention to the latter. We also mention in passing another interesting property of universal formulas: they naturally give rise to universal polynomial differential equations P(x, f, f , . . . , f (n)) = 0, with a polynomial P and a dense set of solutions in C([a, b]), see [21, 2, 1, 19, 13]. (In fact, one can find such a polynomial for any parameterized family of elementary functions; the density of solutions then follows from the universality of the formula.) See [22] for a connection to analog computers. A foundational theorem on neural networks is the universal approximation theorem (UAT). It states that, for a broad class of activation functions, a neural network with a single layer of hidden neurons can approximate any continuous function on a compact set if we increase the number of neurons [4, 18]. This result is very different from, and should not be confused with the above fixed-size universal networks of [15] or [29]. In UAT, universality is achieved by increasing the number of parameters, and accordingly UAT imposes much weaker requirements on the activation functions (e.g., it holds for any continuous non-polynomial activation, see [14, 18]). On the other hand, the fixed-size universal network in Figure 1 is fairly simple. One of the two activation functions used there, sin, is actually used as an activation in practical multi-layer networks (e.g., the popular SIREN network of [23]). This raises the following natural question motivating the present paper: can practically used neural-network-type models be universal as fixed-size formulas (i.e., without increasing the number of neurons)? There is an important caveat here: the universality of universal formulas is, of course, an idealization that requires their parameters to have unbounded precision or magnitude. If parameters are finitely defined so as to be representable on real computer, one can give a simple entropy bound constraining feasible approximation accuracy. Specifically, suppose that a formula contains p parameters, each somehow represented using B bits (e.g., B = 32 or 64 for standard floats). Suppose we are approximating a function f at N points x1, . . . , x N with absolute accuracy ϵ. Suppose finally that |f| is bounded by M and f(xk), k = 1, . . . , N, can be chosen independently subject to this constraint. Then N log2(M/ϵ) p B. (2) This shows that in a practical application with parameters implemented as standard floats, universal formulas such as (1) can typically produce only crude approximations. In the sequel, we will ignore these finite-precision aspects. Our contribution. Our goal in this paper is to understand which structural properties are critical for universal formulas. This question does not seem to be simple. As a general strategy of tackling it, we analyze a sequence of progressively more complex parametric functional families with respect to several relevant classes of model expressiveness. Our specific contributions are as follows. 1. We introduce a hierarchy G0 G1 G2 of classes of model expressiveness (Section 2). Here G0 represents the families of infinite VC-dimension, G1 the families achieving approximation on arbitrary finite sets, and G2 the families achieving uniform approximation on the whole segment [0, 1]. Universal formulas correspond to the class G2. The class G0 is important because elementary functions belonging to it must involve, in some form, the function sin acting on an unbounded domain (Section 3). We argue that the class G1 is also important because, as we show, some functional families cannot belong to it due to algebraic (polynomial) constraints. 2. As a first result on algebraic constraints, we consider the family of fixed-size linear combinations of sine waves with unbounded weights (equivalently, fixed-size single-hidden-layer neural networks with activation sin) and prove that it belongs to the class G0 \ G1. Moreover, we give a complete description of the limit points of this family (Section 4). 3. As a generalization of the previous result, we introduce a general family of polynomiallyexponentially-algebraic expressions of bounded complexity, and prove that it is also subject to polynomial constraints and lies outside G1 (Section 5). 4. We give an application of the previous result to branching expressions and neural networks with conventional activations (Section 6). Specifically, we prove that if the network has a bounded complexity and possibly multiple layers, but only one layer with transcendental activation functions (such as sin x, Gaussian or standard sigmoid), then the network is subject to polynomial constraints and lies outside G1. Moreover, polynomial constraints hold even if the activation functions are only piecewise analytic, as is common in practice. 5. The situation is drastically different for formulas involving compositions of non-polynomial analytic functions and sin: such formulas generally belong to G1 (Section 7). It seems hard to separate the classes G1 \ G2 and G2. We give a couple of examples of families for which we can prove that they belong to G1 \ G2. One of these families can be described as that of fixed-complexity two-hidden-layer networks with sin activation that have a single neuron in the second hidden layer. We conjecture that general sine networks of fixed complexity also belong to G1 \ G2. Most proofs are deferred to the appendix (the respective sections are indicated in the theorems). 2 The hierarchy of expressiveness classes Throughout the paper, we consider various families H of functions g : [0, 1] R. We define for them expressiveness classes G0, G1, G2: H G0 def Vapnik-Chervonenkis dimension VCdim(H) = ; H G1 def For any finite subset {x1, . . . , x N} [0, 1], any y1, . . . , y N R and any ϵ > 0 there is g H such that |g(xk) yk| < ϵ for all k = 1, . . . , N. H G2 def For any f C([0, 1]) and ϵ > 0, there is g H such that g f = supx [0,1] |g(x) f(x)| < ϵ. Here, VCdim(Hf) = means that for any N N we can find a size-N subset X [0, 1] shattered by H thresholded at 0, i.e. such that for any S X there is g H for which g(x) > 0 if x S and g(x) 0 if x X \ S. Clearly, G0 G1 G2. (3) The classes G1 and G2 reflect approximability on arbitrary finite sets and on the whole interval [0, 1], respectively. Universal formulas as discussed in the Introduction correspond to the class G2. One can say that G1 corresponds to universality in a weaker sense, which is particularly relevant if we try to learn the parameters using a finite training set {(xk, yk)}. We will consider several families H that we will distinguish by superscripts denoting or enumerating their types (e.g. Hf or H(1), H(2), . . .) and by subscripts reflecting complexity within the type (e.g., H(2) N will denote linear combinations of N sine waves). Most of these families admit natural interpretations in terms of conventional neural networks we will mention them along the way. While the remainder of this work focuses on the mathematical analysis of the relation between various functional families and the classes Gk, let us briefly clarify the role of these classes from the general perspective of approximation and learning. 1. Most conventional learning models (e.g., neural networks) can be said to be defined using (piecewise) elementary functions. If the model is assumed to have a bounded complexity (e.g., a bounded number of neurons) and its weights and signals are bounded, this model falls outside even the broadest class G0 (see Section 3 below). This does not apply, however, to models containing the function sin acting on an unbounded domain. 2. For a fixed-complexity model, being outside the classes Gk and thus not having the full approximation power is not a practically negative property: this just means that the complexity of the model needs to increase to achieve a better fit. 3. The situation when a fixed-complexity model belongs to the class G2 (i.e., is a universal formula) is mathematically interesting but somewhat exotic from the perspective of conventional computation, since the respective model parameters need to contain a potentially unbounded amount of information. 4. The class G1 \ G2 seems especially dangerous from the generalization point of view: given a hypothesis space H from this class, we can fit a model arbitrarily well to a generic continuous ground truth on any finite training set, but can hardly ever uniformly approximate this ground truth on the whole domain. This cannot be said about the class G2. In Section 7 we give examples of functional families (neural networks) provably belonging to this dangerous class G1 \ G2. 3 Pfaffian functions There is an important broad class of parameterized functions f(x, w) for which one can guarantee that the family Hf obtained by fixing various w has a finite VC-dimension. This class is most naturally described in terms of Pfaffian functions introduced by [9]. Pfaffian functions can be defined using function sequences in which, at each step, the next function satisfies polynomial first-order differential equations involving itself and previous functions (see [9, 5, 30] for details). In particular, Pfaffian functions include all elementary functions when the latter are defined on suitable domains. Importantly, in contrast to functions like ln, exp and polynomials, which are Pfaffian on their whole natural domain of definition, the function sin is not Pfaffian on the whole set R, though it is Pfaffian on any bounded interval (a, b) (moreover, the representation of sin as a Pfaffian function depends on (a, b) and becomes more complex as b a increases). For this reason, elementary functions are Pfaffian only when they involve sin (or a closely related function, like cos) restricted to bounded intervals. The fundamental theorem of [9] gives an explicit finite upper bound on the number of zeros of Pfaffian functions, which implies, in particular, that the respective parametric families have finite VC-dimension. We state this result specialized to elementary functions and in the form suitable for our purposes; see [9, 5, 30, 29] for details. Theorem 1. Suppose that a formula y = f(x, w) is constructed from the variables x and w1, . . . wd using real numbers, standard arithmetic operations (+, , , /), elementary functions ln, exp, sin, arcsin, and compositions. Suppose that there is a nonempty subset W Rd such that for any w = (w1, . . . , wd) W the function f( , w) : [0, 1] R) is well-defined, i.e., ln and arcsin are applied on the intervals (0, ) and ( 1, 1), respectively, and there is no division by 0. Moreover, suppose that there is a bounded interval (a, b) to which the arguments of sin always belong for all w W. Then the family Hf = {f( , w)}w W has a finite VC-dimension. This theorem shows that an elementary function has chances to be universal only if it includes sin (or some related function) acting on an unbounded domain. Therefore, we will focus on these latter functions in the sequel. It is crucial for Theorem 1 that all the computations in the formula are real-valued and not complexvalued (the proof eventually boils down to Rolle s theorem, which does not hold for C-valued functions). In particular, we cannot bypass the domain restriction of sin by simply writing sin x = (eix e ix)/2i. In contrast, many results in the next sections will be essentially algebraic and will hold for complex-valued operations. We remark that Sontag [25] considers a class of formulas closely related to Pfaffian functions and proves that shattering k-point sets in general position requires at least (k 1)/2 parameters in such formulas. 4 Linear combinations of sines A simple example of a family with an infinite VC-dimension is H(1) = {c sin(ωx)}c,ω R. This is usually proved (see e.g. [28], Section 3.6) by giving a particular example of an arbitrarily large finite set shattered by H(1). For our purposes, however, it is instructive to describe a general collection of finite sets on which H(1) has universal approximability. We say that numbers x1, x2, . . . are rationally independent if they are linearly independent over the field Q of rational numbers. Theorem 2. Let X = {x1, . . . , x N} be a finite subset of [0, 1] such that the values x1, . . . , x N are rationally independent. Then the family H(1) can approximate any R-valued function on X. Proof. Consider the torus TN = (R/2πZ)N; its points can be identified with the points of the cube [0, 2π)N. Let : R [0, 1) be the standard floor function. By a classical Kronecker s theorem [11, 7], the set {(2π(θx1 θx1 ), . . . , 2π(θx N θx N ))}θ R is a dense subset of TN if and only if the numbers x1, . . . , x N are rationally independent. Using the 2π-periodicity of sin, this implies immediately that if x1, . . . , x N are rationally independent, then {(sin(ωx1), . . . , sin(ωx N))}ω R is dense in ( 1, 1)N and hence {(c sin(ωx1), . . . , c sin(ωx N))}c,ω R is dense in RN. This result shows that the family H(1) G0. It is also clear that H(1) / G1 (e.g., since g(0) = 0 for any g H(1)). In fact, the sufficient condition of rational independence in the theorem is close to also being necessary. In particular, a simple kind of sets not shattered by H(1) are all nontrivial arithmetic progressions xk = u + vk, v = 0, of length 5. Indeed, it follows from the identity sin(u + (k 1)v) + sin(u + (k + 1)v) = 2 sin(u + kv) cos(v) that the values of functions from H(1) cannot, for example, have the sign configuration + + + + on such progressions. Consider now the more general family consisting of all linear combinations of a fixed number of sine waves with arbitrary frequencies and phases: H(2) N = n N X n=1 cn sin(ωnx + hn) o cn,ωn,hn R,n=1,...,N. (4) One can view this family as representing single-hidden-layer neural networks with the sin activation function (ωn and hn are the weights and biases in the hidden layer, and cn are the weights in the output layer). One can again easily show that H(2) N G0 \G1, by observing that the values of all functions g H(2) N are constrained by common nontrivial polynomial equations: 1. For any g H(2) N , x0 R and αn, βn R, n = 0, . . . , 2N, det A = 0, where A = g(x0 + αk + βm) 2N k,m=0. (5) 2. det A is a nonzero polynomial in the variables g(x), where x X = {x0+αk+βm|k, m = 0, . . . , 2N}, if and only if all αk are different and all βm are different. Proof. 1. The matrix A is degenerate because each of its 2N + 1 rows is a linear combination of 2N vectors (esωnβ1, . . . , esωnβN ) with n = 1, . . . , N and s = i. 2. If some αk or βm are repeated, then the matrix A contains repeated columns or rows and so the polynomial det A identically vanishes. Conversely, suppose that all αk are different and all βm are different. Without loss, we can assume α0 < . . . < α2N and β0 < . . . < β2N. We can then expand det A = g(x0 + α2N + β2N) det A + . . . , where det A is the top left 2N 2N minor and the terms . . . do not contain the variable g(x0 + α2N + β2N). Repeating the expansion for A , etc., we see that the polynomial det A contains the monomial Q2N k=0 g(x0 + αk + βk) with coefficient 1, i.e. is nonzero. This result shows that there are (infinitely many) nontrivial polynomial constraints on the values of the functions g H(2) N . Importantly, these constraints are preserved under pointwise limits, so, in particular, H(2) N G0 \ G1. Taking N = 1 and αk = βk = vk, k = 0, 1, 2, we obtain a nontrivial constraint for the values of g on any nontrivial length-5 arithmetic progression xk = u + vk, in agreement with an earlier observation. In fact, thanks to the semi-linear structure of the family H(2) N , we can go further and give an explicit description of its limit points. It turns out that the limiting functions can only differ from the original combinations of sine waves by resonant factors xm: Theorem 4 (A). Let N be fixed. Then a function f is a limit point of H(2) N in the sense of uniform convergence on the segment [0, 1] if and only if m=0 c0mxm + m=0 ckmxm sin(ωkx + hkm), (6) where M0 0, Mk 1 for k = 1, . . . , K, and PK k=0 Mk = N. Resonant factors appear by taking suitable combinations of sine waves at close frequencies, e.g. x sin(ωx) = lim x 0( ω) 1(sin((ω + ω)x) sin(ωx)). We show explicitly how to construct resonances of arbitrary degrees subject to the constraint PK k=0 Mk = N. Note that the frequency ω = 0 corresponding to the first sum in expansion (6) is special as it is associated with a potential double-degree resonance. The more difficult part of the proof is the only if part, i.e. that any limiting function has the form (6). Our basic idea is to show that a limiting function is subject to a linear differential equation describing waves with a given set of frequencies; the general resonant form (6) would then appear from the usual solution of such an equation with nontrivial multiplicities of the frequencies. This strategy, however, is not so easy to implement rigorously, since in our setting the frequencies are unbounded. We give two alternative proofs overcoming this difficulty. The first, clustering-based proof, groups the frequencies found in a converging sequence fn H(2) N into spatially separated frequency clusters , and then shows that unbounded clusters can be removed without changing the limit. The second, discretization-based proof, starts with describing the discretized functions of H(2) N by suitable linear finite-difference equations which also hold for the limiting function f . Then, we consider these discretizations at different length scales and show that they can only be consistent with each other and with the continuity of the uniform limit f if representation (6) holds. 5 Polynomially-exponentially-algebraic expressions Theorem 3 shows that the values of functions H(2) N are subject to nontrivial polynomial constraints preserved under pointwise limits. We extend this argument to a much larger family of functions, involving complex algebraic operations. This generaliztion will allow us, in particular, to prove a related result for multi-layer neural networks. If U CN is an open domain, we say that a holomorphic function Q : U C is an algebraic function if there exist N-variable polynomials ϕ0, . . . , ϕq, not all zero, such that k=0 ϕk(z1, . . . , z N)Qk(z1, . . . , z N) = 0, z = (z1, . . . , z N) U. (7) The value q is called the degree of the algebraic function Q. Algebraic functions include, in particular, all polynomials and (on their holomorphy domains) rational functions and powers za with rational a. Arithmetic operations and compositions applied to algebraic function produce again algebraic functions. We define the family H(3) N,q,r,d as consisting of all functions g : [0, 1] R of the form g(x) = Q(x, e P1(x), . . . , e PN(x)), (8) where P1, . . . , PN are complex-valued polynomials of degree d, and Q is an algebraic function of degree q such that all polynomials ϕk appearing in its definition have degree r. Here, the functions Q are assumed to be defined on some domains U CN+1 that contain all the values (x, e P1(x), . . . , e PN(x)), x [0, 1]. Our main result is the following extension of Theorem 3. Theorem 5 (B). Fix N, q, r, d. Let m = (q + 1) N+r+1 N+1 + (max(1, d) + 1)(N + 1). Then there exists a nonzero polynomial R of m+1 variables such that for any g H(3) N,q,r,d and any a, b [0, 1] R(g(a), g(a + h), g(a + 2h), . . . , g(b)) = 0, h = b a As a corollary, since the polynomial R is the same for all g H(3) N,q,r,d, constraint (9) is preserved by pointwise limits, so H(3) N,q,r,d / G1. Our proof of Theorem 5 is inspired by the theory of algebraic-transcendent functions [16, 17, 8] that addresses representability of functions by algebraic differential equations. Our context is rather different: we are interested in the pointwise convergence of functions (or convergence in the uniform norm), so conditions involving derivatives are not applicable in our setting. We observe, however, that some elements of this theory can be extended from the usual to discretized derivatives. This is why, in particular, the conditions in Eq. (9) are formulated for a regular grid of points a, a + h, . . . , b. Since the polynomials Pn in the definition of H(3) N,q,r,d are allowed to be complex-valued, the families H(3) N,q,r,d subsume the sine wave families H(2) N considered in the previous section: H(2) N H(3) 2N,1,1,1. (10) 6 Branching expressions and neural networks We describe now an application of Theorem 5 to neural networks. We start by noting that many standard activation functions can be classified as piecewise polynomial or (piecewise) algebraicexponential1: Piecewise polynomial: Binary step function σ(x) = 1[x 0], Re LU σ(x) = max(0, x), leaky Re LU σ(x) = max(ax, x), squared Re LU σ(x) = max2(0, x) [24]. (Piecewise) algebraic-exponential: sin x, tanh x, standard sigmoid σ(x) = (1 + e x) 1, ELU σ(x) = a(ex 1), x < 0 x, x 0 [3], swish σ(x) = x/(1 + e x) [20], Gaussian σ(x) = e x2. Piecewise here means that R is in general divided into several subsets (in fact, up to two subsets in the above examples) on which the activation is defined by different analytic formulas. We will refer to this as branching . On each branch, activations of the first type are polynomials, and those of the second type can be represented in the form (8) with a polynomial or rational Q. Define the family H(4) N of functions g : [0, 1] R as realization, with various weight assignment, of the following class of feedforward neural networks. Suppose that the underlying directed acyclic graph of the network contains at most N neurons. Suppose that each hidden neuron is equipped with one of the piecewise polynomial or algebraic-exponential activation functions mentioned above (possibly different at different neurons). Suppose finally that any directed path from the input to the output neuron contains at most one (piecewise) algebraic-exponential neuron (for example, this is the case if the network has a single algebraic-exponential layer, while the other layers are piecewise polynomial). Then we have 1Of course, there also exist standard activations not of these two types, e.g. softplus σ(x) = ln(1 + ex) [6]. Theorem 6 (C). Given N, there exists m and a nontrivial polynomial R of m + 1 variables such that for any g H(4) N and any a, b [0, 1] R(g(a), g(a + h), g(a + 2h), . . . , g(b)) = 0, h = b a The main obstacle in deriving this theorem from Theorem 5 is the branching in the activations that destroys the global analytic structure of functions g H(4) N . We overcome this obstacle using the well-known Van der Waerden s theorem on arithmetic progressions. Theorem 7 ([27]; see, e.g., [26] for a short proof). For any integers s, p 1 there exists an integer Nvd W(s, p) 1 such that every coloring C : {1, ..., Nvd W} {1, ..., p} of {1, ..., Nvd W } into p colors contains at least one monochromatic arithmetic progression of length s (i.e. a progression in {1, ..., Nvd W } of cardinality s on which C is constant). 7 Compositions of sines and non-polynomial functions The polynomially-exponentially-algebraic structure (8) is actually close to being necessary for the polynomial constraints of Theorem 5, as we show by the following example. Let σ be a real analytic function on the interval [ 1, 1]. Let f(x) = c sin(ωσ(bx) + h) with parameters c, b, h R and b [ 1, 1]. Denote the respective family by Hσ. Theorem 8 (D). Hσ G1 if and only if σ is not a polynomial. This shows that the polynomial constraints of Theorem 5 are destroyed if we relax even slightly the requirement of polynomial arguments of the exponentials in Eq. (8). They are also destroyed if we replace the algebraic function Q by the sine function. While Hσ G1 for non-polynomial σ, we can show that Hσ is never in G2, by directly finding all limit points of Hσ under uniform limits. Let us call the limit points of Hσ not belonging to Hσ nontrivial. Theorem 9 (E). For a non-constant σ, the only nontrivial limit points of the family Hσ are a0 + arxr, aσ(bx) + c, and c sin(a0 + arxr), where r = arg min(m > 0 : dmσ dxm (0) = 0) and a0, ar, a, b, c R. In particular, Hσ / G2. The families Hσ with non-polynomial σ thus provide examples of elements of the class G1 \ G2 of finitelybut not globally-universal formulas. Consider now a more complex family H(5) N consisting of the functions f(x) = c sin(g(x)) + h, g H(2) N , (12) where H(2) N consists of linear combinations of N sine waves as defined earlier in Eq. (4). One can interpret the family H(5) N as a two-hidden-layer neural network with the sin activation function, N neurons in the first hidden layer, and a single neuron in the second hidden layer. Using the same approach of examining the limit points and invoking additional topological arguments, we prove that the family H(5) N belongs to the same class as Hσ: Theorem 10 (F). For any N 1, H(5) N G1 \ G2. It appears that the class G1 \ G2 is quite generic we conjecture that an analog of Theorem 10 holds for any multi-layer neural network with the sin activation and a bounded number of neurons. However, proving such a general result by explicitly describing the limit points as in Theorems 9, 10 is difficult. We are not aware of simple conditions allowing to separate G1 \ G2 from G2. In this context, let us discuss the structure of universal formulas mentioned in the Introduction. Boshernitzan s formula (1) includes the factor bd 1+d2 cos(bt) which is of the form (8) and hence in G1, and the transcendental composition cos(et). This agrees with our results showing that compositions of sines and non-polynomial functions are essential for universal formulas. However, Boshernitzan s formula also includes integration, which effectively serves to produce an approximation to a piecewise constant function. The neural network in Figure 1 and the formulas of [13] combine sines with arcsines, which serves to approximate periodic piecewise linear or piecewise constant functions. It seems that approximate piecewise constant functions are an essential element of existing universal formulas, and so one can conjecture that arcsine or a related inverse trigonometric function is a necessary component of an elementary universal formula without integration. In the fairly different context of recursive decidability, there is a remarkably similar sharp difference between polynomial-sine and more complex compositions [12]. Let S1 denote the ring generated by the integers and the expressions x, sin xn and sin(x sin xn)(n = 1, 2, . . .). Let also S2 denote the ring generated by the integers and the expressions sin xn and cos xn(n = 1, 2, . . .). Then the predicate that there exists a real x such that f(x) > 0 is undecidable for f S1, but decidable for f S2. 8 Discussion The expressiveness hierarchy. To analyze the necessary features of the structure of universal formulas, we have examined the chain of expressiveness classes G0 G1 G2 corresponding, respectively, to the families having the infinite VC-dimension, general approximability on all finite sets, and general global uniform approximability. Membership in the class G0 can be conveniently studied using the theory of Pfaffian functions, while, as we have shown, membership in the class G1 can be conveniently studied using algebraic methods. In particular, our results suggest that globally universal formulas need to involve suitable compositions of non-polynomial and sine operations. Transcendental structure of universal formulas. Our main result in Section 5 shows that combining a single layer of exponential and sine operations with polynomials and algebraic functions has a fairly weak expressiveness, in the sense that such formulas of bounded complexity are constrained by polynomial relations and do not even have general approximability on finite sets. In this family, the requirement on the operations preceding the exponential/sine layer is much stronger (polynomial) than the requirement on the subsequent operations (algebraic). As Theorem 8 shows, any non-polynomial operation preceding the sine operation and combined with linear operations restores the general finite approximability. General methods. All results presented in this paper that rule out global universality are essentially obtained by one of the three general methods: Khovansky s analytic theory of Pfaffian functions (Theorem 1), finding algebraic constraints (Theorems 3, 5, 6), and direct computations of limit functions (Theorems 4, 9, 10). It appears that all these methods are insufficient to conveniently separate the finite approximability class G1 \ G2 from the global approximability class G2. Only the last of the three methods seems useful for that, but it seems hard to find the full set of limit points for more complex families H similarly to how we did that for the linear combinations of sines (Theorems 4) and the families Hσ, H(5) N (Theorems 9, 10). It would be interesting to have a general method suitable for separating G1 \ G2 from G2. Limit points. Theorems 4 and 9 show that the nontrivial limit points of the families H(2) N and Hσ are some sort of resonances, reflecting some degeneracies of the model. The set of nontrivial limit points is small and explicit in these two cases. However, in a universal formula this set is the whole C([0, 1]). It would be interesting to see how general is this dichotomy, e.g. whether the set of nontrivial limit points of some formula can be a small but not explicitly describable subset of C([0, 1]). Neural networks. In Section 6 we described a general family of neural networks lying outside the class G1 and so non-universal. These networks can have a single layer of transcendental neurons (with activations such as sin, tanh, e x2) as well as some layers with piecewise-polynomial activations. We have shown that though the resulting functions are only piecewise analytic, they still admit polynomial constraints. It remains, however, an open question whether networks with multiple transcendental layers involving the sine activation can be universal. Theorem 8 shows that such networks belong to G1, while Theorem 10 shows that the two-hidden-layer sine network of bounded complexity and with a single neuron in the second hidden layer belongs to G1 \ G2. We conjecture that, without arcsin or other special functions that help create periodic piecewise-linear dependencies, general sin networks of fixed complexity belong to G1 \ G2. Acknowledgments The author thanks the anonymous reviewers for several useful suggestions. Research was supported by Russian Science Foundation, grant 21-11-00373. [1] Michael Boshernitzan. Universal formulae and universal differential equations. Annals of mathematics, 124(2):273 291, 1986. [2] R Creighton Buck. The solutions to a smooth PDE can be dense in C (I). Technical report, WISCONSIN UNIV-MADISON MATHEMATICS RESEARCH CENTER, 1980. [3] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). ar Xiv preprint ar Xiv:1511.07289, 2015. [4] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303 314, 1989. [5] Andrei Gabrielov and Nicolai Vorobjov. Complexity of computations with pfaffian and noetherian functions. Normal forms, bifurcations and finiteness problems in differential equations, 137:211 250, 2004. [6] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315 323, 2011. [7] Steven M Gonek and Hugh L Montgomery. Kronecker s approximation theorem. Indagationes Mathematicae, 27(2):506 523, 2016. [8] Erik Christian Hansen, Yonathan Stone, and Jesse Wolfson. Alexander Ostrowski s "On Dirichlet Series and Algebraic Differential Equations". ar Xiv preprint ar Xiv:2211.02088, 2022. [9] A. G. Khovanskii. Fewnomials. Vol. 88 of Translations of Mathematical Monographs. American Mathematical Society, 1991. [10] Andrei Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, volume 114, pages 953 956. Russian Academy of Sciences, 1957. [11] Leopold Kronecker. Näherungsweise ganzzahlige Auflösung linearer Gleichungen. Monats. Königl. Preuss. Akad. Wiss. Berlin, pages 1179 1193, 1271 1299, 1884. [12] Miklós Laczkovich. The removal of π from some undecidable problems involving elementary functions. Proceedings of the American Mathematical Society, 131(7):2235 2240, 2003. [13] Miklós Laczkovich and Imre Z Ruzsa. Elementary and integral-elementary functions. Illinois Journal of Mathematics, 44(1):161 182, 2000. [14] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861 867, 1993. [15] Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by mlp neural networks. Neurocomputing, 25(1-3):81 91, 1999. [16] Eliakim Hastings Moore. Concerning transcendentally transcendental functions. Mathematische Annalen, 48(1-2):49 74, 1896. [17] Alexander Ostrowski. Über dirichletsche Reihen und algebraische Differentialgleichungen. Mathematische Zeitschrift, 8(3):241 298, 1920. [18] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143 195, 1999. [19] Amaury Pouly and Olivier Bournez. A universal ordinary differential equation. Logical Methods in Computer Science, 16, 2020. [20] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. ar Xiv preprint ar Xiv:1710.05941, 2017. [21] Lee A Rubel. A universal differential equation. Bulletin (new series) of the american mathematical society, 4(3):345 349, 1981. [22] Lee A Rubel. The extended analog computer. Advances in Applied Mathematics, 14(1):39 50, 1993. [23] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462 7473, 2020. [24] David R So, Wojciech Ma nke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V Le. Primer: Searching for efficient transformers for language modeling. ar Xiv preprint ar Xiv:2109.08668, 2021. [25] Eduardo D Sontag. Shattering all sets of k points in general position requires (k 1)/2 parameters. Neural Computation, 9(2):337 348, 1997. [26] Terence Tao. A quantitative ergodic theory proof of Szemerédi s theorem. ar Xiv preprint math/0405251, 2004. [27] Bartel Leendert Van der Waerden. Beweis einer baudetschen Vermutung. Nieuw Arch. Wiskunde, 15:212 216, 1927. [28] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999. [29] Dmitry Yarotsky. Elementary superexpressive activations. In International Conference on Machine Learning, pages 11932 11940. PMLR, 2021. [30] Thierry Zell. Betti numbers of semi-pfaffian sets. Journal of Pure and Applied Algebra, 139(1-3):323 338, 1999. 1 Introduction 1 2 The hierarchy of expressiveness classes 3 3 Pfaffian functions 4 4 Linear combinations of sines 5 5 Polynomially-exponentially-algebraic expressions 6 6 Branching expressions and neural networks 7 7 Compositions of sines and non-polynomial functions 8 8 Discussion 9 References 10 A Proof of Theorem 4 12 B Proof of Theorem 5 21 C Proof of Theorem 6 24 D Proof of Theorem 8 25 E Proof of Theorem 9 25 F Proof of Theorem 10 26 A Proof of Theorem 4 Sufficiency (each function (6) is a limit point of H(2) N ). First observe that each resonant component xm sin(ωx + h) is a uniform limit of m + 1 sine waves linearly combined to match the discretized derivative dm dωm sin(ωx + h πm/2): xm sin(ωx + h) = ds dωm sin(ωx + h πm/2) (13) = lim ω 0( ω) m m X n=0 ( 1)m n m n sin((ω + n ω)x + h πm/2). (14) Note also that any finite linear combination PM 1 m=0 cm sin(ωx + hm) can be written as a single sine wave c sin(ωx + h) with suitable amplitude c and phase h. It follows that we can obtain any linear combination PM 1 m=0 cmxm sin(ωx + hm) as a uniform ω 0 limit of linear combinations m=0 cm, ω sin((ω + m ω)x + hm, ω) (15) with suitable values of cm, ω and hm, ω. This shows in turn that the second component of expression (6), K X m=0 ckmxm sin(ωkx + hkm), (16) can be obtained by limits of linear combinations of PK k=1 Mk sine waves with suitable phases and frequencies. It remains to show that the first component of expression (6), P2M0 1 m=0 c0mxm, can be obtained as a limit of linear combinations of M0 sine waves. Observe that for any constant x0 (x x0)2M0 1 = lim ω 0( ω)2M0 1 sin2M0 1( ωx ωx0). (17) Since 2M0 1 is odd, sin2M0 1( ωx ωx0) is a linear combination of M0 sine waves with frequencies (2s 1) ω, s = 1, . . . , M0 (and some phases). By a previously given argument, this implies that any linear combination of monomials (x x0)2M0 1 can also be obtained as a limit of some linear combinations of M0 sine waves with frequencies (2s 1) ω, s = 1, . . . , M0. But the monomials (x x0)2M0 1 linearly span the space of polynomials of degree 2M0 1. It follows that any such polynomial can be obtained as a limit of linear combinations of M0 sine waves, as claimed. Necessity (each limit point of H(2) N has the from (6)). We give two alternative proofs. Proof 1 ( clustering-based ). Suppose that a function f is the uniform limit of functions fr H(2) N . We identify the different frequency clusters associated with the sequence fr. Let 0 ω1,r . . . ωN,r be the N sorted frequencies appearing in the sine wave expansion of fr; we assume here without loss of generality that all the frequencies are nonnegative. For each n = 1, . . . , N 1 one of the following three options holds: 1. either ωn+1,r ωn,r r ; 2. or supn(ωn+1,r ωn,r) < ; 3. or none of the above holds. For each n, by taking a subsequence of fr, we can exclude option 3. By taking subsequences N times, we can then assume without loss of generality that for each n either option 1 or option 2 holds. It follows that we can partition the set {1, . . . , N} into contiguous subsets I1 = {1, 2, . . . , i1}, I2 = {i1 + 1, . . . , i2}, . . . , Is = {is 1 + 1, . . . , N} such that for some constant Ω |ωn,r ωn ,r| (r , if t : n, n It; Ω r, if t : n, n It. (18) The sets It thus represent the largest clusters of frequencies that remain at a bounded distance from each other in the limit r . Clearly, there is at most one cluster whose frequencies remain bounded in the limit r , namely It0 with t0 = 1. It will be convenient to assume that It0 has bounded frequencies but possibly is empty. For any M = 1, 2, . . . and Ω> 0 let us denote by H(2) M,Ωthe set of functions f : [0, 1] C of the form n=1 cneiωnx, (19) where cn C are arbitrary amplitudes and ωn [ Ω, Ω]. Lemma 1. For any f H(2) M,Ωand n = 1, 2, . . . max x [0,1] |f (n)(x)| 2M(M+1)(1 + Ω)M max(n,M 1) max x [0,1] |f(x)|. (20) We remark that the constant in Eq. (20) is likely suboptimal, but it must certainly grow with M, since, as the proof of the sufficiency part of Theorem 4 shows, arbitrarily narrow frequency bands are sufficient to generate any degree-M polynomial. Proof. The function f satisfies the differential equation Df = 0, where dx iωn) = d M dx M with some constants bn. Consider the vector-valued function f(x) = (f(x), f (x), . . . , f (M 1)(x))T . It satisfies the differential equation d dxf = Af (22) with the matrix A = (anm)M 1 n,m=0, where 1, n + 1 = m and n < M 1, bm, n = M 1, 0, otherwise. (23) The solution of this equation is f(x) = e A(x x0)f(x0). (24) If is any norm on the CM, then f(x) f(x0) (e A |x x0| 1) f(x0) , (25) where A is the respective matrix norm. Let be the maximum (l ) norm, then m |anm| = max X m=1 (1 + |ωm|) (1 + Ω)M. (26) Now let x0 = arg maxx [0,1] f(x) and let n0 = arg maxn=0,...,M 1 |f (n)(x0)| so that f(x0) = |f (n0)(x0)|. If n0 = 0, then max x [0,1] |f (x)| max x [0,1] |f(x)|. (27) Suppose that n0 > 0. Observe that if |x x0| ln 2 2 (1 + Ω) M, then |f (n0)(x) f n0(x0)| f(x) f(x0) (28) (e A |x x0| 1) f(x0) (29) (e(1+Ω)M|x x0| 1)|f (n0)(x0)| (30) 2|f (n0)(x0)|. (31) Consider the complex unit direction z0 = f (n0)(x0) |f (n0)(x0)| and denote l = ln 2 2 (1 + Ω) M. For all x such that |x x0| l we have ℜ(z0f (n0)(x)) 1 2|f (n0)(x0)|. (32) It follows that we can find in [0, 1] a subinterval of length l/4 on which either ℜ(z0f (n0 1)(x)) l 2 4|f (n0)(x0)| (33) or ℜ(z0f (n0 1)(x)) l 2 4|f (n0)(x0)|. (34) Making another iteration, we can find in [0, 1] a subinterval of length l/42 on which either ℜ(z0f (n0 2)(x)) l 42 l 2 4|f (n0)(x0)| (35) or ℜ(z0f (n0 2)(x)) l 42 l 2 4|f (n0)(x0)|. (36) Continuing in this way, for any n = n0, n0 1, . . . , 0 we find in [0, 1] a subinterval of length l/4n0 n on which either ℜ(z0f (n0 n)(x)) ln0 n 2(n0 n)(n0 n+1)+1 |f (n0)(x0)| (37) or ℜ(z0f (n0 n)(x)) ln0 n 2(n0 n)(n0 n+1)+1 |f (n0)(x0)|. (38) Taking n = 0, this implies |f (n0)(x0)| 2n0(n0+1)+1l n0 max x [0,1] |f(x)| (39) and so, by definition of x0 and n0, for any n = 0, . . . , M 1 max x [0,1] |f (n)(x)| = |f (n0)(x0)| (40) 2n0(n0+1)+1l n0 max x [0,1] |f(x)| (41) 2(M 1)M+1((2/ ln 2)(1 + Ω)M)M 1 max x [0,1] |f(x)| (42) 2M(M+1)(1 + Ω)M(M 1) max x [0,1] |f(x)|. (43) This proves the claim for n M 1. If n M, then we use again Eq. (21) and the bound P m |bm| (1 + Ω)M. Lemma 2. The L and L2 norms are equivalent on the set H(2) M,Ω: for any f H(2) M,Ω f 2 f 2[2(1 + Ω)]M(M+1)/2 f 2 (44) Proof. Only the second inequality is nontrivial. By Lemma 1, f c f , c = [2(1 + Ω)]M(M+1). (45) It x0 = arg maxx [0,1] |f(x)|, then for |x x0| 1/c we have |f(x)| (1 |x x0|c)|f(x0)| and so f 2 2 Z 1/c 0 (1 tc)2|f(x0)|2dt = 1 3c f 2 , (46) implying the desired result. Consider now the expansion of fr over different frequency clusters: t=1 ft,r, (47) where ft,r denotes the It-component of fr. We first show that we can discard the components associated with unbounded frequencies. Lemma 3. If fr converge to f uniformly on [0, 1], then ft0,r also converge to f uniformly on [0, 1]. Proof. Step 1. Let us first prove the statement in the case f = 0. Observe that each ft,r can be represented as ft,r = X q= 1 eiqωt,rxgt,r,q, (48) where gt,r,1 = gt,r, 1 and gt,r,q H(2) 2N,Ωwith some Ωindependent of t and r. We use this representation for all t except t = t0 (since ft0,r H(2) 2N,Ωalready): fr = ft0,r + q= 1 eiqωt,rxgt,r,q. (49) fr 2 2 = fr, fr = ft0,r 2 2 + q= 1 gt,r,q 2 2 + cross-terms. (50) Observe that a product of g, h H(2) M,Ωbelongs to H(2) M 2,2Ω. It follows that each cross-term can be written as ei ωr, gr , (51) where ωr r and gr H(2) (2N)2,2Ω; the function gr being a product of two functions out of the collection F = {ft0,r} {(gt,r,q)t =t0,q= 1}. We have | ei ωr, gr | = Z 1 0 e i ωrgr(x)dx (52) = | ωr| 1 Z 1 0 gr(x)de i ωr (53) | ωr| 1 2 gr + Z 1 0 e i ωrg r(x)dx (54) | ωr| 1(2 + c) gr , (55) where we used the inequality g r c gr with a constants c = c(N, Ω) as established in Lemma 1. Observe that if gr is a product of two functions from the collection F, say g, h, then we have gr g h C g 2 h 2, (56) with another suitable constant C = C(N, Ω), by Lemma 2. It follows that the cross-terms in Eq. (50) are dominated by the explicitly written terms: |cross-terms| = o ft0,r 2 2 + q= 1 gt,r,q 2 2 , r . (57) Therefore, if fr = o(1) as r , then we also have ft0,r = O( ft0,r 2) = O( fr 2) = O( fr ) = o(1) (58) and similarly all gt,r,q = o(1). (59) Step 2. Now we prove the result for a general f . If fr f 2 0, then the sequence fr is Cauchy: lim r1 lim r2 fr1 fr2 2 = 0. (60) fr1 fr2 = (ft0,r1 ft0,r2) + q= 1 eiqωt,r1xgt,r1,q q= 1 eiqωt,r2xgt,r2,q. (61) Assuming that r2 = r2(r1) and grows sufficiently fast, the frequencies ωt,r1, ωt,r2 and 0 (corresponding to the first term) will form disjoint clusters. We can then use the argument from Step 1 and conclude that all gt,r,q = o(1), implying the desired result. The last lemma shows that any uniform limit of functions fr H(2) N is also a uniform limit of band-limited functions fr H(2) 2M,Ωfor some Ω< . Any function fr H(2) N is a solution of the differential equation Drfr = 0, (62) where dx iωnr)( d dx + iωnr). (63) If the frequencies ωnr are bounded, by compactness we can find their limit points ωn. We argue now that the limiting function f is a solution of the corresponding limiting equation Df = 0 with dx + iωn). (64) Indeed, such solutions have the form f(x) = e A(x x0)f(x0), (65) (cf. Eq. (24)), while the functions fr can be written as fr(x) = e Ar(x x0)fr(x0), (66) with respective matrices Ar. As r , the matrices Ar converge (by taking a subsequence, if necessary) to A, and, by Lemma 1, fr(x0) f(x0). It follows that the function f indeed is representable in the form (65), i.e. is a solution of equation Df = 0. This immediately implies the claim of the theorem (with isolated zero limiting frequencies ωn = 0). Proof 2 ( discretization-based ). Our second proof essentially relies on discretizations of the function f and will consist of two steps. In the first step we introduce the discretizations, show that they satisfy discrete difference equations, and describe general solutions of these equations. In the second step we analyze how discretizations at different length scales agree with each other and show that the continuity of f enforces representation (6) in the zero spacing limit. Step 1: Discretization. Suppose that the function f is the uniform limit of functions fr H(2) N , r = 1, 2, . . . Fix some x > 0 and consider the sequences u = (u s) 1/ x s=0 and u(r) = (u(r) s ) 1/ x s=0 obtained by restricting these functions to the points s x: u s = f (s x), u(r) s = fr(s x), s = 0, 1, . . . , 1/ x . (67) Since fr H(2) N , each sequence u(r) can be written as a linear combination n=1 bnreiωnrs x + bnre iωnrs x, s = 0, 1, . . . , 1/ x (68) with some complex amplitudes bnr C and frequencies ωnr R (the horizontal bar denotes complex conjugation). We want to perform now the limit r and derive a similar representation for u . To this end observe that, up to the boundary values, a sequence of the form us = beiωs x satisfies the finite difference equation (T eiω x)u = 0, (69) where T is the left shift: (Tu)s = us+1. (70) Since each sequence u(r) has the form of linear combination (68) and the operators T eiω x commute, it follows that u(r) satisfies the finite difference equation D(r)u(r) def = n=1 [(T eiωnr x)(T e iωnr x)]u(r) = 0. (71) This equation holds up to the 2N boundary components u(r) s that are moved by the shifts T contained in D(r) outside the domain [0, 1] of the functions fr. We would like to perform the limit r in this difference equation, but we cannot directly perform it in the exponents iωnr x, since the frequencies ωnr may be unbounded and not have any limit points. However, the values eiωnr x belong to the unit circle in C, which is a compact set. Therefore, we can choose a subsequence rt such that eiωnrt x z n as t , with some z n C, |z n| = 1. Performing this limit in the difference equation, we then see that the limiting sequence u satisfies a difference equation n=1 [(T z n)(T z n)]u = 0. (72) A general solution of this equation is again a linear combination of sine waves, but with possible resonances due to repeated values z n. Suppose that the numbers z n contain K distinct nonreal values z1, . . . , z K with multiplicities Mk 1. Additionally, let M 0 be the multiplicities of the real values z = 1 among the numbers z n. We can then write the general real solution of equation (72) as m=0 b ,msm( 1)s + m=0 b+,msm + m=0 sm(bkmzs k + bkmzk s), (73) with M + M+ + PK k=1 Mk = N. Equivalently, we can write m=0 b ,mv ,m + m=0 b+,mv+,m + m=0 (bkmvkm + bkmvkm), (74) where we have introduced the vectors v ,m, vkm, vkm C 1/ x +1 with components (v ,m)s = sm( 1)s, (vkm)s = smzs k, (vkm)s = smzk s. (75) The coefficients bkm and b ,m in expansions (73), (74) are determined uniquely as long as x is sufficiently small: Lemma 4. Suppose that x 1/(2N 1) so that dim u = 1/ x + 1 2N. Then the 2N vectors v ,m, vkm and vkm are linearly independent and the coefficients b ,m, bkm are uniquely determined. Proof. For 1 q Mk, consider the finite difference operators Dk,q obtained by removing the factor (T zk)q from the operator D that appears in Eq. (72): Dk,q = D (T zk)q . (76) When applied to u , this operator kills all the components of the expansions (73), (74) except those associated with zk and having degree m Mk q. In particular, (Dk,1u )s = akbk,Mk 1zs k (77) with the explicit coefficient ak = (Mk 1)!(zk zk)Mk Y k { } {1,...,K}\{k} ((zk zk )(zk zk ))Mk = 0. (78) We can then find the coefficient bk,Mk 1 from Eq. (77) if there is at least one s for which it holds (recall that it is not applicable to a few boundary values of s). The highest power of the shift T appearing in Dk,1 is T 2N 1, so such s exists as long as dim u 2N. Having found bk,Mk 1, we can subtract the corresponding component bk,Mk 1vk,Mk 1 from u and apply to the remainder the operator Dk,2, which will kill all terms except bk,Mk 2vk,Mk 2. This allows to determine the next coefficient bk,Mk 2. Continuing this process and extending it to other zk and z , we uniquely recover all the coefficients b ,m, bkm. The linear independence of the vectors v ,m, vkm and vkm follows from this uniqueness. Step 2: The limit x 0. We consider now discretizations with variable spacing x. Our goal is to show that expansions (73) lead to the desired representation (6) in the limit x 0. In the notation, we will occasionally add x to the subscripts to reflect the dependence of various quantities on the discretization spacing. It is convenient to rewrite expansion (73) by rescaling the coefficients via bkm = ckm( x)m and introducing frequencies ωk and ω such that zk = eiωk x and 1 = eiω x, so as to make the expansion less x-dependent: u s = f (s x) (79) m=0 c ,m(s x)meiω s x + m=0 c+,m(s x)m (80) m=0 (s x)m(ckmeiωks x + ckme iωks x). (81) The frequencies ωk, ω are determined from the values zk, x only up to integer multiples of 2π/ x: ωk Wk, w def = n i ln(zk, x) + 2πl ω W , w def = nπ + 2πl Observe that expansion (79)-(81) is invariant under downscaling, in the following sense. Since s appears in it only through the combination s x, if this expansion holds for some x, then it also holds for any integer multiple x = l x, with the same coefficients c ,m, ckm and frequencies ω , ωk. To discuss this effect in more detail, we argue now that we can establish a correspondence between different values zk across different scales. In general, downscaling is accompanied by aliasing: if the difference equation (72) holds with the values z n, x on the length scale x, then on the integer multiple length scale l x it holds with the values z n,l x = (z n, x)l. (84) In particular, the values zk, x that are distinct on the length scale x may collide when mapped to the length scale l x. However, in our setting we can bound the number of collisions thanks to the finiteness of the set of frequencies. Specifically, consider length scales of the form 2 t x, with t = 0, 1, . . . The total number of collisions over all t is bounded, since there are at most K = N distinct values zk. It follows that at sufficiently large t there are no collisions, so that the number K = Kt of distinct values zk,2 t x is the same for all t, and we can enumerate these values so that for t2 > t1 zk,2 t1 x = z2t2 t1 k,2 t2 x. (85) Moreover, observe that for sufficiently large t there will be no ω -component (corresponding to z = 1) in the (79)-(81). Indeed, since the square roots of 1 are not real, the presence of the ω -component for some t would mean that Kt+1 > Kt. But such an increase is possible only for a finite number of t s. Similarly, the ω+-component of the expansion (79)-(81) must have the form f ,0(s x) = m=0 c+,m(s x)m (86) with the same c+,m for all sufficiently large t, since any change of this component with t is accompanied by an increase in Kt. The function f ,0 uniquely extends to a continuous function on the whole interval [0, 1], m=0 c+,mxm, (87) which agrees with the first component in the expansion (6). This discussion shows that without loss of generality (by subtracting f ,0 from f if necessary), we may assume that expansion (79)-(81) contains only the last component, standing in line (81). Assuming that transition from the length scale x to x = l x occurs without collisions in zk, the downscaling invariance condition can be written as ckm, x = ckm, x , (88) Wk, x Wk, x , (89) where the sets Wk, x are defined by Eq. (82). By Lemma 4, the coefficients ckm, x are uniquely determined as soon as x and x are small enough. Considering again the length scales of the form 2 t x, we get a nested sequence . . . Wk,2 2 x Wk,2 1 x Wk, x. (90) All the inclusions here are strict. The intersection Wk,lim = t=0Wk,2 t x (91) is either empty or consists of a single frequency {ωk}. In the latter case we can again (as previously with f ,0) identify the respective component of the function f : m=0 xm(ckmeiωkx + ckme iωkx). (92) Our goal is to prove the that the intersections Wk,lim are indeed nonempty for all k. We will obtain this as a consequence of the continuity of the function f , which in turn follows from the fact that f is assumed to be a uniform limit of continuous functions fr. To this end, observe that when upscaling from 2 t x to 2 t 1 x, the value zk,2 t 1 x is one of the two square roots of zk,2 t x. The case Wk,lim = corresponds to the case when for all sufficiently large t we take the square root with positive real part, so that zk,2 t x t 1. (93) In contrast, in the case Wk,lim = , for any t0 there is t > t0 such that zk,2 t 1 x has a negative real part, so that t 1. (94) Now consider the vector u obtained by sampling the function f at the points s x as in Eq. (67). Consider the sequence u( ,t) of vectors of the same dimension, obtained by sampling the function f with additional shifts 2 t x: u( ,t) s = f ((s + 2 t) x), s = 0, 1, . . . , dim u 1. (95) Since f is continuous, u( ,t) t u . (96) Recall that the vector u admits an expansion (74) over the vectors v ,m, vkm, vkm. As mentioned earlier, we can assume without loss of generality that there are no v ,m terms: m=0 (bkmvkm + bkmvkm). (97) Observe now that an analogous expansion can also be written for the shifted vectors: m=0 (bkmv(t) km + bkmv(t) km). (98) Moreover, the vectors v(t) km are either exact or approximate multiples of the vectors vkm: v(t) km = zk,2 t xvkm, m = 0, zk,2 t xvkm + o(1) (t ), m > 0. (99) By Lemma 4, the vectors vkm and vkm are linearly independent for sufficiently small x, so convergence (96) requires all those coefficients zk,2 t x for which bkm = 0 at least for one m to converge to 1 (as in Eq. (93)). This proves that all Wk,lim are non-empty for such k. If for some k we have bkm = 0 for all m, then this k-component is effectively missing and can be ignored. This completes the proof of the theorem. B Proof of Theorem 5 Our proof will consist of three steps. In the first step we show that the polynomial constraint claimed in the theorem holds for functions of the form e P (x). In the second step we show that if polynomial constraints hold for the arguments of an algebraic function, then its values also admit a polynomial constraint. In the final third step we combine these results to complete the proof of the theorem. We remark that our arguments are close to the arguments of the theory of algebraic-transcendent functions [16, 17], but we work in a discretized setting of uniformly sampled data without considering any derivatives. Step 1: Polynomial constraints for exponentials of polynomials. Consider the function g(x) = e P (x), (100) where P is a (possibly complex-valued) polynomial of degree d. Consider operators Ds h of forward discrete derivatives of order s with step h: Ds hg(x) = 1 h(Ds 1 h g(x + h) Ds 1 h g(x)) = . . . = h s s X k=0 ( 1)s k s k g(x + kh). (101) We have Dd+1 h P = 0 for any h C and any polynomial P of degree d : k=0 ( 1)d+1 k d + 1 k P(x + kh) = 0. (102) Separating the positive and negative coefficients, we can write this, equivalently, as P(x + 2kh) = d + 1 2m + 1 P(x + (2m + 1)h). (103) Exponentiating, it follows for g(x) given by Eq. (100) that k=0 (g(x + 2kh))( d+1 2k ) m=0 (g(x + (2m + 1)h))( d+1 2m+1) = 0. (104) We can view this as a polynomial constraint of the form R(g(x), g(x + h), . . . , g(x + (d + 1)h)) = 0 (105) with some fixed polynomial R of d + 2 variables. Step 2: Extension of polynomial constraints to algebraic functions. 1. Suppose that Q(y0, y1, . . . , y N) is an algebraic function, and g0(x), . . . , g N(x) are functions satisfying polynomial relations Rn(gn(x), gn(x + h), . . . , gn(x + sh)) = 0, x, h. (106) Let g(x) = Q(g0(x), . . . , g N(x)) and m s(N + 1). Then there exists a nontrivial polynomial R of m + 1 variables such that R(g(x), g(x + h), . . . , g(x + mh)) = 0, x, h. (107) 2. Moreover, suppose that the function Q is not fixed, and is only constrained by its degree deg Q q and the degrees deg ϕk r of the polynomials appearing in its definition. Let m (q + 1) N+r+1 N+1 + s(N + 1). Then there exists a nontrivial polynomial R of m + 1 variables such that relation (107) holds for all such Q. Proof. 1. Recall the defining relation (7) of the algebraic function Q : k=0 ϕk(z0, . . . , z N)Qk(z0, . . . , z N) = 0. (108) We can use this relation to express Qq in terms of lower powers Q0, . . . , Qq 1: Qq(z0, . . . , z N) = ϕ 1 q (z0, . . . , z N) k=0 ϕk(z0, . . . , z N)Qk(z0, . . . , z N). (109) By iterating this relation, for any power t q we can express Qt in terms of the powers Q0, . . . , Qq 1: Qt(z0, . . . , z N) = ϕ t q (z0, . . . , z N) k=0 ϕt,k(z0, . . . , z N)Qk(z0, . . . , z N), (110) where ϕt,k are suitable polynomials of degree not greater than t maxk=0,...,q deg ϕk. Substituting zn = gn(x), Qt(g(x)) = ϕ t q (g(x)) k=0 ϕt,k(g(x))Qk(g(x)), (111) where for brevity we have denoted g(x) = (g0(x), . . . , g N(x)). Formula (111) remains valid if we replace x by x + lh with any l. Consider various monomials in the variables Q(g(x + lh)) for l = 0, . . . , m : l=0 Qtl(g(x + lh)) (112) l=0 ϕ tl q (g(x + lh)) q 1 X km=0 eΦt,k(g(x), . . . , g(x + mh))Qk, (113) where t = (t0, . . . , tm), k = (k0, . . . , km), eΦt,k are some polynomials of degree not greater than (m + 1)(maxl tl) maxk deg ϕk, and l=0 Qkl(g(x + lh)). (114) Our goal is to prove that for sufficiently large m, the monomials Ft are linearly dependent (with some coefficients independent of x and h), which is exactly equivalent to the existence of a polynomial R satisfying relation (107). This is done by showing a linear dependence of the expressions (113) with different t. Specifically, observe that, regardless of t, expressions (113) contain only finitely many (namely, qm+1) different components Qk. Since there are infinitely many monomials Ft, it follows already from this observation that there exist nontrivial vanishing linear combinations of the monomials Ft if the coefficients in these linear combinations are allowed to belong to the field of all rational functions of g(x), . . . , g(x + mh). However, we need to show that nontrivial vanishing linear combinations can be found even with constant coefficients. To this end, we will further reduce the combinatorial complexity of expressions (113) by removing high powers of the variables gn(x + lh) at the expense of introducing new rational functions of the variables gn(x + l h) with l > l. Recall that each gn is subject to polynomial relation (106). We can use these relations to transform polynomial functions of g(x) into rational functions of g(x), g(x + h), . . . , g(x + sh) that contain only bounded powers of g(x). Specifically, isolate the variables gn(x) in the polynomials Rn by writing equations (106) in the form j=0 ψn,j(gn(x + h), . . . , gn(x + sh)))gj n(x) = 0 (115) with suitable polynomials ψn,j of the remaining variables gn(x + h), . . . , gn(x + sh). The value dn is the degree of gn(x) in this representation. Then, similarly to how we did this for Q, we can express gdn n (x) = ψ 1 n,dn(gn(x + h), . . . , gn(x + sh))) j=0 ψn,j(gn(x + h), . . . , gn(x + sh)))gj n(x). (116) By iterating this relation, for any t dn we get gt n(x) = ψ t n,dn(gn(x + h), . . . , gn(x + sh))) j=0 ψn,j,t(gn(x + h), . . . , gn(x + sh)))gj n(x), with some polynomials ψn,j,t of degree not greater than t maxj deg ψn,j. This relation remains valid if we replace x by x + lh with any l: gt n(x + lh) = ψ t n,dn(gn(x + (l + 1)h), . . . , gn(x + (l + s)h))) (118) j=0 ψn,j,t(gn(x + (l + 1)h), . . . , gn(x + (l + s)h)))gj n(x + lh). (119) We can now transform expression (113) by first replacing there all powers t dn of gn(x) with powers t dn and rational functions of gn(x + h), . . . , gn(x + sh), then replacing powers t dn of gn(x + h) with powers t dn and rational functions of gn(x + 2h), . . . , gn(x + (s + 1)h), and so on. We continue this process while the involved variables g(x + lh) obey the condition l m. As a result, we obtain a rational function of the variables g(x), . . . , g(x + mh) that contains only powers of the variables gn(x), . . . , gn(x + (m s)h) not exceeding dn 1. To establish the desired linear dependence, we want the rational functions appearing in these transformations of the monomials Ft to have the same denominator for all t. It is convenient to ensure this in the following way. Choose some large T and assume that all the powers tk in the monomials Ft vary between 0 and T, so that in total we have (T + 1)m+1 monomials. In expression (113), instead of the denominator Qm l=0 ϕt q(g(x + lh)) use the common denominator l=0 ϕT q (g(x + lh)). (120) This changes the polynomials eΦt,k : Ft = Z 1 0 X k Φt,k(g(x), . . . , g(x + mh))Qk; (121) the new polynomials Φt,k now have degrees not exceeding (m+1)T maxk deg ϕk. Now we perform the iterative substitutions of the powers gt n(x + km) in the polynomials Φt,k as described above, while keeping the same denominator for all t; this requires multiplying the common denominator by appropriate powers of ψn,dn(gn(x + (l + 1)h), . . . , gn(x + (l + s)h))). As a result of this process we obtain expressions n=0,...,N l=0,...,m s g Jnl n (x + lh) Ψt,k,J(g(x + (m s + 1)h), . . . , g(x + mh)). Here J = (Jnl) is a matrix with elements 0 Jnl dn 1, so summation over J is finite. The polynomials Ψt,k,J only depend on the remaining variables g(x + (m s + 1)h), . . . , g(x + mh). Importantly, by taking into account previously observed bounds on the degrees of polynomials appearing in Eqs. (119) and (121), we observe that deg Ψt,k,J CT with some constant C depending on s, m, N and the degrees of the polynomials ϕk and ψn,j appearing in the assumed expansions (108) and (115). Now the desired linear dependence of the monomials Ft results, with a suitable choice of m and T, by comparing the number (T + 1)m+1 of different monomials Ft with the number of different monomials that may appear in the expansions of the polynomials Ψt,k,J. Since these polynomials depend on the s(N + 1) variables g(x + (m s + 1)h), . . . , g(x + mh) and have degrees not exceeding CT, the number of different monomials in them does not exceed C1T s(N+1) with some constant C1 depending on the same parameters as C. Taking into account that the summations over k and J in expansion (122) involve, respectively, qm+1 and QN n=0 dm s+1 n terms, we see that the (T + 1)m+1 monomials Ft belong to a C1qm+1(QN n=0 dm s+1 n )T s(N+1)-dimensional linear functional space. Accordingly, if s(N + 1) < m + 1, then some of the monomials Ft will be linearly dependent if T is large enough. This completes the proof of part 1 of the lemma. 2. The argument above produces nontrivial linear combinations of monomials Ft with coefficients generally dependent on Q. We show now that, assuming appropriate degree bounds for Q and choosing a larger m, one can find nontrivial linear combinations with Q-independent coefficients. To this end, note that the main obstacle to choosing Q-independent coefficients is that the polynomials Ψt,k,J in expansion (122) are Q-dependent. To overcome this obstacle, we express these polynomials both in the variables g(x + (m s + 1)h), . . . , g(x + mh) and the coefficients b of the polynomials ϕk: Ψt,k,J(g(x + (m s + 1)h), . . . , g(x + mh)) = eΨt,k,J(b, g(x + (m s + 1)h), . . . , g(x + mh)). (123) Here, the vector b includes all the coefficients of all q + 1 polynomials ϕk appearing in the defining identity (108). By retracing the steps to construct the polynomials Ψt,k,J, we see that their dependence on the coefficients b is indeed polynomial. Moreover, deg eΨt,k,J is again bounded by CT, where C is some constant possibly dependent on N, s, m, q, max deg ϕk, maxj deg ψn,j, but not on T. Since we assume deg ϕk r, we have dim b (q + 1) N+r+1 N+1 . The total number of variables in eΨt,k,J is thus not greater than p = (q + 1) N + r + 1 N + 1 + s(N + 1). (124) We repeat now the dimensionality-based linear dependence argument. The total dimensionality of polynomials eΨt,k,J considered at all k, J is O(T p). On the other hand, the number of monomials Ft is (T + 1)m+1. If we set m p, then for sufficiently large T we can find a nontrivial vanishing linear combination of Ft with coefficients independent of Q. Step 3: Proof of Theorem 5. We apply now Lemma 5 to our setting in which g0(x) = x and gk(x) = e Pk(x), k = 1, . . . , N. We know from Step 1 that for k = 1, . . . , N conditions (106) hold with s = deg Pk + 1. For g0, condition (106) holds with s = 2 : g0(x) 2g0(x + h) + g0(x + 2h) = 0. (125) Since we assumed that deg Pk d, we can set s = max(d, 1) + 1 as a common value with which condition (106) holds for all n = 0, . . . , N. The conclusion of Theorem 5 then follows from statement 2 of Lemma 5. C Proof of Theorem 6 Since the network has not more than 2N hidden neurons and each of our activation functions has at most two branches, the output g(x) of the network at any x belongs to one of at most 2N branches represented by analytic expressions. Consider each of these branches as defined for all x [0, 1] and denote the family of all branches resulting from all g H(4) N by e H(4) N . Observe that e H(4) N H(3) N ,q,r,d if N , q, r, d are large enough. Indeed, the input of any algebraic-exponential neuron is a polynomial of degree not greater than d = 2N (if the squared Re LU is present; otherwise, if only the step function and Re LU/leaky Re LU are present, then one can take d = 1). All the considered algebraic-exponential activations are rational functions of x, ex or e ix, and the subsequent neurons are polynomial, so any branch of the full network can be written in the form (8) with a rational Q and 2N as the number of its exponential arguments. Rational Q s have algebraic degree q = 1. The maximum degree of the polynomials ϕk appearing in the defining relation (7) does not exceed r = 2N (again, if the squared Re LU is present; otherwise one can take r = 1). Summarizing, e H(4) N H(3) 2N,1,2N,2N . (126) Theorem 5 implies now that there exists some em and a nonzero polynomial e R of em+1 variables such that for any eg e H(4) N and any size-( em + 1) arithmetic progression ea, ea + eh, . . . , a + emeh [0, 1] e R(eg(ea), eg(ea + eh), eg(ea + 2eh), . . . , eg(ea + emeh)) = 0. (127) We need to obtain a similar relation (11), but for g H(4) N . Each value g(a + kh) in Eq. (11) belongs to one of the 2N branches eg e H(4) N , which we associate with p = 2N colors. By applying van der Waerden s theorem with p = 2N and s = em + 1, we see that if we set m = Nvd W( em + 1, 2N) 1, then the length-(m + 1) progression a, a + h, . . . , a + mh appearing in Eq. (11) contains at least one length-( em + 1) sub-progression ea, ea + eh, . . . , ea + emeh on which the values of g belong to the same branch. We can then satisfy condition (11) by defining the polynomial R by R(z0, . . . , zm) = Y s=(s0,...,sf m) e R(zs0, . . . , zsf m), (128) where s runs over all length-( em + 1) progressions from the set {0, . . . , m}. D Proof of Theorem 8 The only if part follows from Theorem 5. The if part is a generalization of Lemma 1 from [29]. Suppose that σ is not a polynomial; we want to prove that Hσ G1. By absorbing h in σ, we can assume that σ(0) = 0. Dropping then h, we specialize to f(x) = c sin(ωσ(bx)). Consider an arbitrary finite subset X = {x1, . . . , x N} [0, 1]. By Theorem 2, it is sufficient to show that there exsts b [ 1, 1] such that the values σ(bx1), . . . σ(bx N) are rationally independent. Suppose that this is not the case. For fixed coefficients λ = (λ1, . . . , λN), the function σλ(b) = PN n=1 λnσ(bxn) is real analytic for b [ 1, 1]. Since there are only countably many λ QN, there is some λ such that σλ vanishes at uncountably many b [ 1, 1]. Then, by analyticity, σλ 0 on [ 1, 1]. Expanding this σλ into the Taylor series at b = 0, we get the identity PN n=1 λnxm n = 0 for each m such that dmσ dwm (0) = 0. Suppose that xk have the natural order 0 x1 < . . . < x N. If there are infinitely many m for which dmσ dwm (0) = 0, then, by letting m , we first show that λN = 0, then λN 1 = 0, etc. If x1 > 0, then we exhaust all λn in this way; otherwise we exhaust all except λ1. However, λ1 = 0 follows then from σ(0) = 0. Thus, either λ = 0 or σ is a polynomial, which contradicts our assumption. E Proof of Theorem 9 Suppose that the function sequence fn(x) = cn sin(ωnσ(bnx) + hn) converges to a nontrivial limit point f . By a compactness argument, this requires at least one of the parameters cn, ωn to be unbounded. We expect unboundedness of one parameter to be compensated by another parameter converging to 0. Accordingly, we consider three possibilities: ωn 0, ωn ω = 0 and ωn . One of these possibilities can always by ensured by choosing a subsequence of fn, if necessary (in the sequel we will without further mention repeatedly use such reductions based on compactness and/or transition to subsequences). Case 1: ωn 0. Since σ(bnx) is uniformly bounded by our assumption and sin is periodic, we can assume that ωnσ(bnx) + hn n a with some constant a [0, 2π), for all x [0, 1]. If a = 0, π, then the limit f can only be trivial, a constant. Suppose that a = 0 (the case a = π leads to analogous results). In this case fn(x) = cn(ωnσ(bnx) + hn)(1 + o(1)), so f (x) = limn cn(ωnσ(bnx) + hn). There are two possibilities: bn b = 0 and bn 0. In the first case we obtain nontrivial limits f (x) = aσ(bx) + c, a, b, c R. (129) In the second case we can Taylor expand σ(bnx) = σ(0) + 1 r! drσ dxr (0)(bnx)r + o(br n), drσ dwr (0) = 0. (130) A nontrivial f requires cnbr n to have a finite nonzero limit. In this case, by adjusting hn we obtain nontrivial limits f (x) = a0 + arxr, a0, ar R. (131) Case 2: ωn ω = 0. Consider again the two possibilities: bn b = 0 and bn 0. In the first case we do not get any nontrivial limits. In the second case we again have Taylor expansion (130). Arguing as in Case 1, the only nontrivial limits that we can produce are of the form (131). Case 3: ωn . In this case we must have bn 0, since otherwise sin(ωnσ(bnx) + hn) will have an unbounded number of alternating values 1, so we must have cn 0 and accordingly the only possible limit is f = 0. It follows that we again have expansion (130), and fn(x) = cn sin ωn σ(0) + 1 r! drσ dxr (0)(bnx)r + o(br n) + hn . (132) To avoid triviality due to an unbounded number of alternating values 1 of sin, we must have ωnbr n = O(1). We get new nontrivial limits of fn when ωnbr n has a nonzero limit: f (x) = c sin(a0 + arxr), a, a0, ar R. (133) F Proof of Theorem 10 We need to prove that H(5) N G1 and H(5) N / G2. Proof of H(5) N G1. For N 2, this property can be concluded from Theorem 8, since we can just set σ(x) = sin(x) and h = h sin(0 x + π/2) in this theorem. In the case N = 1, observe that in the proof of Theorem 8 we only use h to redefine σ so that σ(0) = 0; otherwise we can set h = 0. Then, this proof becomes applicable to the family f(x) = c sin(ωσ(bx)) with σ(x) = sin(x + h) and any phase h = πk, k Z; this is a subfamily of H(5) N=1. Proof of H(5) N / G2. We will show that we cannot obtain a general continuous function as a limit of functions from H(5) N , because any such limit has a particular restrictive form. We write fn(x) = cn sin(gn(x)) + hn and consider several cases depending on the behavior of cn. Case 1: cn 0. In this case the limiting function is constant: f (x) = h. Case 2: cn c = 0. In this case hn must be bounded and hence can be assumed to converge to some h (by possibly choosing a subsequence). Denote ϕn(y) = cn sin(y) + hn; then ϕn(y) converges uniformly on R to ϕ (y) = c sin(y) + h. Clearly, the functions efn(x) = ϕ (gn) also uniformly converge to f . Consider now two alternatives: either the family {gn} is uniformly equicontinuous, or not. In the first case, fix some point x0 [0, 1], let rn = gn(x0) 2π and consider the shifted functions egn(x) = gn(x) 2πrn. We have ϕ (gn(x)) ϕ (egn(x)), the family {egn} is also equicontinuous, and egn(x0) [0, 2π] for all n, so that the family {egn} is additionally uniformly bounded on [0, 1]. Then, by the Arzelà Ascoli theorem, the family {egn} contains a uniformly convergent subsequence. It follows that f = ϕ (g ), where g is a uniform limit of some functions from H(2) N+1. Theorem 4 describes all such limits. In the case of the second alternative, there exist ϵ > 0 and a subsequence gns such that for some xs, x s [0, 1] we have |xs x s| 0 while |gns(xs) gns(x s)| ϵ for all s. Let rs = gns(xs) 2π and egs(x) = gns(x) 2πrs. Then egs(xs) [0, 2π] and by choosing a subsequence we can assume that gs(xs) converges to some y [0, 2π]. Using |egs(xs) egs(x s)| ϵ and the continuity of egs, we can then assume that egs(x s) converges to some y = y . Also, we can assume that xs and x s converge to some x [0, 1]. Then, by the uniform convergence of efn, supx [xs,x s] | efns(x) ef (x )| 0 as s . On the other hand, efns(x) = ϕ (gns(x)) = ϕ (egs(x)) so that, by the continuity of egs, lim s sup x [xs,x s] | efns(x) ef (x )| sup y [y ,y ] |ϕ (y) ef (x )| = 0, (134) because ϕ in nonconstant on any interval [y , y ]. This contradiction shows that the second alternative is impossible. Case 3: cn . In this case the range of gn must shrink, since otherwise the range of fn would be unbounded. Taking a subsequence, we can assume that for some y0 [0, 2π] we have gn(x) = y0 + 2πmn + o(1) uniformly on [0, 1]. We can then Taylor expand fn(x) = hn + cn sin(y0) + cn cos(y0)(gn y0 2πmn) (135) 2 sin(y0)(gn y0 2πmn)2 + cn O((gn y0 2πmn)3). (136) If cos(y0) = 0, then the respective term dominates the higher order terms and must be bounded as n increases. We can then discard the higher order terms, and we see that f is a uniform limit of the functions cngn an with some an. These limits are among the limits of the family H(2) N+1, again covered by Theorem 4. If cos(y0) = 0, then the respective term vanishes and the next, second order term will dominate the higher order terms. In this case f is a limit of functions of the form ψn = hn + an(gn bn)2. Substituting gn H(2) N and expanding products of sines, we see that ψn H(2) 1+2(N+1)2, so, again, the limits of ψn are restricted by Theorem 4.