Optimal approximation using complex-valued neural networks

Paul Geuchen
MIDS, KU Eichstätt-Ingolstadt, Auf der Schanz 49, 85049 Ingolstadt, Germany
paul.geuchen@ku.de

Felix Voigtlaender
MIDS, KU Eichstätt-Ingolstadt, Auf der Schanz 49, 85049 Ingolstadt, Germany
felix.voigtlaender@ku.de

Complex-valued neural networks (CVNNs) have recently shown promising empirical success, for instance for increasing the stability of recurrent neural networks and for improving the performance in tasks with complex-valued inputs, such as in MRI fingerprinting. While the overwhelming success of Deep Learning in the real-valued case is supported by a growing mathematical foundation, such a foundation is still largely lacking in the complex-valued case. We thus analyze the expressivity of CVNNs by studying their approximation properties. Our results yield the first quantitative approximation bounds for CVNNs that apply to a wide class of activation functions, including the popular modReLU and complex cardioid activation functions. Precisely, our results apply to any activation function that is smooth but not polyharmonic on some non-empty open set; this is the natural generalization of the class of smooth and non-polynomial activation functions to the complex setting. Our main result shows that the error for the approximation of $C^k$-functions scales as $m^{-k/(2n)}$ as $m \to \infty$, where $m$ is the number of neurons, $k$ the smoothness of the target function, and $n$ the (complex) input dimension. Under a natural continuity assumption, we show that this rate is optimal; we further discuss the optimality when dropping this assumption. Moreover, we prove that the problem of approximating $C^k$-functions using continuous approximation methods unavoidably suffers from the curse of dimensionality.

1 Introduction

Deep Learning currently predominantly relies on real-valued neural networks, which have led to breakthroughs in fields like image classification or speech recognition [23, 27, 43]. However, recent work has uncovered several application areas in which the use of complex-valued neural networks (CVNNs) leads to better results than the use of their real-valued counterparts. These application areas mainly include tasks where complex numbers inherently occur as inputs of a machine learning model, such as Magnetic Resonance Imaging (MRI) [16, 38, 44] and Polarimetric Synthetic Aperture Radar (PolSAR) imaging [8, 9, 48]. Moreover, CVNNs have been used to improve the stability of recurrent neural networks [5] and have been successfully applied in various other fields [32, 37]. The mathematical theory of these complex-valued neural networks, however, is still in its infancy. There is therefore a great interest in studying CVNNs and in particular in uncovering the differences and commonalities between CVNNs and their real-valued counterparts. A prominent example highlighting the unexpected differences between both network classes is the universal approximation theorem for neural networks, whose most general real-valued version was proven in 1993 [28] (with a more restricted version appearing earlier [17]) and which was recently generalized to the case of CVNNs [45]. The two theorems describe necessary and sufficient conditions on
an activation function which guarantee that arbitrarily wide neural networks of a fixed depth can approximate any continuous function on any compact set arbitrarily well (with respect to the uniform norm). Already here it was shown that complex-valued networks behave significantly differently from real-valued networks: while real-valued networks are universal if and only if the activation function is non-polynomial, complex-valued networks with a single hidden layer are universal if and only if the activation function is non-polyharmonic (see below for a definition). Furthermore, there exist continuous activation functions for which deep CVNNs are universal but shallow CVNNs are not, whereas the same cannot happen for real-valued neural networks. This example shows that it is highly relevant to study the properties of CVNNs and to examine which of the fundamental properties of real-valued networks extend to complex-valued networks.

Essentially the only existing quantitative result regarding the approximation-theoretical properties of CVNNs is [14], which provides results for approximating $C^k$-functions by deep CVNNs using the modReLU activation function. However, for real-valued NNs it is known that already shallow NNs can approximate $C^k$-functions at the optimal rate. Precisely, Mhaskar showed in [33] that one can approximate $C^k$-functions on $[-1,1]^n$ with an error of order $m^{-k/n}$ as $m \to \infty$, where $m$ is the number of neurons in the hidden layer. Here he assumed that the activation function is smooth on an open interval and that at some point of this interval no derivative vanishes. This is equivalent to the activation function being smooth and non-polynomial on that interval, cf. [20, p. 53].

The present paper shows that a comparable result holds in the setting of complex-valued networks, by proving that one can approximate every function in $C^k(\Omega_n;\mathbb{C})$ (where differentiability is understood in the sense of real variables) with an error of order $m^{-k/(2n)}$ (as $m \to \infty$) using shallow complex-valued neural networks with $m$ neurons in the hidden layer. Here we define the cube $\Omega_n := [-1,1]^n + i\,[-1,1]^n$. The result holds whenever the activation function $\varphi : \mathbb{C} \to \mathbb{C}$ is smooth and non-polyharmonic on some non-empty open set. This is a very natural condition, since for polyharmonic activation functions there exist $C^k$-functions that cannot be approximated at all below some error threshold using shallow neural networks with this activation function [45]. Furthermore, the present paper studies to what extent the approximation order $m^{-k/(2n)}$ is optimal, meaning that an order of $m^{-(k/(2n))-\alpha}$ (where $\alpha > 0$) cannot be achieved.

| | Condition on activation function | Continuity of weight selection | Approximation error |
|---|---|---|---|
| Theorem 3.2 | smooth & non-polyharmonic | possible | $O(m^{-k/(2n)})$ |
| Consequence of Theorem 4.1 | continuous | assumed | $\Omega(m^{-k/(2n)})$ |
| Theorem 4.2 | very special activation function | not assumed | $O(m^{-k/(2n-1)})$ |
| Theorem 4.3 | $1/(1+e^{-\operatorname{Re}(z)})$ | not assumed | $\widetilde{\Omega}(m^{-k/(2n)})$ |

Table 1: Overview of the proven approximation bounds. Here, $k$ is the regularity of the approximated functions (which are assumed to be $C^k$), $n$ the (complex) input dimension and $m$ the number of neurons in the hidden layer of the network. The notation $\widetilde{\Omega}$ is similar to $\Omega$, but ignores log factors.
Here it turns out that the derived order of approximation is indeed optimal (even in the class of CVNNs with possibly more than one hidden layer) in the setting that the weight selection is continuous, meaning that the map that assigns to a function $f \in C^k(\Omega_n;\mathbb{C})$ the weights of the approximating network is continuous with respect to some norm on $C^k(\Omega_n;\mathbb{C})$. This continuity assumption is natural, since typical learning algorithms such as (stochastic) gradient descent use samples $f(x_j)$ of the target function and then apply continuous operations to them to update the network weights. We investigate this optimality result further by dropping the continuity assumption and constructing two special smooth and non-polyharmonic activation functions, the first one having the property that the order of approximation can indeed be strictly improved via a discontinuous selection of the related weights. For the second activation function we show that the order of $m^{-k/(2n)}$ cannot be improved, even if one allows a discontinuous weight selection. This in particular shows that in the given generality of arbitrary smooth, non-polyharmonic activation functions, the upper bound $O(m^{-k/(2n)})$ cannot be improved, even for a possibly discontinuous choice of the weights. An overview of the approximation bounds proven in this paper can be found in Table 1.

Moreover, we analyze the tractability (in terms of the input dimension $n$) of the considered problem of approximating $C^k$-functions using neural networks. Theorem 5.1 shows that one necessarily needs a number of parameters exponential in $n$ to obtain a non-trivial approximation error. To the best of our knowledge, Theorem 5.1 is the first result showing that the problem of approximating $C^k$-functions using continuous approximation methods is intractable (in terms of the input dimension $n$).

1.1 Related Work

Real-valued neural networks. By now, the approximation properties of real-valued neural networks are quite well-studied (cf. [10, 28, 33, 40, 41, 46, 47] and the references therein). We here only discuss a few papers that are most closely related to the present work. In [33], Mhaskar analyzes the rate of approximation of shallow real-valued neural networks for target functions of regularity $C^k$. Our results can be seen as the generalization of [33] to the complex setting. Our proofs rely on several techniques from [33]; however, significant modifications are required to make the proofs work for general smooth non-polyharmonic functions. One of the first papers to observe that neural networks with a general (smooth) activation function can be surprisingly expressive is [31], where it was shown that a neural network of constant size can be universal. One of the activation functions in Section 4 is based on a similar idea. The importance of distinguishing between continuous and discontinuous weight selection (which in our setting is discussed in Section 4) was observed for ReLU networks in [47]. The paper [25] shows that neural network approximation is not continuous in the following sense: the best approximating neural network $\Phi(f)$ of a given size does not depend continuously on $f \in C^k$. This result, however, is not in conflict with our results: we want to assign to any $C^k$-function $f$ a network $\widetilde{\Phi}(f)$ that approximates $f$ below the error threshold $m^{-k/(2n)}$. The network $\widetilde{\Phi}(f)$, however, does not have to coincide with the best approximating network $\Phi(f)$.

Complex-valued neural networks.
When it comes to general literature about mathematical properties of complex-valued neural networks, surprisingly little work can be found. The Universal Approximation Theorem for Complex-Valued Neural Networks [45] has already been mentioned above. In particular, it has been shown that shallow CVNNs are universal if and only if the activation function $\varphi$ is not polyharmonic. Thus, the condition assumed in the present paper (that $\varphi$ should be smooth, but not polyharmonic) is quite natural. Regarding quantitative approximation results for CVNNs, the only existing work of which we are aware is [14], analyzing the approximation capabilities of deep CVNNs where the modReLU is used as activation function. Since the modReLU satisfies our condition regarding the activation function, the present work can be seen as an improvement on [14]. Precisely, (i) we consider general activation functions, including but not limited to the modReLU, (ii) we improve the order of approximation by a log factor, and (iii) we show that this order of approximation can be achieved using shallow networks instead of the deep networks used in [14]. We stress that our proof techniques differ significantly from the ones applied in [14]: the arguments in [14] take their main ideas from [46], making heavy use of the specific definition of the modReLU. In contrast, since we consider quite general activation functions, we necessarily follow a much more general approach following the ideas from [33].

2 Preliminaries

Shallow complex-valued neural networks. In this paper we mainly consider so-called shallow complex-valued neural networks, meaning complex-valued neural networks with a single hidden layer. Precisely, we consider functions of the form

$$\mathbb{C}^n \to \mathbb{C},\quad z \mapsto \sigma^T\Phi(z) = \sum_{j=1}^m \sigma_j\,\varphi\big(\rho_j^T z + \eta_j\big)$$

with $\rho_1,\dots,\rho_m \in \mathbb{C}^n$, $\sigma_1,\dots,\sigma_m,\eta_1,\dots,\eta_m \in \mathbb{C}$ and an activation function $\varphi : \mathbb{C}\to\mathbb{C}$. Here, $m\in\mathbb{N}$ denotes the number of neurons of the network and we write $v^T$ for the transpose of a vector $v$.

Figure 1: Graphical representation of a shallow neural network $\sigma^T\Phi(z) = \sum_{j=1}^m \sigma_j\,\varphi(\rho_j^Tz+\eta_j)$. Input and output neurons are depicted as dots, hidden neurons are depicted as circles. The term "first layer" describes the transformation from the input to the hidden neurons, including the application of the activation function.

To simplify the formulation of the results, we introduce the following notation: we write $\mathcal{F}^\varphi_{n,m}$ for the set of first layers of shallow complex-valued neural networks with activation function $\varphi$, with $n$ input neurons and $m$ hidden neurons, meaning

$$\mathcal{F}^\varphi_{n,m} := \Big\{ z \mapsto \big(\varphi(\rho_j^Tz+\eta_j)\big)_{j=1}^m \;:\; \rho_j\in\mathbb{C}^n,\ \eta_j\in\mathbb{C}\Big\} \subseteq \{F : \mathbb{C}^n\to\mathbb{C}^m\}.$$

Hence, each shallow CVNN can be written as $\sigma^T\Phi$ with $\sigma\in\mathbb{C}^m$ and $\Phi\in\mathcal{F}^\varphi_{n,m}$; see Figure 1 for a graphical representation of a shallow CVNN.

Approximation. The paper aims to analyze the approximation of $C^k$-functions on the complex cube $\Omega_n := [-1,1]^n + i\,[-1,1]^n$ using shallow CVNNs. Here, we say that a function $f : \Omega_n\to\mathbb{C}$ is in $C^k(\Omega_n;\mathbb{C})$ if and only if $f$ is $k$ times continuously differentiable on $\Omega_n$, where the derivative is to be understood in the sense of real variables, i.e., in the sense of interpreting $f$ as a function $[-1,1]^{2n}\to\mathbb{R}^2$ and taking usual real derivatives. We further define the $C^k$-norm of a function $f\in C^k(\Omega_n;\mathbb{C})$ as

$$\|f\|_{C^k(\Omega_n;\mathbb{C})} := \sup_{\mathbf{k}\in\mathbb{N}^{2n}_0,\ |\mathbf{k}|\le k} \big\|\partial^{\mathbf{k}}f\big\|_{L^\infty(\Omega_n;\mathbb{C})},\quad\text{where}\quad \|g\|_{L^\infty(\Omega_n;\mathbb{C})} := \sup_{z\in\Omega_n}|g(z)| \tag{2.1}$$

for any function $g : \Omega_n\to\mathbb{C}$. Note that we write $\mathbb{N} = \{1,2,3,\dots\}$ and $\mathbb{N}_0 = \{0\}\cup\mathbb{N}$.
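For concreteness, the following is a minimal NumPy sketch (ours, not part of the paper) of the map $z\mapsto\sigma^T\Phi(z)$ defined above; the activation `np.exp` is a placeholder chosen purely for illustration.

```python
import numpy as np

def shallow_cvnn(z, rho, eta, sigma, phi):
    """Evaluate z -> sigma^T Phi(z) = sum_j sigma_j * phi(rho_j^T z + eta_j).

    z:     (n,)   complex input vector
    rho:   (m, n) complex first-layer weights (row j is rho_j)
    eta:   (m,)   complex first-layer biases
    sigma: (m,)   complex output weights
    phi:   entrywise activation function C -> C
    """
    pre = rho @ z + eta      # first layer Phi(z): pre-activations rho_j^T z + eta_j
    return sigma @ phi(pre)  # scalar network output sigma^T Phi(z)

rng = np.random.default_rng(0)
m, n = 3, 2  # three hidden neurons, two complex inputs
rho = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
eta = rng.standard_normal(m) + 1j * rng.standard_normal(m)
sigma = rng.standard_normal(m) + 1j * rng.standard_normal(m)
z = np.array([0.5 - 0.2j, -0.3 + 0.7j])
print(shallow_cvnn(z, rho, eta, sigma, np.exp))
```

Note that `rho @ z` deliberately uses the plain (non-conjugating) transpose, matching the notation $\rho_j^T z$ above.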
Using the previously introduced notation, we thus seek to bound the worst-case approximation error, i.e.,

$$\sup_{\substack{f\in C^k(\Omega_n;\mathbb{C})\\ \|f\|_{C^k(\Omega_n;\mathbb{C})}\le 1}}\ \inf_{\substack{\Phi\in\mathcal{F}^\varphi_{n,m}\\ \sigma\in\mathbb{C}^m}}\ \big\|f-\sigma^T\Phi\big\|_{L^\infty(\Omega_n;\mathbb{C})}.$$

Wirtinger calculus and polyharmonic functions. For a function $f:\mathbb{C}\to\mathbb{C}$ which is differentiable in the sense of real variables at a point $z_0\in\mathbb{C}$ we define its Wirtinger derivatives at $z_0$ as

$$\partial_{\mathrm{wirt}}f(z_0) := \frac{1}{2}\left(\frac{\partial f}{\partial x}(z_0) - i\,\frac{\partial f}{\partial y}(z_0)\right)\quad\text{and}\quad \overline{\partial}_{\mathrm{wirt}}f(z_0) := \frac{1}{2}\left(\frac{\partial f}{\partial x}(z_0) + i\,\frac{\partial f}{\partial y}(z_0)\right).$$

Here, $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$ denote the usual partial derivatives in the sense of real variables. We extend this definition to multivariate functions defined on open subsets of $\mathbb{C}^n$ by considering coordinatewise Wirtinger derivatives. A function $f:U\to\mathbb{C}$, where $U\subseteq\mathbb{C}$ is an open set, is called smooth if it is differentiable arbitrarily many times (in the sense of real variables). We write $f\in C^\infty(U;\mathbb{C})$ in that case. Moreover, $f$ is called polyharmonic (on $U$) if it is smooth and if there exists $m\in\mathbb{N}_0$ satisfying $\Delta^m f\equiv 0$ on $U$. Here,

$$\Delta := \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} = 4\,\partial_{\mathrm{wirt}}\overline{\partial}_{\mathrm{wirt}}$$

denotes the usual Laplace operator. The following Proposition 2.1 describes a property of non-polyharmonic functions which is crucial for proving the approximation results of this paper.

Proposition 2.1. Let $\emptyset\neq U\subseteq\mathbb{C}$ be an open set and let $\varphi\in C^\infty(U;\mathbb{C})$ be non-polyharmonic. Then for every $M\in\mathbb{N}_0$ there exists a point $z_M\in U$ satisfying

$$\partial^m_{\mathrm{wirt}}\overline{\partial}^\ell_{\mathrm{wirt}}\varphi(z_M)\neq 0\quad\text{for all } m,\ell\in\mathbb{N}_0 \text{ with } m,\ell\le M.$$

The proof of Proposition 2.1 is an application of the Baire category theorem; see Appendix B.2.

Important complex activation functions. We briefly discuss to what extent two commonly used complex activation functions satisfy our assumptions: the modReLU proposed in [5] and the complex cardioid used in [44] for MRI fingerprinting, where the performance could be significantly improved using complex-valued neural networks. The modReLU is defined as

$$\sigma_{\mathrm{modReLU},b} : \mathbb{C}\to\mathbb{C},\quad \sigma_{\mathrm{modReLU},b}(z) := \begin{cases} (|z|+b)\,\dfrac{z}{|z|}, & \text{if } |z|+b\ge 0,\\[2pt] 0, & \text{otherwise},\end{cases}$$

where $b<0$ is a fixed parameter. The complex cardioid is defined as

$$\mathrm{card} : \mathbb{C}\to\mathbb{C},\quad \mathrm{card}(z) := \frac{1}{2}\big(1+\cos(\angle z)\big)\,z.$$

Here, $\angle z = \theta\in\mathbb{R}$ denotes the polar angle of a complex number $z = re^{i\theta}$, where we define $\angle 0 := 0$; see Figure 2 for plots of the absolute value of the two functions.

Figure 2: Absolute value of the activation functions $\sigma_{\mathrm{modReLU},-1}$ (left) and $\mathrm{card}$ (right).

Both functions are smooth and non-polyharmonic on a non-empty open subset of $\mathbb{C}$, which is proven in Appendix A.2. Furthermore, they are both continuous on $\mathbb{C}$. Therefore, our approximation bounds established in Theorems 3.1 and 3.2 in particular apply to those two functions.
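Both activation functions are straightforward to implement. The following NumPy sketch (ours) follows the definitions above; at $z=0$ the quotient $z/|z|$ is set to $0$, which is immaterial since $|z|+b<0$ there anyway.

```python
import numpy as np

def modrelu(z, b):
    """modReLU: (|z| + b) * z/|z| where |z| + b >= 0, and 0 otherwise (b < 0)."""
    z = np.asarray(z, dtype=complex)
    r = np.abs(z)
    phase = np.divide(z, r, out=np.zeros_like(z), where=(r > 0))  # z/|z|, with 0 at z = 0
    return np.where(r + b >= 0, (r + b) * phase, 0)

def cardioid(z):
    """Complex cardioid: 0.5 * (1 + cos(angle(z))) * z, with the convention angle(0) = 0."""
    z = np.asarray(z, dtype=complex)
    return 0.5 * (1 + np.cos(np.angle(z))) * z

print(modrelu([2 + 0j, 0.3 + 0.4j], b=-1.0))  # -> [1, 0]: only inputs with |z| >= -b pass
print(cardioid([1 + 0j, -1 + 0j, 1j]))        # -> [1, 0, 0.5j]: negative reals are zeroed
```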
3 Main results

In this section we state the main results of this paper and provide proof sketches for them. Detailed proofs of the two statements can be found in Appendices B.3 and C.3. First we show in Theorem 3.1 that it is possible to approximate any complex polynomial in $z$ and $\overline z$ arbitrarily well using shallow CVNNs, with the size of the networks only depending on the degree of the polynomial (not on the desired approximation accuracy). Using this result we can prove the main approximation bound, Theorem 3.2, by first approximating a given $C^k$-function using a polynomial in $z$ and $\overline z$ and then approximating this polynomial using Theorem 3.1. For $m,n\in\mathbb{N}$ let

$$\mathcal{P}^n_m := \Big\{ \mathbb{C}^n\to\mathbb{C},\ z\mapsto \sum_{\mathbf{m},\boldsymbol{\ell}\le m} a_{\mathbf{m},\boldsymbol{\ell}}\, z^{\mathbf{m}}\,\overline{z}^{\boldsymbol{\ell}} \;:\; a_{\mathbf{m},\boldsymbol{\ell}}\in\mathbb{C}\Big\}$$

denote the space of all complex polynomials on $\mathbb{C}^n$ in $z$ and $\overline z$ of componentwise degree at most $m$. Here, we are summing over all multi-indices $\mathbf{m},\boldsymbol{\ell}\in\mathbb{N}^n_0$ with $m_j,\ell_j\le m$ for every $j\in\{1,\dots,n\}$ and use the notation

$$z^{\mathbf m} := \prod_{j=1}^n z_j^{m_j}\quad\text{and}\quad \overline z^{\boldsymbol\ell} := \prod_{j=1}^n \overline{z_j}^{\,\ell_j}.$$

The space $\mathcal{P}^n_m$ is finite-dimensional; hence, it makes sense to talk about bounded subsets of $\mathcal{P}^n_m$ without specifying a norm.

Theorem 3.1. Let $m,n\in\mathbb{N}$, $\varepsilon>0$ and let $\varphi:\mathbb{C}\to\mathbb{C}$ be smooth and non-polyharmonic on an open set $\emptyset\neq U\subseteq\mathbb{C}$. Let $P\subseteq\mathcal{P}^n_m$ be bounded and set $N := (4m+1)^{2n}$. Then there exists a first layer $\Phi\in\mathcal{F}^\varphi_{n,N}$ with the following property: for each polynomial $p\in P$ there exists $\sigma\in\mathbb{C}^N$ such that

$$\|p-\sigma^T\Phi\|_{L^\infty(\Omega_n;\mathbb{C})}\le\varepsilon.$$

Sketch of proof. For any multi-indices $\mathbf m,\boldsymbol\ell\in\mathbb{N}^n_0$ an inductive argument shows for every fixed $z\in\Omega_n$ and $b\in\mathbb{C}$ that

$$\partial^{\mathbf m}_{\mathrm{wirt}}\overline{\partial}^{\boldsymbol\ell}_{\mathrm{wirt}}\big[w\mapsto\varphi(w^Tz+b)\big] = z^{\mathbf m}\,\overline z^{\boldsymbol\ell}\cdot\big(\partial^{|\mathbf m|}_{\mathrm{wirt}}\overline{\partial}^{|\boldsymbol\ell|}_{\mathrm{wirt}}\varphi\big)(w^Tz+b).$$

Here, $\partial^{\mathbf m}_{\mathrm{wirt}}$ and $\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}$ denote the multivariate Wirtinger derivatives with respect to $w$ according to the multi-indices $\mathbf m$ and $\boldsymbol\ell$, respectively. Evaluating this at $w=0$ and taking $b\in\mathbb{C}$ such that none of the mixed Wirtinger derivatives of $\varphi$ at $b$ up to a sufficiently high order vanish (where such a $b$ exists by Proposition 2.1) shows that we can rewrite

$$z^{\mathbf m}\,\overline z^{\boldsymbol\ell} = \partial^{\mathbf m}_{\mathrm{wirt}}\overline{\partial}^{\boldsymbol\ell}_{\mathrm{wirt}}\big[w\mapsto\varphi(w^Tz+b)\big]\Big|_{w=0}\cdot\Big(\big(\partial^{|\mathbf m|}_{\mathrm{wirt}}\overline{\partial}^{|\boldsymbol\ell|}_{\mathrm{wirt}}\varphi\big)(b)\Big)^{-1}. \tag{3.1}$$

The mixed Wirtinger derivatives can by definition be expressed as linear combinations of usual partial derivatives. Those partial derivatives can be approximated using a generalized version of divided differences: if $g\in C^k((-r,r)^s;\mathbb{R})$ and $\mathbf p\in\mathbb{N}^s_0$ with $|\mathbf p|\le k$, we have

$$\partial^{\mathbf p}g(0) \approx (2h)^{-|\mathbf p|}\sum_{0\le\mathbf r\le\mathbf p}(-1)^{|\mathbf p|-|\mathbf r|}\binom{\mathbf p}{\mathbf r}\,g\big(h(2\mathbf r-\mathbf p)\big)\quad\text{for } h\to 0. \tag{3.2}$$

See Appendix B.1 for a proof of this approximation. Note that when one takes $g(w) = \varphi(w^Tz+b)$, the right-hand side of (3.2) is a shallow neural network, as a function of $z$. Combining (3.1) and (3.2) yields the desired result; see Appendix B.3 for the details.

It is crucial that the size of the networks considered in Theorem 3.1 is independent of the approximation accuracy $\varepsilon$. Moreover, the first layer $\Phi$ can be chosen independently of the particular polynomial $p$. Only the weights $\sigma$ connecting hidden layer and output neuron have to be adapted to $p$. The final approximation result is as follows. Its full proof can be found in Appendix C.3.

Theorem 3.2. Let $n,k\in\mathbb{N}$. Then there exists a constant $c = c(n,k)>0$ with the following property: for any activation function $\varphi:\mathbb{C}\to\mathbb{C}$ that is smooth and non-polyharmonic on an open set $\emptyset\neq U\subseteq\mathbb{C}$ and for any $m\in\mathbb{N}$ there exists a first layer $\Phi\in\mathcal{F}^\varphi_{n,m}$ with the following property: for any $f\in C^k(\Omega_n;\mathbb{C})$ there exist coefficients $\sigma = \sigma(f)\in\mathbb{C}^m$ such that

$$\|f-\sigma^T\Phi\|_{L^\infty(\Omega_n;\mathbb{C})}\le c\cdot m^{-k/(2n)}\cdot\|f\|_{C^k(\Omega_n;\mathbb{C})}.$$

Furthermore, the map $f\mapsto\sigma(f)$ is a continuous linear operator with respect to the $L^\infty$-norm.

Sketch of proof. By splitting $f$ into real and imaginary part we may assume that $f$ is real-valued. Fourier-analytical results (recalled in Appendix C.1) imply that each $C^k$-function $f$ can be well approximated by a linear combination of (multivariate) Chebyshev polynomials. Precisely, we have

$$\|f-P\|_{L^\infty(\Omega_n;\mathbb{R})} \le \frac{c}{m^k}\,\|f\|_{C^k(\Omega_n;\mathbb{R})},\quad\text{where } P \text{ is given via the formula } P(z) = \sum_{0\le\mathbf k\le 2m-1} V^m_{\mathbf k}(f)\,T_{\mathbf k}(z),\ z\in\Omega_n.$$

Here, the functions $T_{\mathbf k}$ are multivariate versions of Chebyshev polynomials and the $V^m_{\mathbf k}(f)$ are continuous linear functionals in $f$. The constant $c>0$ only depends on $n$ and $k$. See Appendix C.1 for a rigorous proof of this approximation property. Approximating the polynomials $T_{\mathbf k}$ by neural networks according to Theorem 3.1 yields the desired claim; see Appendix C.3 for the details.
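The finite-difference approximation (3.2) is easy to check numerically. The toy sketch below (ours, not part of the paper's proofs) evaluates the right-hand side of (3.2) for $g(x) = e^{x_1}\sin(x_2)$ and $\mathbf p = (2,1)$, for which the exact value is $\partial^{\mathbf p}g(0) = 1$.

```python
import numpy as np
from itertools import product
from math import comb, prod

def finite_diff(g, p, h):
    """Right-hand side of (3.2): approximates the mixed partial d^p g(0) via
    (2h)^{-|p|} * sum_{0 <= r <= p} (-1)^{|p|-|r|} * binom(p, r) * g(h * (2r - p))."""
    total = 0.0
    for r in product(*(range(pj + 1) for pj in p)):
        coeff = (-1) ** (sum(p) - sum(r)) * prod(comb(pj, rj) for pj, rj in zip(p, r))
        total += coeff * g(h * (2 * np.array(r) - np.array(p)))
    return total / (2 * h) ** sum(p)

# g(x) = exp(x1) * sin(x2) and p = (2, 1): the exact value is d^p g(0) = 1.
g = lambda x: np.exp(x[0]) * np.sin(x[1])
print(finite_diff(g, p=(2, 1), h=1e-3))  # ~ 1.0, approaching the exact value as h -> 0
```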
Remark 3.3. Theorem 3.1 can not only be used to derive approximation rates for the approximation of $C^k$-functions but can be applied to any function class that is well approximable by algebraic polynomials. For example, it can be used to prove an order of approximation of $\nu^{-m^{1/(2n)}/17}$ for the approximation of functions $f:\Omega_n\to\mathbb{C}$ that can be holomorphically extended onto some polyellipse in $\mathbb{C}^{2n}$. The parameter $\nu>1$ specifies the size of this polyellipse. We refer the interested reader to Appendix D for detailed definitions, statements and proofs of this fact.

4 Optimality of the derived approximation rate

In this section we discuss the optimality of the approximation rate proven in Theorem 3.2. We first deduce from general results by DeVore et al. [19] that the rate is optimal in the setting that the map which assigns to a function $f\in C^k(\Omega_n;\mathbb{C})$ the weights of the approximating network is continuous, as is the case in Theorem 3.2. However, it might be possible to achieve a better degree of approximation if this map is not required to be continuous, which is discussed in the second part of this section. Proofs for all the statements in this section are given in Appendices E.1, F.2 and F.3.

Continuous weight selection. We begin by considering the case of continuous weight selection. As mentioned in the introduction, this is a natural assumption, since in classical training algorithms such as (S)GD, continuous operations based on samples $f(x_j)$ are used to adjust the weights. In [19, Theorem 4.2] a lower bound of $m^{-k/s}$ is established for the rate of approximating functions $f$ of Sobolev regularity $W^{k,\infty}$ in the following very general setting: the set of functions that is used for approximation can be parametrized using $m$ (real) parameters, and the map that assigns to $f$ the parameters of the approximating function is continuous with respect to some norm on $W^{k,\infty}([-1,1]^s;\mathbb{R})$. A detailed version of the proof of that statement (for $C^k$ instead of $W^{k,\infty}$) is contained in Appendix E.1. A careful transfer of this result to the complex-valued setting yields the following theorem (see also Appendix E.1).

Theorem 4.1. Let $n,k\in\mathbb{N}$. Then there exists a constant $c = c(n,k)>0$ with the following property: for any $m\in\mathbb{N}$, any map $a : C^k(\Omega_n;\mathbb{C})\to\mathbb{C}^m$ that is continuous with respect to some norm on $C^k(\Omega_n;\mathbb{C})$, and any map $M : \mathbb{C}^m\to C(\Omega_n;\mathbb{C})$, we have

$$\sup_{\substack{f\in C^k(\Omega_n;\mathbb{C}),\\ \|f\|_{C^k(\Omega_n;\mathbb{C})}\le 1}} \big\|f-M(a(f))\big\|_{L^\infty(\Omega_n;\mathbb{C})} \ge c\cdot m^{-k/(2n)}.$$

The interpretation of Theorem 4.1 is as follows: if one approximates $C^k$-functions on $\Omega_n$ using a set of functions that can be parametrized using $m$ (complex) parameters and one assumes that the weight assignment is continuous, one cannot achieve a better rate of approximation than $m^{-k/(2n)}$. As a special case it can be deduced that the approximation rate is at most $m^{-k/(2n)}$ when approximating $C^k$-functions using shallow CVNNs with $m$ parameters under continuous weight assignment (see Corollary E.3). One can show that this even holds for deep CVNNs. Hence, for continuous weight selection the rate proven in Theorem 3.2 is optimal, even in the class of (possibly) deep networks.

Discontinuous weight selection. When we drop the continuity assumption on the selection of the weights, the behavior of the optimal approximation rate is more subtle. Precisely, we show that there are activation functions for which the rate of approximation can be improved to $m^{-k/(2n-1)}$. On the other hand, we also show that there are activation functions for which an improvement of the approximation rate (up to logarithmic factors) is not possible.

Theorem 4.2.
There exists a function $\varphi\in C^\infty(\mathbb{C};\mathbb{C})$ which is non-polyharmonic and has the following additional property: for every $k\in\mathbb{N}$ and $n\in\mathbb{N}$ there exists a constant $c = c(n,k)>0$ such that for any $m\in\mathbb{N}$ and $f\in C^k(\Omega_n;\mathbb{C})$ there is a first layer $\Phi\in\mathcal{F}^\varphi_{n,m}$ and a vector $\sigma\in\mathbb{C}^m$ such that

$$\|f-\sigma^T\Phi\|_{L^\infty(\Omega_n;\mathbb{C})} \le c\cdot m^{-k/(2n-1)}\cdot\|f\|_{C^k(\Omega_n;\mathbb{C})}.$$

Figure 3: Illustration of the construction of the activation function in Theorem 4.2: on $\Omega_1$, $\varphi$ agrees with $z\mapsto e^{\operatorname{Re}(z)}$, while near each shift $3\ell$ it reproduces the function $u_\ell$ (shown for $u_1$ and $u_2$).

Sketch of proof. The function $\varphi$ is constructed in the following way: take a countable dense subset $\{u_\ell\}_{\ell\in\mathbb{N}}$ of $C(\Omega_1;\mathbb{C})$, for instance the set of all polynomials in $z$ and $\overline z$ with coefficients in $\mathbb{Q}+i\mathbb{Q}$. Define $\varphi$ in a way such that $\varphi(z+3\ell) = u_\ell(z)$ for every $z\in\Omega_1$ and $\ell\in\mathbb{N}$. Furthermore, to ensure that $\varphi$ is non-polyharmonic, let $\varphi(z) = e^{\operatorname{Re}(z)}$ for every $z\in\Omega_1$. The smoothness of $\varphi$ can be accomplished by multiplying with a smooth bump function; see Lemma F.4 for details. The construction of $\varphi$ is illustrated in Figure 3. Let then $f\in C^k(\Omega_n;\mathbb{C})$ be arbitrary. General results from the theory of ridge functions [22] show that there exist $b_1,\dots,b_m\in\mathbb{C}^n$ and $g_1,\dots,g_m\in C(\Omega_1;\mathbb{C})$ such that

$$\Big\|f-\sum_{j=1}^m g_j\big(b_j^Tz\big)\Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le c\cdot m^{-k/(2n-1)}\cdot\|f\|_{C^k(\Omega_n;\mathbb{C})};$$

see Proposition F.3 and Appendix F.1 for a detailed proof of this fact. Approximating the functions $g_j$ by suitable functions $u_{\ell_j}$ and expressing those functions via $\varphi(\cdot+3\ell_j)$ yields the claim.

The preceding theorem showed that there exists an activation function for which the rate in Theorem 3.2 can be strictly improved, if one allows a discontinuous weight selection. In contrast, the following theorem shows for a certain (quite natural) activation function that the rate $m^{-k/(2n)}$ cannot be improved (up to logarithmic factors), even if one allows a discontinuous weight assignment.

Theorem 4.3. Let $n,k\in\mathbb{N}$ and

$$\varphi:\mathbb{C}\to\mathbb{C},\quad \varphi(z) := \frac{1}{1+e^{-\operatorname{Re}(z)}}.$$

Then $\varphi$ is smooth but not polyharmonic. Furthermore, there exists a constant $c = c(n,k)>0$ with the following property: for any $m\in\mathbb{N}_{\ge 2}$ there exists a function $f\in C^k(\Omega_n;\mathbb{C})$ with $\|f\|_{C^k(\Omega_n;\mathbb{C})}\le 1$, such that for every $\Phi\in\mathcal{F}^\varphi_{n,m}$ and $\sigma\in\mathbb{C}^m$ we have

$$\|f-\sigma^T\Phi\|_{L^\infty(\Omega_n;\mathbb{C})} \ge c\cdot(m\ln(m))^{-k/(2n)}.$$

Sketch of proof. The idea of the proof is based on that of [46, Theorem 4], but instead of the bound for the VC dimension of ReLU networks used in [46], we employ a bound for the VC dimension stated in [4, Theorem 8.11] using the real sigmoid function. For a detailed introduction to the concept of the VC dimension and related topics, see for instance [39, Chapter 6]. A technical reduction from the complex to the real case (see Appendix F.3) shows that it suffices to show the following: if $\varepsilon\in(0,\frac12)$ and $m\in\mathbb{N}$ are such that for every $f\in C^k([-1,1]^n;\mathbb{R})$ with $\|f\|_{C^k([-1,1]^n;\mathbb{R})}\le 1$ there exists a real-valued shallow network $\mathcal N$ with $\gamma(x) = \frac{1}{1+e^{-x}}$ as activation function satisfying $\|f-\mathcal N\|_{L^\infty([-1,1]^n;\mathbb{R})}\le\varepsilon$, then necessarily $m\ln(m)\ge c\cdot\varepsilon^{-n/k}$, where the constant $c$ only depends on $n$ and $k$.

To show that the latter claim holds, we assume that $\varepsilon$ and $m$ have the property stated above. Take $N\in\mathbb{N}$ with $N^{-k}\asymp\varepsilon$ and consider the grid $G := \frac1N\{-N,\dots,N\}^n\subseteq[-1,1]^n$. For every $\alpha\in G$ we pick a number $z_\alpha\in\{0,1\}$ arbitrarily and construct a map $f\in C^\infty([-1,1]^n;\mathbb{R})$ satisfying $f(\alpha) = z_\alpha$ for every $\alpha\in G$. Rescaling $f$ to $\tilde f$ ensures $\|\tilde f\|_{C^k([-1,1]^n;\mathbb{R})}\le 1$, but then $\tilde f(\alpha) = c_0\,z_\alpha\,N^{-k}$, where $c_0 = c_0(n,k)>0$. By assumption, we can infer the existence of a shallow real-valued neural network $\mathcal N$ with $\gamma$ as activation function and $m$ hidden neurons satisfying $\|\tilde f-\mathcal N\|_{L^\infty([-1,1]^n;\mathbb{R})}\le\varepsilon$. But this shows

$$\mathcal N(\alpha)\ \begin{cases} > c\,N^{-k}, & \text{if } z_\alpha = 1,\\ < c\,N^{-k}, & \text{if } z_\alpha = 0,\end{cases}$$

for all $\alpha\in G$ with a constant $c = c(n,k)>0$. Since the $z_\alpha$ are arbitrary, it follows that the set

$$\mathcal H := \Big\{\mathbb{1}\big(\mathcal N > c\,N^{-k}\big)\big|_G \;:\; \mathcal N \text{ a shallow NN with activation } \gamma \text{ and } m \text{ hidden neurons}\Big\}$$

shatters the whole grid $G$. This yields $\mathrm{VC}(\mathcal H)\ge|G| = (2N+1)^n$. On the other hand, the bound provided by [4, Theorem 8.11] for linear threshold networks yields $\mathrm{VC}(\mathcal H)\lesssim m\ln(N)$. Combining the two bounds and using $N^{-k}\asymp\varepsilon$ yields the claim.
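The grid construction used in the sketch can be made concrete. The following toy code (the particular bump function and all names are our choices, not taken from the paper) builds a smooth function attaining arbitrary prescribed values $z_\alpha$ on the grid $G = \frac1N\{-N,\dots,N\}^n$ by summing bumps with disjoint supports; rescaling as in the proof then bounds the $C^k$-norm.

```python
import numpy as np
from itertools import product

def bump(x):
    """Smooth bump supported in the open unit ball, with bump(0) = 1."""
    s = float(np.sum(np.asarray(x, dtype=float) ** 2))
    return float(np.exp(1.0 - 1.0 / (1.0 - s))) if s < 1.0 else 0.0

def grid_function(x, z, N):
    """f(x) = sum_alpha z[alpha] * bump(2N(x - alpha)) over G = (1/N){-N,...,N}^n.
    The bumps have disjoint supports, so f(alpha) = z[alpha] for every grid point."""
    x = np.asarray(x, dtype=float)
    val = 0.0
    for idx in product(range(-N, N + 1), repeat=len(x)):
        alpha = np.array(idx) / N
        val += z[idx] * bump(2 * N * (x - alpha))
    return val

# Example: n = 1, N = 2, an arbitrary 0/1 pattern on the five grid points.
z = {(-2,): 1, (-1,): 0, (0,): 1, (1,): 1, (2,): 0}
print([grid_function([a / 2], z, N=2) for a in range(-2, 3)])  # [1.0, 0.0, 1.0, 1.0, 0.0]
```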
5 Tractability of the considered problem in terms of the input dimension

In this section we discuss the tractability (in terms of the input dimension $n$) of the considered problem, i.e., the dependence of the approximation error on $n$. We show a novel result stating that, assuming a continuous weight selection, the problem of approximating $C^k$-functions is intractable, i.e., the number of neurons that is required to achieve a non-trivial approximation error is necessarily exponential in $n$. In the literature this is referred to as the curse of dimensionality. The proof of the theorem combines ideas from [19] and [35] and is contained in Appendix E.2.

Theorem 5.1. Let $s\in\mathbb{N}$. With $\|\cdot\|_{C^k([-1,1]^s;\mathbb{R})}$ defined similarly to (2.1), we write

$$\|f\|_{C^\infty([-1,1]^s;\mathbb{R})} := \sup_{k\in\mathbb{N}}\|f\|_{C^k([-1,1]^s;\mathbb{R})}\in[0,\infty]$$

for any function $f\in C^\infty([-1,1]^s;\mathbb{R})$ and denote by $C^{\infty,\infty,s}$ the set of all $f\in C^\infty([-1,1]^s;\mathbb{R})$ for which this expression is finite. Let $a : C^{\infty,\infty,s}\to\mathbb{R}^{2^s-1}$ be continuous with respect to some norm on $C^{\infty,\infty,s}$ and moreover, let $M : \mathbb{R}^{2^s-1}\to C([-1,1]^s;\mathbb{R})$ be an arbitrary map. Then it holds that

$$\sup_{\substack{f\in C^{\infty,\infty,s},\\ \|f\|_{C^\infty([-1,1]^s;\mathbb{R})}\le 1}} \big\|f-M(a(f))\big\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge 1.$$

Note that Theorem 5.1 is formulated for real-valued functions but can be transferred to the complex-valued setting (see Corollary E.6). We decided to include the real-valued statement because it is expected to be of greater interest in the community than the complex-valued analog. Moreover, we stress that Theorem 5.1 is not limited to the class of shallow neural networks but refers to any function class that is parametrizable using finitely many parameters (in particular, e.g., the class of neural networks with possibly more than one hidden layer).

We now examine in what way the constant $c$ appearing in Theorem 3.2 suffers from the curse of dimensionality. To this end, it is convenient to rewrite the result from Theorem 3.2 as

$$\sup_{\|f\|_{C^k(\Omega_n;\mathbb{C})}\le 1}\ \inf_{\Phi\in\mathcal{F}^\varphi_{n,m},\,\sigma\in\mathbb{C}^m} \|f-\sigma^T\Phi\|_{L^\infty(\Omega_n;\mathbb{C})} \le (c\cdot m)^{-k/(2n)},$$

where the constant $c = c(n,k)>0$ is independent of $m$. Writing the result in that way, one sees immediately that, if one seeks to have a worst-case approximation error of less than $\varepsilon>0$, it is sufficient to take $m = \lceil\frac1c\,\varepsilon^{-2n/k}\rceil$ neurons in the hidden layer of the network. Corollary E.7 shows that necessarily $c\le 16\cdot 2^{-n}$, and therefore the constant $c$ unavoidably suffers from the curse of dimensionality. An analysis of the constant (where we refer to Appendices C.1 to C.3 for the details) shows that in our case we can establish the bound $c(n,k)\ge\exp(-Cn^2)\cdot k^{-4n}$ with an absolute constant $C>0$. We remark that, since the constant suffers from the curse of dimensionality in any case, we have not put much effort into optimizing it; there is therefore probably ample room for improvement.

6 Limitations

To conduct a comprehensive evaluation of machine learning algorithms, one must analyze the questions of approximation, generalization, and optimization through training algorithms. The present paper, however, only focuses on the aspect of approximation. Analyzing whether the proven approximation rate can be attained with learning algorithms such as (stochastic) gradient descent falls outside the scope of this paper.
Furthermore, the examination of approximation rates under possibly discontinuous weight assignment is not yet fully resolved by our results. It is an open question which rate is optimally achievable in that case, depending on the choice of the activation function, and specifically when distinguishing between shallow and deep networks. We want to mention the following two points, which indicate that this is a quite subtle question:

1. For deep NNs (with more than two hidden layers) with general smooth activation function, it is not possible to derive any non-trivial lower bounds in the setting of unrestricted weight assignment, since there exists an activation function with the property that NNs of constant size using this activation function can approximate any continuous function to arbitrary precision (see [31, Theorem 4]). Note that [31] considers real-valued NNs, but the results can be transferred to CVNNs with a suitable choice of the activation function.

2. In the real-valued case, fully general lower bounds for the approximation capabilities of shallow NNs have been derived by using results from [22] regarding the approximation properties of so-called ridge functions, i.e., functions of the form $\sum_{j=1}^m \phi_j(\langle a_j,x\rangle)$ with $a_j\in\mathbb{R}^d$ and each $\phi_j:\mathbb{R}\to\mathbb{R}$. It is an interesting problem to generalize these results to higher-dimensional ridge functions of the form $\sum_{j=1}^m \phi_j(A_jx)$, where each $\phi_j:\mathbb{R}^s\to\mathbb{R}$ and $A_j\in\mathbb{R}^{s\times d}$. This would imply lower bounds for shallow CVNNs. However, such a generalization seems to be highly non-trivial and is outside the scope of the present paper.

7 Conclusion

This paper analyzes error bounds for the approximation of complex-valued $C^k$-functions by means of complex-valued neural networks with smooth and non-polyharmonic activation functions. It is demonstrated that complex-valued neural networks with these activation functions achieve the same approximation rate as real-valued networks that employ smooth and non-polynomial activation functions. This is an important theoretical finding, since CVNNs are on the one hand more restrictive than real-valued neural networks (since the mappings between layers should be $\mathbb{C}$-linear and not just $\mathbb{R}$-linear), but on the other hand more versatile, since the activation function is a mapping from $\mathbb{C}$ to $\mathbb{C}$ (i.e., from $\mathbb{R}^2$ to $\mathbb{R}^2$) rather than from $\mathbb{R}$ to $\mathbb{R}$ as in the real case. Additionally, it is established that the proven approximation rate is optimal if one assumes a continuous weight selection. In summary, if one focuses on the approximation rate for $C^k$-functions, CVNNs have the same excellent approximation properties as real-valued networks. The behavior of the approximation rate for unrestricted weight selection is more subtle. It is shown that a rate of $m^{-k/(2n-1)}$ can be achieved for certain activation functions (Theorem 4.2), but in general one cannot improve on the rate that is attainable for continuous weight selection (Theorem 4.3). While the proven approximation rate is optimal under the assumption of continuous weight selection, the involved constants suffer from the curse of dimensionality. Section 5, however, shows that this is inevitable in the given setting. Such theoretical approximation results contribute to the mathematical understanding of Deep Learning. The remarkable approximation-theoretical properties of neural networks can be seen as one reason why neural networks provide outstanding results in many applications.

Acknowledgments.
PG and FV acknowledge support by the German Science Foundation (DFG) in the context of the Emmy Noether junior research group VO 2594/1-1. FV acknowledges support by the Hightech Agenda Bavaria.

References

[1] M. Abramowitz and I. A. Stegun, eds. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. 9th Dover print. Dover Books on Mathematics. New York, NY: Dover Publ., 2013. ISBN: 978-0-486-61272-0.
[2] B. Adcock, S. Brugiapaglia, and C. G. Webster. Sparse polynomial approximation of high-dimensional functions. Computational Science and Engineering 25. Philadelphia: Society for Industrial and Applied Mathematics, 2022. 292 pp. ISBN: 978-1-61197-687-8.
[3] C. D. Aliprantis and K. C. Border. Infinite dimensional analysis: a hitchhiker's guide. 3rd ed. Berlin; New York: Springer, 2006. ISBN: 978-3-540-29586-0.
[4] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge; New York, NY: Cambridge University Press, 1999. ISBN: 978-0-521-57353-5.
[5] M. Arjovsky, A. Shah, and Y. Bengio. "Unitary evolution recurrent neural networks". In: International Conference on Machine Learning. PMLR, 2016, pp. 1120–1128. URL: http://proceedings.mlr.press/v48/arjovsky16.pdf.
[6] K. E. Atkinson. An introduction to numerical analysis. 2nd ed. New York: Wiley, 1989. ISBN: 978-0-471-62489-9.
[7] M. B. Balk. Polyanalytic functions. Mathematical Research 63. Berlin: Akademie Verlag, 1991. ISBN: 978-3-05-501292-1.
[8] J. A. Barrachina, C. Ren, C. Morisseau, G. Vieillard, and J.-P. Ovarlez. "Complex-valued vs. real-valued neural networks for classification perspectives: An example on non-circular data". In: ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2990–2994.
[9] J. A. Barrachina, C. Ren, C. Morisseau, G. Vieillard, and J.-P. Ovarlez. "Comparison between equivalent architectures of complex-valued and real-valued neural networks – Application on polarimetric SAR image segmentation". In: Journal of Signal Processing Systems 95.1 (2023), pp. 57–66.
[10] A. R. Barron. "Universal approximation bounds for superpositions of a sigmoidal function". In: IEEE Transactions on Information Theory 39.3 (1993), pp. 930–945. DOI: 10.1109/18.256500.
[11] H. Bauer. Measure and integration theory. De Gruyter Studies in Mathematics 26. Berlin; New York: W. de Gruyter, 2001. 230 pp. ISBN: 978-3-11-016719-1.
[12] A. Benjamin and J. J. Quinn. Proofs that really count: the art of combinatorial proof. Dolciani Mathematical Expositions. Washington, DC: Mathematical Association of America, 2003. ISBN: 978-0-88385-333-7.
[13] D. Berend and T. Tassa. "Improved bounds on Bell numbers and on moments of sums of random variables". In: Probability and Mathematical Statistics 30.2 (2010), pp. 185–205.
[14] A. Caragea, D. G. Lee, J. Maly, G. Pfander, and F. Voigtlaender. "Quantitative Approximation Results for Complex-Valued Neural Networks". In: SIAM Journal on Mathematics of Data Science 4.2 (2022), pp. 553–580. DOI: 10.1137/21M1429540.
[15] H. H. Çevik, Y. E. Acar, and M. Çunkaş. "Day Ahead Wind Power Forecasting Using Complex Valued Neural Network". In: 2018 International Conference on Smart Energy Systems and Technologies (SEST). 2018, pp. 1–6. DOI: 10.1109/SEST.2018.8495637.
[16] E. Cole, J. Cheng, J. Pauly, and S. Vasanawala. "Analysis of deep complex-valued convolutional neural networks for MRI reconstruction and phase-focused applications". In: Magnetic Resonance in Medicine 86.2 (2021), pp. 1093–1109.
[17] G. Cybenko.
"Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals, and Systems 2.4 (Dec. 1989), pp. 303–314. DOI: 10.1007/BF02551274.
[18] K. Deimling. Nonlinear Functional Analysis. Dover Books on Mathematics. Dover Publications, 2013. ISBN: 978-3-662-00549-1.
[19] R. A. DeVore, R. Howard, and C. Micchelli. "Optimal nonlinear approximation". In: Manuscripta Mathematica 63.4 (Dec. 1989), pp. 469–478. DOI: 10.1007/BF01171759.
[20] W. F. Donoghue. Distributions and Fourier transforms. New York: Academic Press, 1969. ISBN: 978-0-12-220650-4.
[21] G. B. Folland. Real analysis: modern techniques and their applications. 2nd ed. Pure and Applied Mathematics. New York: Wiley, 1999. ISBN: 978-0-471-31716-6.
[22] Y. Gordon, V. Maiorov, M. Meyer, and S. Reisner. "On the Best Approximation by Ridge Functions in the Uniform Norm". In: Constructive Approximation 18.1 (Jan. 2001), pp. 61–85. DOI: 10.1007/s00365-001-0009-5.
[23] A. Graves, A.-R. Mohamed, and G. Hinton. "Speech recognition with deep recurrent neural networks". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 2013, pp. 6645–6649. DOI: 10.1109/ICASSP.2013.6638947.
[24] W. P. Johnson. "The Curious History of Faà di Bruno's Formula". In: The American Mathematical Monthly 109.3 (Mar. 2002), p. 217. DOI: 10.2307/2695352.
[25] P. C. Kainen, V. Kůrková, and A. Vogt. "Approximation by neural networks is not continuous". In: Neurocomputing 29.1 (1999), pp. 47–56. DOI: 10.1016/S0925-2312(99)00111-3.
[26] L. Kaup and B. Kaup. Holomorphic Functions of Several Variables: An Introduction to the Fundamental Theory. De Gruyter, Dec. 1983. ISBN: 978-3-11-004150-7. DOI: 10.1515/9783110838350.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Communications of the ACM 60.6 (May 2017), pp. 84–90. DOI: 10.1145/3065386.
[28] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function". In: Neural Networks 6.6 (Jan. 1993), pp. 861–867. DOI: 10.1016/S0893-6080(05)80131-5.
[29] G. Lorentz. Approximation of Functions. Holt, Rinehart and Winston, 1966.
[30] V. Maiorov. "On Best Approximation by Ridge Functions". In: Journal of Approximation Theory 99.1 (July 1999), pp. 68–94. DOI: 10.1006/jath.1998.3304.
[31] V. Maiorov and A. Pinkus. "Lower bounds for approximation by MLP neural networks". In: Neurocomputing 25.1 (1999), pp. 81–91. DOI: 10.1016/S0925-2312(98)00111-8.
[32] A. Marseet and F. Sahin. "Application of complex-valued convolutional neural network for next generation wireless networks". In: 2017 IEEE Western New York Image and Signal Processing Workshop (WNYISPW). 2017, pp. 1–5. DOI: 10.1109/WNYIPW.2017.8356260.
[33] H. N. Mhaskar. "Neural Networks for Optimal Approximation of Smooth and Analytic Functions". In: Neural Computation 8.1 (Jan. 1996), pp. 164–177. DOI: 10.1162/neco.1996.8.1.164.
[34] C. Muscalu and W. Schlag. Classical and multilinear harmonic analysis. Volume 1. Cambridge, UK: Cambridge University Press, 2013. ISBN: 978-0-521-88245-3.
[35] E. Novak and H. Woźniakowski. "Approximation of infinitely differentiable multivariate functions is intractable". In: Journal of Complexity 25.4 (2009), pp. 398–404. DOI: 10.1016/j.jco.2008.11.002.
[36] A. Pinkus. Ridge functions. Cambridge: Cambridge University Press, 2016.
ISBN: 978-1-107-12439-4.
[37] Y.-D. Qu, R.-Q. Zhang, S.-Q. Shen, J. Yu, and M. Li. "Entanglement Detection with Complex-Valued Neural Networks". In: International Journal of Theoretical Physics 62.9 (2023), p. 206.
[38] H. El-Rewaidy, U. Neisius, J. Mancio, S. Kucukseymen, J. Rodriguez, A. Paskavitz, B. Menze, and R. Nezafat. "Deep complex convolutional network for fast reconstruction of 3D late gadolinium enhancement cardiac MRI". In: NMR in Biomedicine 33.7 (2020), e4312.
[39] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: from theory to algorithms. New York, NY, USA: Cambridge University Press, 2014. 397 pp. ISBN: 978-1-107-05713-5.
[40] Z. Shen, H. Yang, and S. Zhang. "Deep Network Approximation Characterized by Number of Neurons". In: Communications in Computational Physics 28.5 (2020), pp. 1768–1811. DOI: 10.4208/cicp.OA-2020-0149.
[41] J. W. Siegel and J. Xu. "Approximation rates for neural networks with general activation functions". In: Neural Networks 128 (2020), pp. 313–321. DOI: 10.1016/j.neunet.2020.05.019.
[42] P. Stein. "A Note on the Volume of a Simplex". In: The American Mathematical Monthly 73.3 (Mar. 1966), p. 299. DOI: 10.2307/2315353.
[43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is All you Need". In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[44] P. Virtue, S. X. Yu, and M. Lustig. "Better than real: Complex-valued neural nets for MRI fingerprinting". In: 2017 IEEE International Conference on Image Processing (ICIP). Beijing: IEEE, Sept. 2017, pp. 3953–3957. ISBN: 978-1-5090-2175-8. DOI: 10.1109/ICIP.2017.8297024.
[45] F. Voigtlaender. "The universal approximation theorem for complex-valued neural networks". In: Applied and Computational Harmonic Analysis (Dec. 2022). DOI: 10.1016/j.acha.2022.12.002.
[46] D. Yarotsky. "Error bounds for approximations with deep ReLU networks". In: Neural Networks 94 (Oct. 2017), pp. 103–114. DOI: 10.1016/j.neunet.2017.07.002.
[47] D. Yarotsky and A. Zhevnerchuk. "The phase diagram of approximation rates for deep neural networks". In: Advances in Neural Information Processing Systems 33 (2020), pp. 13005–13015. URL: https://proceedings.neurips.cc/paper/2020/file/979a3f14bae523dc5101c52120c535e9-Paper.pdf.
[48] Z. Zhang, H. Wang, F. Xu, and Y.-Q. Jin. "Complex-Valued Convolutional Neural Network and Its Application in Polarimetric SAR Image Classification". In: IEEE Transactions on Geoscience and Remote Sensing 55.12 (2017), pp. 7177–7188. DOI: 10.1109/TGRS.2017.2743222.

A Notation, Wirtinger derivatives and special activation functions

A.1 Notation and Wirtinger derivatives

Throughout the paper, multi-indices (i.e., elements of $\mathbb{N}^n_0$) are denoted using boldface. For $\mathbf m\in\mathbb{N}^n_0$ we have the usual notation $|\mathbf m| = \sum_{j=1}^n m_j$. For a number $m\in\mathbb{N}_0$ we write $\mathbf m\le m$ iff $m_j\le m$ for every $j\in\{1,\dots,n\}$, and for another multi-index $\mathbf p\in\mathbb{N}^n_0$ we write $\mathbf m\le\mathbf p$ iff $m_j\le p_j$ for every $j\in\{1,\dots,n\}$. Furthermore, we write $\binom{\mathbf p}{\mathbf r} := \prod_{j=1}^n\binom{p_j}{r_j}$ for two multi-indices $\mathbf p,\mathbf r\in\mathbb{N}^n_0$ with $\mathbf r\le\mathbf p$. For a complex vector $z\in\mathbb{C}^n$ we write

$$z^{\mathbf m} := \prod_{j=1}^n z_j^{m_j}\quad\text{and}\quad \overline z^{\,\mathbf m} := \prod_{j=1}^n \overline{z_j}^{\,m_j}.$$

For a point $x\in\mathbb{R}^n$ and $r>0$ we define $B_r(x) := \{y\in\mathbb{R}^n : \|x-y\|<r\}$, with $\|\cdot\|$ denoting the usual Euclidean distance. This definition is analogously transferred to $\mathbb{C}^n$.
$\mathbb{C}^n$ is canonically isomorphic to $\mathbb{R}^{2n}$ by virtue of the isomorphism

$$\varphi_n : \mathbb{R}^{2n}\to\mathbb{C}^n,\quad (x_1,\dots,x_n,y_1,\dots,y_n)\mapsto(x_1+iy_1,\dots,x_n+iy_n). \tag{A.1}$$

The Wirtinger derivatives defined in Section 2 have the following properties that we are going to use, which can be found for example in [26, E.1a]. Here, we assume that $U\subseteq\mathbb{C}$ is open and $f\in C^1(U;\mathbb{C})$.

1. $\partial_{\mathrm{wirt}}$ and $\overline\partial_{\mathrm{wirt}}$ are both $\mathbb{C}$-linear operators on the set $C^1(U;\mathbb{C})$.

2. $f$ is complex-differentiable in $z\in U$ iff $\overline\partial_{\mathrm{wirt}}f(z) = 0$, and in this case the equality $\partial_{\mathrm{wirt}}f(z) = f'(z)$ holds true, with $f'(z)$ denoting the complex derivative of $f$ at $z$.

3. We have the conjugation rules $\overline{\partial_{\mathrm{wirt}}f} = \overline\partial_{\mathrm{wirt}}\overline f$ and $\overline{\overline\partial_{\mathrm{wirt}}f} = \partial_{\mathrm{wirt}}\overline f$.

4. If $g\in C^1(U;\mathbb{C})$, the following product rules for Wirtinger derivatives hold for every $z\in U$:
$$\partial_{\mathrm{wirt}}(f\cdot g)(z) = \partial_{\mathrm{wirt}}f(z)\cdot g(z) + f(z)\cdot\partial_{\mathrm{wirt}}g(z),\qquad \overline\partial_{\mathrm{wirt}}(f\cdot g)(z) = \overline\partial_{\mathrm{wirt}}f(z)\cdot g(z) + f(z)\cdot\overline\partial_{\mathrm{wirt}}g(z).$$
This product rule is not explicitly stated in [26] but follows easily from the product rules for $\frac{\partial}{\partial x}$ and $\frac{\partial}{\partial y}$.

5. If $V\subseteq\mathbb{C}$ is an open set and $g\in C^1(V;\mathbb{C})$ with $g(V)\subseteq U$, then the following chain rules for Wirtinger derivatives hold true:
$$\partial_{\mathrm{wirt}}(f\circ g) = \big[(\partial_{\mathrm{wirt}}f)\circ g\big]\cdot\partial_{\mathrm{wirt}}g + \big[(\overline\partial_{\mathrm{wirt}}f)\circ g\big]\cdot\partial_{\mathrm{wirt}}\overline g,$$
$$\overline\partial_{\mathrm{wirt}}(f\circ g) = \big[(\partial_{\mathrm{wirt}}f)\circ g\big]\cdot\overline\partial_{\mathrm{wirt}}g + \big[(\overline\partial_{\mathrm{wirt}}f)\circ g\big]\cdot\overline\partial_{\mathrm{wirt}}\overline g.$$

6. If $f\in C^2(U;\mathbb{C})$ then we have $\Delta f(z) = 4\,\partial_{\mathrm{wirt}}\overline\partial_{\mathrm{wirt}}f(z)$ for every $z\in U$, with $\Delta$ denoting the usual Laplace operator $\Delta = \partial^{(2,0)}+\partial^{(0,2)}$ (cf. [7, equation (1.7)]).

For an open set $U\subseteq\mathbb{C}^n$, a function $f\in C^k(U;\mathbb{C})$ and a multi-index $\boldsymbol\ell\in\mathbb{N}^n_0$ with $|\boldsymbol\ell|\le k$ we write $\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f$ and $\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f$ for the iterated Wirtinger derivatives according to the multi-index $\boldsymbol\ell$.

Proposition A.1. Let $U\subseteq\mathbb{C}^n$ be an open set and $f\in C^k(U;\mathbb{C})$. Then, identifying $f$ with the function $f\circ\varphi_n$ with $\varphi_n$ as in (A.1), for any multi-indices $\mathbf m,\boldsymbol\ell\in\mathbb{N}^n_0$ with $|\mathbf m+\boldsymbol\ell|\le k$ we have the representation

$$\partial^{\mathbf m}_{\mathrm{wirt}}\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f(a) = \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\in\mathbb{N}^{2n}_0\\ \mathbf p'+\mathbf p''=\mathbf m+\boldsymbol\ell}} b_{\mathbf p}\,(\partial^{\mathbf p}f)(a),\qquad a\in U,$$

with complex numbers $b_{\mathbf p}\in\mathbb{C}$ only depending on $\mathbf p$, $\mathbf m$ and $\boldsymbol\ell$, and writing $\mathbf p = (\mathbf p',\mathbf p'')$ for the concatenation of the multi-indices $\mathbf p'$ and $\mathbf p''\in\mathbb{N}^n_0$. In particular, the coefficients do not depend on $f$.

Proof. The proof is by induction over $\mathbf m$ and $\boldsymbol\ell$. We first assume $\mathbf m = 0$ and show the claim for all $\boldsymbol\ell\in\mathbb{N}^n_0$ with $|\boldsymbol\ell|\le k$. In the case $\boldsymbol\ell = 0$ there is nothing to show, so we assume the claim to be true for a fixed $\boldsymbol\ell\in\mathbb{N}^n_0$ with $|\boldsymbol\ell|<k$ and take $j\in\{1,\dots,n\}$ arbitrarily. Then we get

$$\overline\partial^{\boldsymbol\ell+e_j}_{\mathrm{wirt}}f(a) = \overline\partial^{e_j}_{\mathrm{wirt}}\big(\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f\big)(a) \overset{\mathrm{IH}}{=} \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\boldsymbol\ell}} b_{\mathbf p}\,\overline\partial^{e_j}_{\mathrm{wirt}}(\partial^{\mathbf p}f)(a) = \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\boldsymbol\ell}} \frac{b_{\mathbf p}}{2}\Big(\big(\partial^{(\mathbf p'+e_j,\mathbf p'')}f\big)(a) + i\,\big(\partial^{(\mathbf p',\mathbf p''+e_j)}f\big)(a)\Big) = \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\boldsymbol\ell+e_j}} \tilde b_{\mathbf p}\,(\partial^{\mathbf p}f)(a),$$

as was to be shown. Since we have shown the case $\mathbf m = 0$, we may assume the claim to be true for fixed $\mathbf m,\boldsymbol\ell\in\mathbb{N}^n_0$ with $|\mathbf m+\boldsymbol\ell|<k$. Then we deduce

$$\partial^{\mathbf m+e_j}_{\mathrm{wirt}}\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f(a) = \partial^{e_j}_{\mathrm{wirt}}\big(\partial^{\mathbf m}_{\mathrm{wirt}}\overline\partial^{\boldsymbol\ell}_{\mathrm{wirt}}f\big)(a) \overset{\mathrm{IH}}{=} \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\mathbf m+\boldsymbol\ell}} b_{\mathbf p}\,\partial^{e_j}_{\mathrm{wirt}}(\partial^{\mathbf p}f)(a) = \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\mathbf m+\boldsymbol\ell}} \frac{b_{\mathbf p}}{2}\Big(\big(\partial^{(\mathbf p'+e_j,\mathbf p'')}f\big)(a) - i\,\big(\partial^{(\mathbf p',\mathbf p''+e_j)}f\big)(a)\Big) = \sum_{\substack{\mathbf p=(\mathbf p',\mathbf p'')\\ \mathbf p'+\mathbf p''=\mathbf m+\boldsymbol\ell+e_j}} \tilde b_{\mathbf p}\,(\partial^{\mathbf p}f)(a).$$

Hence the claim follows using the principle of mathematical induction.

Proposition A.1 is technical but crucial: in the course of the paper we will need to approximate Wirtinger derivatives of certain functions. In fact, however, we will approximate real derivatives and take advantage of the fact that Wirtinger derivatives are linear combinations of these.
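These definitions and rules translate directly into a numerical check. The following sketch (ours) approximates $\partial_{\mathrm{wirt}}$ and $\overline\partial_{\mathrm{wirt}}$ by central differences in $x$ and $y$ and confirms the values $\partial_{\mathrm{wirt}}(z^2\overline z) = 2z\overline z$ and $\overline\partial_{\mathrm{wirt}}(z^2\overline z) = z^2$ obtained by treating $z$ and $\overline z$ as independent variables.

```python
import numpy as np

def wirtinger(f, z, h=1e-6):
    """Central-difference approximation of (d_wirt f(z), dbar_wirt f(z)), using
    d_wirt = (d/dx - i d/dy)/2 and dbar_wirt = (d/dx + i d/dy)/2."""
    fx = (f(z + h) - f(z - h)) / (2 * h)            # partial derivative in x
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # partial derivative in y
    return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)

f = lambda z: z**2 * np.conj(z)
z0 = 0.7 - 0.4j
d, dbar = wirtinger(f, z0)
print(d, 2 * z0 * np.conj(z0))  # d_wirt(z^2 conj(z)) = 2 z conj(z)
print(dbar, z0**2)              # dbar_wirt(z^2 conj(z)) = z^2
```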
A.2 Concrete examples of activation functions

In this section we analyze concrete activation functions that are commonly used in practical applications of complex-valued neural networks. We are going to show that those activation functions are smooth and non-polyharmonic on some non-empty open subset of $\mathbb{C}$. Our first result analyzes a family of "real activation functions interpreted as complex activation functions".

Proposition A.2. Let $\rho\in C^\infty(\mathbb{R};\mathbb{R})$ be non-polynomial and let $\psi:\mathbb{C}\to\mathbb{C}$ be defined as $\psi(z) := \rho(\operatorname{Re}(z))$ resp. $\psi(z) := \rho(\operatorname{Im}(z))$. Then $\psi$ is smooth and non-polyharmonic.

Proof. Since $\psi$ depends only on the real resp. imaginary part of the input, we see directly from the definition of the Wirtinger derivatives that

$$\partial_{\mathrm{wirt}}\psi(z) = \overline\partial_{\mathrm{wirt}}\psi(z) = \tfrac12\,\rho'(\operatorname{Re}(z))\quad\text{resp.}\quad \partial_{\mathrm{wirt}}\psi(z) = -\overline\partial_{\mathrm{wirt}}\psi(z) = -\tfrac i2\,\rho'(\operatorname{Im}(z)).$$

Hence we see for arbitrary $m,\ell\in\mathbb{N}_0$ that

$$\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\psi(z) = \frac{1}{2^{m+\ell}}\,\rho^{(m+\ell)}(\operatorname{Re}(z))\quad\text{resp.}\quad \partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\psi(z) = \frac{i^{\ell-m}}{2^{m+\ell}}\,\rho^{(m+\ell)}(\operatorname{Im}(z)).$$

Since $\rho$ is non-polynomial we can choose a real number $x$ such that $\rho^{(n)}(x)\neq0$ for all $n\in\mathbb{N}_0$ (cf. for instance [20, p. 53]). Thus, $\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\psi(z)\neq0$ for all $m,\ell\in\mathbb{N}_0$ whenever $z\in\mathbb{C}$ with $\operatorname{Re}(z) = x$, or $\operatorname{Im}(z) = x$, respectively.

Next, we consider the modReLU, which was defined in Section 2.

Theorem A.3. Let $b\in(-\infty,0)$. Writing $\sigma = \sigma_{\mathrm{modReLU},b}$, one has for every $z\in\mathbb{C}$ with $|z|+b>0$ the identity

$$\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\sigma(z) = \begin{cases} (|z|+b)\,\dfrac{z}{|z|}, & m=\ell=0,\\[4pt] 1+\dfrac b2\cdot\dfrac1{|z|}, & m=1,\ \ell=0,\\[4pt] b\,q_{m,\ell}\,\dfrac{z^{\ell-m+1}}{|z|^{2\ell+1}}, & m\le\ell+1,\ \ell\ge1,\\[4pt] b\,q_{m,\ell}\,\dfrac{\overline z^{\,m-\ell-1}}{|z|^{2m-1}}, & m\ge\ell+1,\ m\ge2,\end{cases}$$

for every $m\in\mathbb{N}_0$ and $\ell\in\mathbb{N}_0$. Here, the numbers $q_{m,\ell}$ are non-zero and rational. Furthermore, note that all cases for choices of $m$ and $\ell$ are covered, by observing that we can either have the case $m\ge\ell+1$ (where either $m\ge2$ or $m=1,\ \ell=0$) or $m\le\ell+1$, where the latter case is again split into $\ell=0$ and $\ell\ge1$.

Proof. We first calculate certain Wirtinger derivatives for $z\neq0$. First note

$$\overline\partial_{\mathrm{wirt}}|z| = \frac12\Big(\partial^{(1,0)}+i\,\partial^{(0,1)}\Big)\sqrt{\operatorname{Re}(z)^2+\operatorname{Im}(z)^2} = \frac{\operatorname{Re}(z)+i\operatorname{Im}(z)}{2|z|} = \frac{z}{2|z|},$$

and similarly $\partial_{\mathrm{wirt}}|z| = \frac{\overline z}{2|z|}$, as well as $\overline\partial_{\mathrm{wirt}}|z|^{-m} = -\frac m2\,\frac{z}{|z|^{m+2}}$ and $\partial_{\mathrm{wirt}}|z|^{-m} = -\frac m2\,\frac{\overline z}{|z|^{m+2}}$ for any $m\in\mathbb{N}$. Using the product rule for Wirtinger derivatives (see Appendix A.1), we see

$$\overline\partial_{\mathrm{wirt}}\Big(\frac{z^\ell}{|z|^m}\Big) = \underbrace{\overline\partial_{\mathrm{wirt}}(z^\ell)}_{=0}\cdot\frac1{|z|^m} + z^\ell\cdot\overline\partial_{\mathrm{wirt}}\big(|z|^{-m}\big) = -\frac m2\cdot\frac{z^{\ell+1}}{|z|^{m+2}} \tag{A.2}$$

for any $m\in\mathbb{N}$ and $\ell\in\mathbb{N}_0$, and furthermore

$$\partial_{\mathrm{wirt}}\Big(\frac{z^\ell}{|z|^m}\Big) = \ell\,\frac{z^{\ell-1}}{|z|^m} + z^\ell\cdot\partial_{\mathrm{wirt}}\big(|z|^{-m}\big) = \Big(\ell-\frac m2\Big)\,\frac{z^{\ell-1}}{|z|^m} \tag{A.3}$$

for $m,\ell\in\mathbb{N}$, and finally

$$\partial_{\mathrm{wirt}}\Big(\frac{\overline z^{\,\ell}}{|z|^m}\Big) = \underbrace{\partial_{\mathrm{wirt}}(\overline z^{\,\ell})}_{=0}\cdot\frac1{|z|^m} + \overline z^{\,\ell}\cdot\partial_{\mathrm{wirt}}\big(|z|^{-m}\big) = -\frac m2\cdot\frac{\overline z^{\,\ell+1}}{|z|^{m+2}} \tag{A.4}$$

for $m\in\mathbb{N}$ and $\ell\in\mathbb{N}_0$. Having proven those three identities, we can start with the actual computation. We first fix $m=0$ and prove the claimed identity by induction over $\ell$. The case $\ell=0$ follows directly from the definition of the modReLU, and for $\ell=1$ we note

$$\overline\partial_{\mathrm{wirt}}\sigma(z) = \underbrace{\overline\partial_{\mathrm{wirt}}(z)}_{=0} + b\,\overline\partial_{\mathrm{wirt}}\Big(\frac z{|z|}\Big) \overset{(A.2)}{=} b\cdot\underbrace{\Big(-\frac12\Big)}_{=:q_{0,1}}\cdot\frac{z^2}{|z|^3},$$

which is the claimed form. Then, using induction, we compute

$$\overline\partial^{\ell+1}_{\mathrm{wirt}}\sigma(z) = \overline\partial_{\mathrm{wirt}}\Big(b\,q_{0,\ell}\,\frac{z^{\ell+1}}{|z|^{2\ell+1}}\Big) \overset{(A.2)}{=} b\cdot\underbrace{q_{0,\ell}\,\Big(-\frac{2\ell+1}2\Big)}_{=:q_{0,\ell+1}}\cdot\frac{z^{\ell+2}}{|z|^{2\ell+3}},$$

so that the case $m=0$ is complete. Now we deal with the case $m\le\ell+1$. The case $\ell=0$ and $m=1$ is proven by computing

$$\partial_{\mathrm{wirt}}\sigma(z) = \partial_{\mathrm{wirt}}(z) + b\,\partial_{\mathrm{wirt}}\Big(\frac z{|z|}\Big) \overset{(A.3)}{=} 1 + \frac b2\cdot\frac1{|z|},$$

so we can assume $\ell>0$. Since we already dealt with the case $m=0$, we may assume the claim to be true for a fixed $m\le\ell$. Then we compute

$$\partial^{m+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\sigma(z) = \partial_{\mathrm{wirt}}\Big(b\,q_{m,\ell}\,\frac{z^{\ell-m+1}}{|z|^{2\ell+1}}\Big) \overset{(A.3)}{=} b\cdot\underbrace{q_{m,\ell}\,\Big(\frac12-m\Big)}_{=:q_{m+1,\ell}}\cdot\frac{z^{\ell-m}}{|z|^{2\ell+1}},$$

which is of the desired form. Note that (A.3) is indeed applicable because $\ell-m+1\ge1$. Finally, we consider the case where $m\ge\ell+1$ and $m\ge2$. The case $m=\ell+1$ has already been shown. Using induction, we see

$$\partial^{m+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\sigma(z) = \partial_{\mathrm{wirt}}\Big(\delta_{(m,\ell)=(1,0)} + b\,q_{m,\ell}\,\frac{\overline z^{\,m-\ell-1}}{|z|^{2m-1}}\Big) \overset{(A.4)}{=} b\cdot\underbrace{q_{m,\ell}\,\Big(-\frac{2m-1}2\Big)}_{=:q_{m+1,\ell}}\cdot\frac{\overline z^{\,m-\ell}}{|z|^{2m+1}},$$

so the proof is complete.
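The first-order cases of Theorem A.3 can be verified numerically. The sketch below (ours) compares central-difference Wirtinger derivatives of the modReLU against the closed forms $1+\frac b2\cdot\frac1{|z|}$ and $b\,q_{0,1}\,\frac{z^2}{|z|^3}$ with $q_{0,1}=-\frac12$, at a point with $|z|>-b$.

```python
import numpy as np

def wirt(f, z, h=1e-6):
    """Central-difference Wirtinger derivatives (d_wirt f, dbar_wirt f) at z."""
    fx = (f(z + h) - f(z - h)) / (2 * h)
    fy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)

b = -1.0
sigma = lambda z: (abs(z) + b) * z / abs(z) if abs(z) + b >= 0 else 0.0
z0 = 2.0 - 1.5j  # |z0| = 2.5 > -b, so we stay in the smooth regime
d, dbar = wirt(sigma, z0)
print(d,    1 + b / (2 * abs(z0)))        # case m = 1, l = 0 of Theorem A.3
print(dbar, -b / 2 * z0**2 / abs(z0)**3)  # case m = 0, l = 1, with q_{0,1} = -1/2
```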
From Theorem A.3 we can now deduce that the modReLU is smooth and non-polyharmonic on some non-empty open subset of $\mathbb{C}$.

Corollary A.4. Let $b\in(-\infty,0)$ and $z\in\mathbb{C}$ with $|z|>-b$. Then we have $\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\sigma_{\mathrm{modReLU},b}(z)\neq0$ for all $m,\ell\in\mathbb{N}_0$. In particular, $\sigma_{\mathrm{modReLU},b}$ is smooth and non-polyharmonic on the set $\{z\in\mathbb{C}:|z|>-b\}$. Furthermore, $\sigma_{\mathrm{modReLU},b}$ is continuous on the entire complex plane.

Proof. The first part follows from Theorem A.3 by noting that if $|z|>-b$ we have in particular $|z|>-b/2$ and $|z|>0$. For the second part, we first note that $\sigma_{\mathrm{modReLU},b}$ is trivially continuous in $z\in\mathbb{C}$ if $|z|\neq-b$. Hence, we assume $|z|=-b$. Take any sequence $(z_j)_{j\in\mathbb{N}}$ with $z_j\to z$ as $j\to\infty$ and note

$$|\sigma_{\mathrm{modReLU},b}(z_j)-\sigma_{\mathrm{modReLU},b}(z)| = |\sigma_{\mathrm{modReLU},b}(z_j)| \begin{cases} = 0, & \text{if } |z_j|<-b,\\ \le |z_j|+b \to |z|+b = 0, & \text{if } |z_j|\ge-b.\end{cases}$$

This shows the continuity of $\sigma_{\mathrm{modReLU},b}$.

Next we analyze the complex cardioid, which was defined in Section 2.

Theorem A.5. For any $z\in\mathbb{C}$ with $z\neq0$ and any $m,\ell\in\mathbb{N}_0$ we have

$$\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = \begin{cases} \dfrac z2+\dfrac14\,\dfrac{z^2}{|z|}+\dfrac{|z|}4, & m=\ell=0,\\[4pt] a_{m,\ell}\,\dfrac{z^{\ell-m}}{|z|^{2\ell-1}}+b_{m,\ell}\,\dfrac{z^{\ell-m+2}}{|z|^{2\ell+1}}, & m\le\ell,\ \ell\neq0,\\[4pt] \dfrac12+\dfrac18\,\dfrac{\overline z}{|z|}+\dfrac38\,\dfrac z{|z|}, & m=\ell+1=1,\\[4pt] a_{m,\ell}\,\dfrac z{|z|^{2\ell+1}}+b_{m,\ell}\,\dfrac{\overline z}{|z|^{2\ell+1}}, & m=\ell+1>1,\\[4pt] a_{m,\ell}\,\dfrac{\overline z^{\,m-\ell}}{|z|^{2m-1}}+b_{m,\ell}\,\dfrac{\overline z^{\,m-\ell-2}}{|z|^{2m-3}}, & m\ge\ell+2.\end{cases}$$

Here, the numbers $a_{m,\ell}$ and $b_{m,\ell}$ are non-zero and rational. Furthermore, note that all cases for possible choices of $m$ and $\ell$ are covered: the case $m\le\ell$ is split into $\ell=0$ and $\ell\neq0$, the case $m=\ell+1$ is split into $m=1$ and $m>1$, and then the case $m\ge\ell+2$ remains.

Proof. For the following we always assume $z\in\mathbb{C}$ with $z\neq0$. Then we can apply the identity

$$\cos(\angle z) = \frac{\operatorname{Re}(z)}{|z|} = \frac{z+\overline z}{2|z|}, \tag{A.5}$$

so we can rewrite

$$\mathrm{card}(z) = \frac z2 + \frac14\cdot\frac{(z+\overline z)\,z}{|z|} = \frac z2 + \frac14\,\frac{z^2}{|z|} + \frac{|z|}4.$$

This establishes the case $m=\ell=0$. Next, we compute

$$\overline\partial_{\mathrm{wirt}}(|z|) = \frac12\cdot\frac z{|z|} \tag{A.6}$$

and similarly

$$\partial_{\mathrm{wirt}}(|z|) = \frac12\cdot\frac{\overline z}{|z|}. \tag{A.7}$$

Then

$$\overline\partial_{\mathrm{wirt}}\,\mathrm{card}(z) \overset{(A.2),(A.6)}{=} \underbrace{\Big(-\frac18\Big)}_{=:b_{0,1}}\frac{z^3}{|z|^3} + \underbrace{\frac18}_{=:a_{0,1}}\,\frac z{|z|}.$$

Using induction, we derive

$$\overline\partial^{\ell+1}_{\mathrm{wirt}}\,\mathrm{card}(z) = a_{0,\ell}\,\overline\partial_{\mathrm{wirt}}\Big(\frac{z^\ell}{|z|^{2\ell-1}}\Big) + b_{0,\ell}\,\overline\partial_{\mathrm{wirt}}\Big(\frac{z^{\ell+2}}{|z|^{2\ell+1}}\Big) \overset{(A.2)}{=} \underbrace{a_{0,\ell}\,\Big(-\frac{2\ell-1}2\Big)}_{=:a_{0,\ell+1}}\frac{z^{\ell+1}}{|z|^{2\ell+1}} + \underbrace{b_{0,\ell}\,\Big(-\frac{2\ell+1}2\Big)}_{=:b_{0,\ell+1}}\frac{z^{\ell+3}}{|z|^{2\ell+3}},$$

so the claim has been shown if $m=0$. If we now fix any $\ell\in\mathbb{N}$ and assume that the claim holds true for some $m\in\mathbb{N}_0$ with $m<\ell$, we get

$$\partial^{m+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = a_{m,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac{z^{\ell-m}}{|z|^{2\ell-1}}\Big) + b_{m,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac{z^{\ell-m+2}}{|z|^{2\ell+1}}\Big) \overset{(A.3)}{=} \underbrace{a_{m,\ell}\,\Big(\frac12-m\Big)}_{=:a_{m+1,\ell}}\frac{z^{\ell-m-1}}{|z|^{2\ell-1}} + \underbrace{b_{m,\ell}\,\Big(\frac32-m\Big)}_{=:b_{m+1,\ell}}\frac{z^{\ell-m+1}}{|z|^{2\ell+1}},$$

so the claim holds true if $m\le\ell$. The case $m=\ell+1$ is split into the cases $m=1$ and $m>1$. If $m=1$, then $\ell=0$ and we compute

$$\partial_{\mathrm{wirt}}\,\mathrm{card}(z) \overset{(A.3),(A.5),(A.7)}{=} \frac12 + \frac38\,\frac z{|z|} + \frac18\,\frac{\overline z}{|z|}.$$

If $m>1$, we get

$$\partial^{\ell+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = a_{\ell,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac1{|z|^{2\ell-1}}\Big) + b_{\ell,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac{z^2}{|z|^{2\ell+1}}\Big) \overset{(A.3),(A.4)}{=} \underbrace{b_{\ell,\ell}\,\Big(\frac32-\ell\Big)}_{=:a_{\ell+1,\ell}}\frac z{|z|^{2\ell+1}} + \underbrace{a_{\ell,\ell}\,\Big(-\frac{2\ell-1}2\Big)}_{=:b_{\ell+1,\ell}}\frac{\overline z}{|z|^{2\ell+1}}.$$

Next, we deal with the case $m=\ell+2$: here, we see

$$\partial^{\ell+2}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = \partial_{\mathrm{wirt}}\Big(\frac{\delta_{\ell=0}}2 + a_{\ell+1,\ell}\,\frac z{|z|^{2\ell+1}} + b_{\ell+1,\ell}\,\frac{\overline z}{|z|^{2\ell+1}}\Big) \overset{(A.3),(A.4)}{=} \underbrace{b_{\ell+1,\ell}\,\Big(-\frac{2\ell+1}2\Big)}_{=:a_{\ell+2,\ell}}\frac{\overline z^{\,2}}{|z|^{2\ell+3}} + \underbrace{a_{\ell+1,\ell}\,\Big(\frac12-\ell\Big)}_{=:b_{\ell+2,\ell}}\frac1{|z|^{2\ell+1}}.$$

If we assume the claim to be true for a fixed $m\ge\ell+2$, we get

$$\partial^{m+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = a_{m,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac{\overline z^{\,m-\ell}}{|z|^{2m-1}}\Big) + b_{m,\ell}\,\partial_{\mathrm{wirt}}\Big(\frac{\overline z^{\,m-\ell-2}}{|z|^{2m-3}}\Big) \overset{(A.4)}{=} \underbrace{a_{m,\ell}\,\Big(-\frac{2m-1}2\Big)}_{=:a_{m+1,\ell}}\frac{\overline z^{\,m-\ell+1}}{|z|^{2m+1}} + \underbrace{b_{m,\ell}\,\Big(-\frac{2m-3}2\Big)}_{=:b_{m+1,\ell}}\frac{\overline z^{\,m-\ell-1}}{|z|^{2m-1}}.$$

Hence, using induction, we have proven the claimed identity.

The statement regarding the non-polyharmonicity of the complex cardioid is formulated in the following corollary.

Corollary A.6. For every $z\in\mathbb{C}$ with $z\notin\mathbb{R}\cup i\mathbb{R}$ and every $m,\ell\in\mathbb{N}_0$ we have $\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z)\neq0$. In particular, card is smooth and non-polyharmonic on $\mathbb{C}\setminus(\mathbb{R}\cup i\mathbb{R})$. Furthermore, card is continuous on the entire complex plane.

Proof. We start with the first part of the corollary. For the following, let $z\in\mathbb{C}$ with $z\notin\mathbb{R}\cup i\mathbb{R}$, i.e., $\operatorname{Re}(z)\neq0$ and $\operatorname{Im}(z)\neq0$. Using the definition of card, we see

$$\mathrm{card}(z) = 0 \iff z=0 \text{ or } \cos(\angle z) = -1 \iff z\in\mathbb{R}_{\le0},$$

and thus $\mathrm{card}(z)\neq0$ since $\operatorname{Im}(z)\neq0$. In the case $m\le\ell$, $\ell\neq0$ we use Theorem A.5 to get the following chain of implications:

$$\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = 0 \implies a_{m,\ell} + b_{m,\ell}\,\frac{z^2}{|z|^2} = 0 \implies z^2 = -|z|^2\,\frac{a_{m,\ell}}{b_{m,\ell}} \implies z^2\in\mathbb{R} \implies z\in\mathbb{R}\cup i\mathbb{R},$$

and thus $\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z)\neq0$ for $z\notin\mathbb{R}\cup i\mathbb{R}$. For the case $m=\ell+1=1$, taking imaginary parts yields

$$\partial_{\mathrm{wirt}}\,\mathrm{card}(z) = 0 \implies \frac14\cdot\frac{\operatorname{Im}(z)}{|z|} = 0 \implies \operatorname{Im}(z) = 0,$$

and thus $\partial_{\mathrm{wirt}}\,\mathrm{card}(z)\neq0$ since $z\notin\mathbb{R}$. For $m=\ell+1>1$, we see by considering the real and imaginary parts that

$$\partial^{\ell+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = 0 \implies a_{m,\ell}\,z + b_{m,\ell}\,\overline z = 0 \overset{\operatorname{Re}(z),\operatorname{Im}(z)\neq0}{\implies} a_{m,\ell}+b_{m,\ell} = 0 \text{ and } a_{m,\ell}-b_{m,\ell} = 0 \implies a_{m,\ell} = b_{m,\ell} = 0,$$

and hence $\partial^{\ell+1}_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z)\neq0$, since $a_{m,\ell}\neq0\neq b_{m,\ell}$ by Theorem A.5. Therefore, it remains to consider the case $m\ge\ell+2$.
But here we easily see

$$\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z) = 0 \implies a_{m,\ell}\,\frac{\overline z^{\,2}}{|z|^2} + b_{m,\ell} = 0 \implies \overline z^{\,2}\in\mathbb{R} \implies z\in\mathbb{R}\cup i\mathbb{R},$$

and hence $\partial^m_{\mathrm{wirt}}\overline\partial^\ell_{\mathrm{wirt}}\,\mathrm{card}(z)\neq0$. Since all cases have been considered, the claim follows. Regarding the second part of the corollary, first note that card is trivially continuous on $\mathbb{C}\setminus\{0\}$. Further note that

$$|\mathrm{card}(z)| = \frac12\,\big|1+\cos(\angle z)\big|\cdot|z| \le |z| \to 0\quad\text{as } z\to0,$$

showing the continuity of card on the entire complex plane.

B Postponed proofs concerning the approximation of polynomials

B.1 Divided Differences

Divided differences are well-known in numerical mathematics, as they are for example used to calculate the coefficients of an interpolation polynomial in its Newton representation. In our case, we are interested in divided differences since they can be used to obtain a generalization of the classical mean-value theorem for differentiable functions: given an interval $I\subseteq\mathbb{R}$ and $n+1$ pairwise distinct data points $x_0,\dots,x_n\in I$ as well as an $n$-times differentiable real-valued function $f:I\to\mathbb{R}$, there exists $\xi\in(\min\{x_0,\dots,x_n\},\max\{x_0,\dots,x_n\})$ such that

$$f[x_0,\dots,x_n] = \frac{f^{(n)}(\xi)}{n!},$$

where the left-hand side is a divided difference of $f$ (defined below). The classical mean-value theorem is obtained in the case $n=1$. Our goal in this section is to generalize this result to a multivariate setting by considering divided differences in multiple variables. Such a generalization is probably well-known, but since we could not locate a convenient reference, and to make the paper more self-contained, we provide a proof.

Let us first define divided differences formally. Given $n+1$ data points $(x_0,y_0),\dots,(x_n,y_n)\in\mathbb{R}\times\mathbb{R}$ with pairwise distinct $x_k$, we define the associated divided differences recursively via

$$[y_k] := y_k,\quad k\in\{0,\dots,n\},$$
$$[y_k,\dots,y_{k+j}] := \frac{[y_{k+1},\dots,y_{k+j}]-[y_k,\dots,y_{k+j-1}]}{x_{k+j}-x_k},\quad j\in\{1,\dots,n\},\ k\in\{0,\dots,n-j\}.$$

If the data points are defined by a function $f$ (i.e., $y_k = f(x_k)$ for all $k\in\{0,\dots,n\}$), we write $[x_k,\dots,x_{k+j}]f := [y_k,\dots,y_{k+j}]$.

We first consider an alternative representation of divided differences, the so-called Hermite–Genocchi formula. To state it, we introduce the notation $\Sigma_k$ to denote a certain $k$-dimensional simplex.

Definition B.1. Let $s\in\mathbb{N}$. Then we define

$$\Sigma_s := \Big\{x\in\mathbb{R}^s \;:\; x_\ell\ge0 \text{ for all } \ell \text{ and } \sum_{\ell=1}^s x_\ell\le1\Big\}.$$

We further set $\Sigma_0 := \{0\}\subseteq\mathbb{R}$. Denoting by $\lambda_s$ the $s$-dimensional Lebesgue measure (and by $\lambda_0$ the counting measure), the identity $\lambda_s(\Sigma_s) = \frac1{s!}$ holds true; see for instance [42], and note that the case $s=0$ can be seen directly. In the following, we consider integrals over the sets $\Sigma_s$. These integrals are always formed with respect to the measure $\lambda_s$; we only write, e.g., $\int_{\Sigma_s}f(x)\,dx$, with the implicit understanding that the integral is formed with respect to the measure $\lambda_s$.

Using this convention, we can now consider the alternative representation of divided differences.

Lemma B.2 (Hermite–Genocchi formula). Let $k\in\mathbb{N}_0$. For real numbers $a,b\in\mathbb{R}$ with $a<b$, a function $f\in C^k([a,b];\mathbb{R})$ (where $C^0([a,b];\mathbb{R})$ denotes the set of continuous functions) and pairwise distinct $x_0,\dots,x_k\in[a,b]$, the divided difference of $f$ at the data points $x_0,\dots,x_k$ satisfies the identity

$$[x_0,\dots,x_k]f = \int_{\Sigma_k} f^{(k)}\Big(x_0+\sum_{\ell=1}^k s_\ell\,(x_\ell-x_0)\Big)\,ds. \tag{B.1}$$

Proof. The case $k=0$ follows directly from $[x_0]f = f(x_0)$. For the case $k>0$, see [6, Theorem 3.3].
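The recursive definition and the mean-value property $f[x_0,\dots,x_n] = f^{(n)}(\xi)/n!$ are easy to check numerically; a small sketch (ours), using $f=\exp$ so that $f^{(n)}=\exp$:

```python
import numpy as np

def divided_difference(xs, ys):
    """Recursively compute [y_0, ..., y_n] for data (x_k, y_k) with distinct x_k."""
    if len(ys) == 1:
        return ys[0]
    return (divided_difference(xs[1:], ys[1:])
            - divided_difference(xs[:-1], ys[:-1])) / (xs[-1] - xs[0])

# [x_0,...,x_n] exp = exp(xi)/n! for some xi between the smallest and largest node,
# so the computed value must lie between the two bracketing values printed below.
xs = [0.0, 0.1, 0.25, 0.4]
dd = divided_difference(xs, [np.exp(x) for x in xs])
print(np.exp(xs[0]) / 6, dd, np.exp(xs[-1]) / 6)  # lower bound, value, upper bound
```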
Let f : D R be a continuous function with respect to the standard topology on R and the topology on D. Moreover, let f be measurable with respect to A and the Borel σ-algebra on R. Then there exists ξ D such that f(ξ) = 1 µ(D) Z D f(x)dµ(x). Proof. Since D is compact and connected and f continuous, it follows that f(D) R is compact and connected, and hence f(D) is a compact interval in R. Therefore, there exist xmin D and xmax D satisfying f (xmin) f(x) f (xmax) for all x D. Thus, one gets f (xmin) = 1 µ(D) D f(xmin)dµ(x) 1 µ(D) D f(x)dµ(x) D f(xmax)dµ(x) = f (xmax) . The claim now follows from the fact that f(D) is an interval. We also get a convenient representation of divided differences for the case of equidistant data points. Lemma B.4. Let f : R R, x0 R, k N0 and h > 0. We consider the case of equidistant data points, meaning xj := x0 + jh for all j = 1, ..., k. In this case, we have the formula [x0, ..., xk] f = 1 k!hk r=0 ( 1)k r k r f (xr) . (B.2) Proof. We prove the result via induction over the number j of considered data points, meaning the following: For all j {0, ..., k} we have [xℓ, ..., xℓ+j] f = 1 j!hj r=0 ( 1)j r j r for all ℓ {0, ..., k} satisfying ℓ+ j k. The case j = 0 is trivial. Therefore, we assume the claim to be true for a fixed j {0, ..., k 1}, and let ℓ {0, ..., k} be arbitrary with ℓ+ j + 1 k. We then get [xℓ, ..., xℓ+j+1] f = [xℓ+1, ..., xℓ+j+1] f [xℓ, ..., xℓ+j] f I.H. = 1 j!hj Pj r=0( 1)j r j r (f (xℓ+r+1) f (xℓ+r)) (j + 1)h = 1 (j + 1)!hj+1 r=0 ( 1)j r j r (f (xℓ+r+1) f (xℓ+r)) . Using an index shift, we deduce r=0 ( 1)j r j r r=0 ( 1)j r j r r=1 ( 1)j+1 r j r 1 r=0 ( 1)j+1 r j r = ( 1)j+1f (xℓ) + ( 1)j+1 rf (xℓ+r) j r 1 + f (xℓ+j+1) r=0 ( 1)j+1 r j + 1 r which yields the claim. The final result for divided differences, which is the result that is actually used in the proof of Theorem 3.1 in Appendix B.3, reads as follows: Theorem B.5. Let f : Rs R and k N0, r > 0, such that f ( r,r)s Ck (( r, r)s; R). For p Ns 0 with |p| k and h > 0 let fp,h := (2h) |p| X 0 r p ( 1)|p| |r| p r f (h(2r p)) . Let m := max j pj. Then, for 0 < h < r max{1,m} there exists ξ h[ m, m]s satisfying fp,h = pf(ξ). Proof. We may assume m > 0, since m = 0 implies p = 0 and thus fp,h = f(0), so that the claim holds for ξ = 0. We prove via induction over s N that the formula fp,h = p! Z ℓ=1 ℓσ(1) ℓ, ..., hps + 2h ℓ=1 ℓσ(s) ℓ dσ(1) dσ(s) (B.3) holds for all p Ns 0 with 1 |p| k and all 0 < h < r m. The case s = 1 is exactly the Hermite-Genocchi-Formula (B.1), combined with (B.2) applied to the data points hp, hp + 2h, ..., hp 2h, hp. By induction, assume that the claim holds for some s N. For a fixed point y ( r, r), let fy : ( r, r)s R, x 7 f(x, y). For p Ns+1 0 with |p| k and p := (p1, ..., ps), we define Γ : ( r, r) R, y 7 (fy)p ,h = (2h) |p | X 0 r p ( 1)|p | |r | p f (h(2r p ), y) . Using the induction hypothesis, we get Σp1 (p ,0)f ℓ=1 ℓσ(1) ℓ, ..., hps + 2h ℓ=1 ℓσ(s) ℓ, y dσ(1) dσ(s) for all y ( r, r). Furthermore, we calculate ps+1! [ h ps+1, h ps+1 + 2h, ..., h ps+1]Γ (B.2) = (2h) ps+1 r =0 ( 1)ps+1 r ps+1 r Γ h 2r ps+1 = (2h) ps+1 " ps+1 X r =0 ( 1)ps+1 r ps+1 r 0 r p ( 1)|p | |r | p f h(2r p ), h(2r ps+1) # = (2h) |p| X 0 r p ( 1)|p| |r| p r f (h(2r p)) On the other hand, we get [ h ps+1, h ps+1 + 2h, ..., h ps+1]Γ Σps+1 Γ(ps+1) ℓ=1 ℓσ(s+1) ℓ ℓ=1 ℓσ(1) ℓ, ..., hps+1 + 2h ℓ=1 ℓσ(s+1) ℓ dσ(1) dσ(s+1). 
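(As an aside before completing the multivariate induction: the one-dimensional ingredients used here — the recursion defining $[x_0, \ldots, x_k]f$, the equidistant formula (B.2), and the mean-value property $[x_0, \ldots, x_k]f = f^{(k)}(\xi)/k!$ — are easy to test numerically. The following is a minimal sketch; all function names are ours.)

```python
import math

def divided_diff(xs, ys):
    # recursive definition: [y_k,...,y_{k+j}] = ([y_{k+1},...] - [...,y_{k+j-1}]) / (x_{k+j} - x_k)
    if len(xs) == 1:
        return ys[0]
    return (divided_diff(xs[1:], ys[1:]) - divided_diff(xs[:-1], ys[:-1])) / (xs[-1] - xs[0])

f, k, x0, h = math.sin, 4, 0.3, 0.05
xs = [x0 + j * h for j in range(k + 1)]          # equidistant nodes
ys = [f(x) for x in xs]
dd = divided_diff(xs, ys)

# formula (B.2): [x_0,...,x_k]f = (k! h^k)^{-1} * sum_r (-1)^(k-r) * binom(k,r) * f(x_r)
b2 = sum((-1) ** (k - r) * math.comb(k, r) * ys[r]
         for r in range(k + 1)) / (math.factorial(k) * h ** k)
assert abs(dd - b2) < 1e-8

# mean-value property: k! * [x_0,...,x_k]f = f^(k)(xi) for some xi in (x_0, x_k);
# here f^(4) = sin, which is increasing on [0.3, 0.5], so the value lies between the endpoint values
assert f(xs[0]) - 1e-8 <= math.factorial(k) * dd <= f(xs[-1]) + 1e-8
```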
Interchanging the order of integration and derivative is possible, since we integrate on compact sets and only consider continuously differentiable functions (see, e.g., [11, Lemma 16.2]). We have thus proven (B.3) using the principle of induction. The claim of the theorem then follows directly using the mean-value theorem for integrals (Lemma B.3) applied to the topological space D := Σp1 Σps+1 equipped with the product topology where each factor is endowed with the standard topology on Σpℓ (where the standard topology on Σ0 is the discrete topology), the measure µ := λp1 λps+1 defined on the product of the Borel σ-algebras on Σpℓ(and the σ-algebra of {0} is its power set) and to the function (σ(1), ..., σ(s+1)) 7 pf ℓ=1 ℓσ(1) ℓ, ..., hps+1 + 2h ℓ=1 ℓσ(s+1) ℓ Moreover, note that all the simplices Σpℓare compact and connected (in fact convex) with λp1(Σp1) λps+1(Σps+1) = 1 see Definition B.1. Therefore, Lemma B.3 yields the existence of a certain (ξ(1), ..., ξ(s+1)) D with ℓ=1 ℓξ(1) ℓ, ..., hps+1 + 2h ℓ=1 ℓξ(s+1) ℓ Hence, the claim follows letting ℓ=1 ℓξ(1) ℓ, ..., hps+1 + 2h ℓ=1 ℓξ(s+1) ℓ B.2 Proof of Proposition 2.1 Proof of Proposition 2.1. Let M N0. Since ϕ is not polyharmonic we can pick z U with Mϕ(z) = 0. By continuity we can choose δ > 0 with Bδ(z) U and Mϕ(w) = 0 for all w Bδ(z). For all m, ℓ N0 let Am,ℓ:= n w Bδ(z) : m wirt ℓ wirtϕ(w) = 0 o and assume towards a contradiction that [ m,ℓ M Am,ℓ= Bδ(z). By [3, Corollary 3.35], Bδ(z) with its usual topology is completely metrizable. By continuity of m wirt ℓ wirtϕ the sets Am,ℓare closed in Bδ(z). Hence, using the Baire category theorem [3, Theorems 3.46 and 3.47], there are m, ℓ N0 with m, ℓ M, z Am,ℓand ε > 0 such that Bε (z ) Am,ℓ Bδ(z). But thanks to Mϕ = 4M M wirt M wirtϕ = 4M M ℓ wirt M m wirt ℓ wirt m wirtϕ (see property 6 in Appendix A.1), this directly implies Mϕ 0 on Bε (z ) in contradiction to the choice of Bδ(z). B.3 Proof of Theorem 3.1 The following section is dedicated to proving Theorem 3.1. We are going to show that it is possible to approximate complex polynomials in z and z arbitrarily well on Ωn using shallow complex-valued neural networks. To do so, we follow the proof sketch given after the statement of Theorem 3.1, starting with the following lemma. Lemma B.6. Let ϕ : C C and δ > 0, b C, k N0, such that ϕ Bδ(b) Ck (Bδ(b); C). For fixed z Ωn, where we recall that Ωn = [ 1, 1]n + i[ 1, 1]n, we consider the map 2n (0) C, w 7 ϕ w T z + b , which is in Ck since for w B δ 2n (0) Cn we have w T z w 2 z 2 < δ For all multi-indices m, ℓ Nn 0 with |m + ℓ| k we have m wirt ℓ wirtϕz(w) = zmzℓ |m| wirt |ℓ| wirtϕ w T z + b for all w B δ Proof. First we prove the statement ℓ wirtϕz(w) = zℓ ( |ℓ| wirtϕ) w T z + b for all w B δ 2n (0) (B.4) by induction over 0 |ℓ| k. The case ℓ= 0 is trivial. Assuming that (B.4) holds for fixed ℓ Nn 0 with |ℓ| < k, we want to show ℓ+ej wirt ϕz(w) = zℓ+ej |ℓ|+1 wirt ϕ w T z + b (B.5) for all w B δ 2n (0), where j {1, ..., n} is chosen arbitrarily. To this end, first note ℓ+ej wirt ϕz(w) = ej wirt ℓ wirtϕz(w) induction = ej wirt h w 7 zℓ |ℓ| wirtϕ w T z + b i = zℓ ej wirt h w 7 |ℓ| wirtϕ w T z + b i . 
Applying the chain rule for Wirtinger derivatives and using that ej wirt w 7 w T z + b = 0 since w 7 w T z + b is holomorphic in every variable, we see ej wirt h w 7 |ℓ| wirtϕ w T z + b i = wirt |ℓ| wirtϕ w T z + b ej wirt w 7 w T z + b |ℓ|+1 wirt ϕ w T z + b ej wirt h w 7 w T z + b i |ℓ|+1 wirt ϕ w T z + b ej wirt [w 7 w T z + b] |ℓ|+1 wirt ϕ w T z + b , using the fact that wj 7 w T z + b is holomorphic and hence ej wirt w 7 w T z + b = 0 and ej wirt w 7 w T z + b = zj. Thus, we have proven (B.5) and induction yields (B.4). It remains to show the full claim. We use induction over |m| and note that the case m = 0 has just been shown. We assume that the claim holds true for fixed m Nn 0 with |m + ℓ| < k and choose j {1, ..., n}. Thus, we get m+ej wirt ℓ wirtϕz(w) = ej wirt m wirt ℓ wirtϕz(w) IH= ej wirt w 7 zmzℓ |m| wirt |ℓ| wirtϕ w T z + b = zmzℓ ej wirt h w 7 |m| wirt |ℓ| wirtϕ w T z + b i . Using the chain rule again, we calculate ej wirt h w 7 |m| wirt |ℓ| wirtϕ w T z + b i = |m|+1 wirt |ℓ| wirtϕ w T z + b ej wirt w 7 w T z + b + |m| wirt |ℓ|+1 wirt ϕ w T z + b ej wirt h w 7 w T z + b i = zej |m|+1 wirt |ℓ| wirtϕ w T z + b + |m| wirt |ℓ|+1 wirt ϕ w T z + b ej wirt [w 7 w T z + b] = zej |m|+1 wirt |ℓ| wirtϕ w T z + b . By induction, this proves the claim. As the last preparation for the proof of Theorem 3.1, we need the following lemma. Lemma B.7. Let ϕ : C C and δ > 0, b C, k N0, such that ϕ Bδ(b) Ck (Bδ(b); C). Let m, n N and ε > 0. For p N2n 0 , h > 0 and z Ωn we write Φp,h(z) := (2h) |p| X 0 r p ( 1)|p| |r| p r (ϕz φn) (h(2r p)) = (2h) |p| X 0 r p ( 1)|p| |r| p r ϕ [φn (h(2r p))]T z + b , where ϕz is as introduced in Lemma B.6 and φn is as in (A.1). Furthermore, let ϕp : Ωn B δ 2n (0) C, (z, w) 7 pϕz(w). Then there exists h > 0 such that Φp,h ϕp( , 0) L (Ωn;C) ε for all p N2n 0 with |p| k and p m and h (0, h ). Proof. Fix p N2n 0 with |p| k and p m. The map 2n+1(0) Bδ/( 2n+1)(0) C, (z, w) 7 ϕ w T z + b is in Ck since w T z w z < δ 2n + 1) = δ. Therefore, the map 2n+1(0) Bδ/( 2n+1)(0) C, (z, w) 7 pϕz(w) is continuous and in particular uniformly continuous on the compact set Ωn Bδ/(3n)(0) B 2n+1(0) Bδ/( Here, we employed 2n + 1 < 3n for every n N. Hence, there exists hp (0, δ 3n 2n m), such that |ϕp(z, ξ) ϕp(z, 0)| ε 2 for all ξ φn h [ m, m]2n , h (0, hp) and z Ωn. Now fix such an h (0, hp) and z Ωn. Applying Theorem B.5 to both components of φ 1 1 Φp,h (z) and φ 1 1 ϕz φn ( δ separately yields the existence of two real vectors ξ1, ξ2 h [ m, m]2n, such that φ 1 1 Φp,h(z) 1 = p φ 1 1 ϕz φn (ξ1) 1 and φ 1 1 Φp,h(z) 2 = p φ 1 1 ϕz φn (ξ2) Rewriting this yields Re (Φp,h(z)) = Re (ϕp(z, φn (ξ1))) and Im (Φp,h(z)) = Im (ϕp(z, φn (ξ2))) . Using this property, we deduce |Re (Φp,h(z) ϕp(z, 0))| = |Re (ϕp(z, φn (ξ1)) ϕp(z, 0))| |ϕp(z, φn (ξ1)) ϕp(z, 0)| ε and analogously for the imaginary part. Since z Ωn and h (0, hp) have been chosen arbitrarily we get the claim by choosing h := min hp : p N2n 0 with |p| k and p m . Using the previous two lemmas and the results from Appendix B.1 and Appendix B.2, we can now prove Theorem 3.1. Proof of Theorem 3.1. Let b U satisfy ℓ1 wirt ℓ2 wirtϕ(b) = 0 for all ℓ1, ℓ2 N0 with ℓ1, ℓ2 mn. Such a point b exists according to Proposition 2.1. Let p P and fix m, ℓ Nn 0 with m, ℓ m. For each z Ωn, using Lemma B.6, we then have zmzℓ= h |m| wirt |ℓ| wirtϕ (b) i 1 m wirt ℓ wirtϕz(0) Prop. A.1 = h |m| wirt |ℓ| wirtϕ (b) i 1 X p=(p ,p ) N2n 0 p +p =m+ℓ bp ,p (p ,p )ϕz(0) (B.6) with suitably chosen complex coefficients bp ,p C depending only on p , p , m and ℓ. 
Here we used that $|m|, |\ell| \le mn$. Since $P^* \subseteq \mathcal{P}^n_m$ is bounded and $p \in P^*$, we can write
\[
p(z) = \sum_{\substack{m, \ell \in \mathbb{N}_0^n \\ \|m\|_\infty, \|\ell\|_\infty \le m}} a_{m,\ell}\, z^m \overline{z}^{\,\ell} \quad \text{with } |a_{m,\ell}| \le c
\]
for some constant $c = c(P^*) > 0$. In combination with (B.6), this easily implies that we can rewrite $p$ as
\[
p(z) = \sum_{\substack{p \in \mathbb{N}_0^{2n} \\ \|p\|_\infty \le 2m}} c_p(p)\, \partial^p \varphi_z(0) \tag{B.7}
\]
with coefficients $c_p(p) \in \mathbb{C}$ satisfying $|c_p(p)| \le c'$ for some constant $c' = c'(\varphi, b, P^*, m, n)$. By Lemma B.7, we choose $h > 0$ such that
\[
|\Phi_{p,h}(z) - \partial^p \varphi_z(0)| \le \frac{\varepsilon}{c' \cdot \#\{q \in \mathbb{N}_0^{2n} : \|q\|_\infty \le 2m\}}
\]
for all $z \in \Omega_n$ and $p \in \mathbb{N}_0^{2n}$ with $\|p\|_\infty \le 2m$. Furthermore, we can rewrite each function $\Phi_{p,h}$ as
\[
\Phi_{p,h}(z) = \sum_{\substack{\alpha \in \mathbb{Z}^{2n} \\ |\alpha_j| \le 2m\ \forall j}} \lambda_{\alpha,p}\, \varphi\big( [\varphi_n(h\alpha)]^T z + b \big)
\]
with suitable coefficients $\lambda_{\alpha,p} \in \mathbb{C}$. Since the cardinality of the set $\{\alpha \in \mathbb{Z}^{2n} : |\alpha_j| \le 2m\ \forall j\}$ is $(4m+1)^{2n}$, this can be converted to
\[
\Phi_{p,h}(z) = \sum_{j=1}^{(4m+1)^{2n}} \lambda_{j,p}\, \varphi\big( \rho_j^T z + b \big).
\]
For $p$ as in (B.7), we then define
\[
\theta(z) := \sum_{\substack{p \in \mathbb{N}_0^{2n} \\ \|p\|_\infty \le 2m}} c_p(p)\, \Phi_{p,h}(z) = \sum_{j=1}^{(4m+1)^{2n}} \sigma_j\, \varphi\big( \rho_j^T z + b \big)
\]
and note
\[
|\theta(z) - p(z)| \le \sum_{\|p\|_\infty \le 2m} |c_p(p)| \cdot |\Phi_{p,h}(z) - \partial^p \varphi_z(0)| \le \varepsilon.
\]
Since the coefficients $\rho_j$ have been chosen independently of the polynomial $p$, we can rewrite $\theta$ in the desired form.

C Postponed proofs for the approximation of Ck-functions

C.1 Prerequisites from Fourier Analysis

This section is dedicated to reviewing some notation and results from Fourier analysis. In the end, a quantitative result for the approximation of $C^k([-1,1]^s; \mathbb{R})$-functions using linear combinations of multivariate Chebyshev polynomials is derived; see Theorem C.15. We start by recalling several notations and concepts from Fourier analysis.

Definition C.1. Let $s \in \mathbb{N}$ and $k \in \mathbb{N}_0$. We define
\[
C^k_{2\pi}(\mathbb{R}^s; \mathbb{C}) := \big\{ f \in C^k(\mathbb{R}^s; \mathbb{C}) : \forall\, p \in \mathbb{Z}^s\ \forall\, x \in \mathbb{R}^s : f(x + 2\pi p) = f(x) \big\}
\]
and $C_{2\pi}(\mathbb{R}^s; \mathbb{C}) := C^0_{2\pi}(\mathbb{R}^s; \mathbb{C})$. For a function $f \in C^k_{2\pi}(\mathbb{R}^s; \mathbb{C})$ we write
\[
\|f\|_{C^k([-\pi,\pi]^s;\mathbb{C})} := \max_{\substack{\alpha \in \mathbb{N}_0^s \\ |\alpha| \le k}} \|\partial^\alpha f\|_{L^\infty([-\pi,\pi]^s;\mathbb{C})}
\quad \text{and} \quad
\|f\|_{L^p([-\pi,\pi]^s;\mathbb{C})} := \Big( \frac{1}{(2\pi)^s} \int_{[-\pi,\pi]^s} |f(x)|^p\, dx \Big)^{1/p}
\]
for $p \in [1, \infty)$. Moreover, we set $\|f\|_{L^\infty([-\pi,\pi]^s;\mathbb{C})} := \|f\|_{C^0([-\pi,\pi]^s;\mathbb{C})}$.

Definition C.2. For any $s \in \mathbb{N}$ and $k \in \mathbb{Z}^s$, we write
\[
e_k : \mathbb{R}^s \to \mathbb{C}, \quad e_k(x) = e^{i \langle k, x \rangle},
\]
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product of two vectors. For any $f \in C_{2\pi}(\mathbb{R}^s; \mathbb{C})$ we define the $k$-th Fourier coefficient of $f$ to be
\[
\hat{f}(k) := \frac{1}{(2\pi)^s} \int_{[-\pi,\pi]^s} f(x)\, \overline{e_k(x)}\, dx.
\]

Definition C.3. For two functions $f, g \in C_{2\pi}(\mathbb{R}^s; \mathbb{C})$, we define their convolution as
\[
f * g : \mathbb{R}^s \to \mathbb{C}, \quad (f * g)(x) := \frac{1}{(2\pi)^s} \int_{[-\pi,\pi]^s} f(t)\, g(x - t)\, dt.
\]

In the following we define several so-called kernels.

Definition C.4. Let $m \in \mathbb{N}_0$ be arbitrary.
1. The $m$-th one-dimensional Dirichlet kernel is defined as $D_m := \sum_{k=-m}^{m} e_k$.
2. For $m \in \mathbb{N}$, the $m$-th one-dimensional Fejér kernel is defined as $F_m := \frac{1}{m} \sum_{\ell=0}^{m-1} D_\ell$.
3. The $m$-th one-dimensional de-la-Vallée-Poussin kernel is defined as $V_m := (1 + e_m + e_{-m}) \cdot F_m$.
4. Let $s \in \mathbb{N}$. We extend the above definitions to dimension $s$ by letting
\[
D^s_m(x_1, \ldots, x_s) := \prod_{p=1}^s D_m(x_p), \quad F^s_m(x_1, \ldots, x_s) := \prod_{p=1}^s F_m(x_p), \quad V^s_m(x_1, \ldots, x_s) := \prod_{p=1}^s V_m(x_p).
\]

We need the following property of the multivariate extension of the de-la-Vallée-Poussin kernel.

Proposition C.5. Let $m, s \in \mathbb{N}$. Then one has $\|V^s_m\|_{L^1([-\pi,\pi]^s;\mathbb{C})} \le 3^s$.

Proof. From [34, Exercise 1.3 and Lemma 1.4] it follows that $\|F_m\|_{L^1([-\pi,\pi];\mathbb{C})} = 1$, and hence, using the triangle inequality, $\|V_m\|_{L^1([-\pi,\pi];\mathbb{C})} \le 3$. The claim then follows using Tonelli's theorem.

The following definition introduces the notion of a trigonometric polynomial.

Definition C.6. For any $s \in \mathbb{N}$ and $m \in \mathbb{N}_0$ we call a function of the form
\[
\mathbb{R}^s \to \mathbb{C}, \quad x \mapsto \sum_{\substack{k \in \mathbb{Z}^s \\ -m \le k \le m}} a_k\, e_k(x)
\]
with coefficients $a_k \in \mathbb{C}$ a trigonometric polynomial of coordinatewise degree at most $m$, and we denote the space of all such functions by $\mathcal{H}^s_m$. Here, the sum is over all $k \in \mathbb{Z}^s$ with $-m \le k_j \le m$ for all $j \in \{1, \ldots, s\}$. We then write
\[
E^s_m(f) := \min_{T \in \mathcal{H}^s_m} \|f - T\|_{L^\infty(\mathbb{R}^s;\mathbb{C})} \tag{C.1}
\]
for any function $f \in C_{2\pi}(\mathbb{R}^s; \mathbb{C})$.
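Before turning to the associated convolution operator, both the kernel bound of Proposition C.5 and the reproduction property $e_k * V_m = e_k$ for $|k| \le m$ (established as (C.4) below) can be checked numerically. The following one-dimensional sketch is our own code; it uses the normalized measure $dx/(2\pi)$ from Definition C.3, approximated by equispaced Riemann sums, which are exact for trigonometric polynomials of degree below the grid size.

```python
import numpy as np

m, N = 6, 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)   # equispaced grid over one period

def dirichlet(deg, t):
    return sum(np.exp(1j * k * t) for k in range(-deg, deg + 1))

def fejer(deg, t):
    return sum(dirichlet(l, t) for l in range(deg)) / deg

def vallee_poussin(deg, t):
    return (1.0 + np.exp(1j * deg * t) + np.exp(-1j * deg * t)) * fejer(deg, t)

# Proposition C.5 with s = 1: ||V_m||_{L^1} <= 3 w.r.t. the normalized measure dx/(2*pi)
V = vallee_poussin(m, x)
assert np.abs(V).mean() <= 3 + 1e-9

# reproduction property (C.4): (e_k * V_m)(x0) = e_k(x0) for all |k| <= m
x0 = 0.7
for k in range(-m, m + 1):
    conv = (np.exp(1j * k * x) * vallee_poussin(m, x0 - x)).mean()
    assert abs(conv - np.exp(1j * k * x0)) < 1e-8
```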
The following proposition shows that convolving with the Fejèr kernel produces a trigonometric polynomial of degree at most 2m 1, while reproducing trigonometric polynomials of degree m. Furthermore, the norm of the convolution operator is bounded uniformly in m. These properties will be useful for our proof of Theorem C.15. Proposition C.7. Let s, m N and k N0. The map vm : C2π (Rs; C) Hs 2m 1, f 7 f V s m is well-defined and satisfies vm(T) = T for all T Hs m. (C.2) Furthermore, there exists a constant c = c(s) > 0 (independent of m), such that vm(f) Ck([ π,π]s;C) c f Ck([ π,π]s;C) f Ck 2π (Rs; C) , vm(f) L ([π,π]s;C) c f L ([ π,π]s;C) f C2π (Rs; C) . (C.3) In fact, it holds c(s) exp(C s) with an absolute constant C > 0. Proof. A direct computation shows that f ek = ˆf(k) ek. This implies that vm is well-defined since V s m is a trigonometric polynomial of coordinatewise degree at most 2m 1. The operator is bounded on Ck 2π(Rs; C) and C2π(Rs; C) with norm at most c = 3s, as follows from Young s inequality [34, Lemma 1.1 (ii)], Proposition C.5, and the fact that one has for all k Ns 0 with |k| k the identity k (f V s m) = kf V s m for f Ck 2π(Rs; C). It remains to show that vm is the identity on Hs m. We first prove that ek Vm = ek (C.4) holds for all k Z with |k| m. First note that ek Vm = ek Fm + ek (em Fm) + ek (e m Fm) . We then compute ℓ=0 Dℓ ek = 1 h= ℓ eh ek | {z } =δk,h ek ℓ=|k| ek = m |k| Similarly, we get ek (em Fm) = 1 ℓ=0 (em Dℓ) ek = 1 h= ℓ eh+m ek | {z } =δk,h+m ek 0 ℓ m 1 ℓ m k ek = δk 1 k ek (e m Fm) = 1 ℓ=0 (e m Dℓ) ek = 1 h= ℓ eh m ek | {z } =δk,h m ek 0 ℓ m 1 ℓ k+m ek = δk 1 k Adding up those three identities yields (C.4). To finally prove (C.2), it clearly suffices to show that ek V s m = ek for all k Zs with m k m. But for such k, using ek(x) = Qs j=1 ekj (xj), one obtains (ek V s m) (x) = 1 (2π)s j=1 ekj (tj) Vm (xj tj) dt ekj Vm (xj) (C.4) = j=1 ekj (xj) = ek(x) for any x Rs, as was to be shown. The following result follows from a theorem in [29]. Proposition C.8. Let s, k N. Then there exists a constant c = c(s, k) > 0, such that, for Es m as defined in (C.1), Es m(f) c mk f Ck([ π,π]s;R) for all m N and f Ck 2π (Rs; R). In fact, it holds c(s, k) exp(C ks) kk with an absolute constant C > 0. Proof. We apply [29, Theorem 6.6] with ni = m and pi = k, which yields the existence of a constant c1 = c1(s, k) > 0, such that for all m N and f Ck 2π(Rs; R), where ωℓ(f, ) denotes the modulus of continuity of kf xk ℓwith respect to xℓ, where we have the trivial bound 2 f Ck([ π,π]s;R) . Hence, we get Es m(f) c1 s 2 f Ck([ π,π]s;R) 1 mk , so the claim follows by choosing c := 2s c1. We refer to Appendix C.2 (see Theorem C.18) for a proof of the claimed bound on the constant c(s, k). The above proposition bounds the best possible error of approximating f by trigonometric polynomials of coordinatewise degree at most m, but this is in general non-constructive. Our next result shows that a similar bound holds for approximating f by vm(f). Theorem C.9. Let s N. Then there exists a constant c = c(s) > 0, such that the operator vm from Proposition C.7 satisfies f vm(f) L (Rs) c Es m(f) for any m N and f C2π (Rs; C). In fact, it holds c(s) exp(C s) with an absolute constant C > 0. Proof. For any T Hs m one has f vm(f) L (Rs) (C.2) f T L (Rs)+ vm(T) vm(f) L (Rs) (C.3) (c+1) f T L (Rs) . Taking the infimum over all T Hs m yields the claim. By combining Proposition C.8 and Theorem C.9, we immediately get the following bound. Corollary C.10. Let s, k N0. 
Then there exists a constant c = c(s, k) > 0, such that f vm(f) L (Rs) c mk f Ck([ π,π]s;R) for every m N and f Ck 2π (Rs; R). In fact, we have c(s, k) exp(C ks) kk with an absolute constant C > 0. Up to now, we have studied the approximation of periodic functions by trigonometric polynomials, but our actual goal is to approximate non-periodic functions by algebraic polynomials. The next lemma establishes a connection between the two settings. Lemma C.11. Let k N0 and s N. For any function f Ck ([ 1, 1]s; C), we define the corresponding periodic function via f : Rs C, f (x1, ..., xs) = f(cos (x1) , ..., cos (xs)) and note f Ck 2π (Rs; C). The map Ck ([ 1, 1]s; C) Ck 2π (Rs; C) , f 7 f is a continuous linear operator with respect to the Ck-norms on Ck ([ 1, 1]s; C) and Ck 2π (Rs; C). The operator norm can be bounded from above by kk. Proof. The map is well-defined since cos is a smooth function and 2π-periodic. The linearity of the operator is obvious, so it remains to show its continuity. The goal is to apply the closed graph theorem [21, Theorem 5.12]. By definition of f , and since cos : [ π, π] [ 1, 1] is surjective, we have the equality f L ([ 1,1]s;C) = f L ([ π,π]s;C). Let then (fn)n N be a sequence of functions fn Ck ([ 1, 1]s; C) and g Ck 2π (Rs; C) such that fn f in Ck ([ 1, 1]s; C) and f n g in Ck 2π (Rs; C). We then have f g L ([ π,π]s) f f n L ([ π,π]s) + f n g L ([ π,π]s) = f fn L ([ 1,1]s;C) + f n g L ([ π,π]s) f fn Ck([ 1,1]s;C) + f n g Ck([ π,π]s;C) 0 (n ). It follows f = g and the closed graph theorem yields the desired continuity. We refer to Appendix C.2 (see Theorem C.19 and Remark C.20) for a proof of the claimed bound on the operator norm. For a function f Ck([ 1, 1]s; C) we want to express vm(f ) in a convenient way, involving a product of cosines. To this end, we make use of the following identity, which is a generalization of the well-known product-to-sum formula for cos. Lemma C.12. Let s N. Then it holds for any x Rs that j=1 cos(xj) = 1 σ { 1,1}s cos( σ, x ). Proof. This is an inductive generalization of the product-to-sum formula 2 cos(x) cos(y) = cos(x y) + cos(x + y) (C.5) for x, y R, which can be found for instance in [1, Eq. 4.3.32]. The case s = 1 holds since cos is an even function. Assume that the claim holds for a fixed s N and take x Rs+1. Writing x = (x1, ..., xs), we derive j=1 cos(xj) = σ { 1,1}s cos( σ, x ) σ { 1,1}s cos( σ, x ) cos(xs+1) (C.5) = 1 2s+1 X σ { 1,1}s [cos( σ, x + xs+1) + cos( σ, x xs+1)] σ { 1,1}s+1 cos( σ, x ), as was to be shown. The following proposition states that vm(f ) can be expressed as a linear combination of products of cosines. This representation is useful since these cosines can be interpolated by Chebyshev polynomials which in the end leads to the desired approximation result. Proposition C.13. Let s N and k N0. For any f Ck ([ 1, 1]s; C) and m N the de-la Vallée-Poussin operator given as f 7 vm (f ) with vm as in Proposition C.7 and f 7 f as in Lemma C.11 has a representation vm (f ) (x1, ..., xs) = X k Ns 0 k 2m 1 j=1 cos (kjxj) for continuous linear functionals Vm k : Ck ([ 1, 1]s; C) C, f 7 2 k 0 am k c f (k), where k 0 = #{j {1, ..., s} : kj = 0} and am k = c V s m(k). Furthermore, if f Ck([ 1, 1]s; R), then Vm k (f) R for every k Ns 0 with k 2m 1. Proof. First of all, it is easy to see that vm (f ) is even in each variable, which follows directly from the fact that f and V s m are both even in each variable. 
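(Aside: the substitution $f^*(x) = f(\cos(x_1), \ldots, \cos(x_s))$ and the product-to-sum identity of Lemma C.12 drive the computation that follows; the identity is easy to mistype by a sign or a power of $2$, so a quick numerical check can be reassuring. The snippet below is our own code.)

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)
for s in (1, 2, 3, 4):
    x = rng.uniform(-math.pi, math.pi, size=s)
    lhs = np.prod(np.cos(x))
    # Lemma C.12: prod_j cos(x_j) = 2^{-s} * sum over sigma in {-1,1}^s of cos(<sigma, x>)
    rhs = sum(math.cos(float(np.dot(sigma, x)))
              for sigma in itertools.product((-1, 1), repeat=s)) / 2 ** s
    assert abs(lhs - rhs) < 1e-12
```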
Furthermore, if we write k Zs (2m 1) k 2m 1 with appropriately chosen coefficients am k R, we easily see vm (f ) = X k Zs (2m 1) k 2m 1 am k c f (k)ek. Using Euler s identity and the fact that vm (f ) is an even function, we get the representation vm (f ) (x) = X k Zs (2m 1) k 2m 1 am k c f (k)cos( k, x ) for all x Rs. Using to denote the componentwise product of two vectors of the same size, i.e., x y = (xi yi)i, and using the identity k, σ x = σ, k x , we see since vm (f ) is even in every variable that vm (f ) (x) = 1 σ { 1,1}s vm (f ) (σ x) k Zs (2m 1) k 2m 1 am k c f (k)cos( k, σ x ) k Zs (2m 1) k 2m 1 am k c f (k) 1 σ { 1,1}s cos( σ, k x ) Lemma C.12 = X k Zs (2m 1) k 2m 1 am k c f (k) j=1 cos (kjxj) k Ns 0 k 2m 1 2 k 0am k c f (k) j=1 cos (kjxj) with k 0 := # j {1, ..., s} : kj = 0 . In the last step we again used that cos is an even function and that c f (k) = c f (σ k) for all σ { 1, 1}s, which also follows easily since f and V s m are even in every component. Letting Vm k (f) := 2 k 0am k c f (k), we have the desired form. The fact that Vm k is a continuous linear functional on Ck 2π ([ 1, 1]s; C) follows directly since f 7 c f (k) is a continuous linear functional for every k. If f is real-valued, so is c f (k) for every k Ns 0 with k 2m 1, since f is real-valued and even in every component. Our main approximation result involves linear combinations of Chebyshev polynomials where the coefficients in this linear combination are given as Vm k (f). It is therefore important to be able to bound the sum of the absolute values |Vm k (f)|. Lemma C.14. Let s N. There exists a constant c = c(s) > 0, such that the inequality X k Ns 0 k 2m 1 |Vm k (f)| c ms/2 f L ([ 1,1]s;C) holds for all m N and f C ([ 1, 1]s; C), where Vm k is as in Proposition C.13. In fact, we have c(s) exp(C s) with an absolute constant C > 0. Proof. Let f C ([ 1, 1]s; C) and m N. For any multi-index ℓ Ns 0, it follows from Proposition C.13 that \ vm (f )(ℓ) = X k Ns 0 k 2m 1 Vm k (f) bgk(ℓ), gk : Rs R, (x1, ..., xs) 7 j=1 cos (kjxj) . Now, a calculation using Fubini s theorem and using 1 2 ekj + e kj = Y 1 j s kj =0 1 2 ekj + e kj for any number k N0 shows bgk(ℓ) = 1 2 k 0 , k = ℓ, 0, otherwise for k, ℓ Ns 0. Therefore, we have the bound |Vm ℓ(f)| 2s \ vm (f )(ℓ) for ℓ Ns 0 with |ℓ| 2m 1. Using the Cauchy-Schwarz and the Parseval inequality, we therefore see k Ns 0 k 2m 1 |Vm k (f)| 2s X k Ns 0 k 2m 1 \ vm (f )(k) CS 2s (2m)s/2 k Ns 0 k 2m 1 \ vm (f )(k) 2 Parseval 2s 2s/2 ms/2 vm (f ) L2([ π,π]s;C) 2s 2s/2 | {z } =:c1(s) ms/2 vm (f ) L ([ π,π]s;C) . Using Proposition C.7, we get a constant c2(s) exp(C0 s) such that X k Ns 0 k 2m 1 |Vm k (f)| c1(s) c2(s) ms/2 f L ([ π,π]s;C) = c(s) ms/2 f L ([ 1,1]s;C) , as claimed. For any natural number ℓ N0, we denote by Tℓthe ℓ-th Chebyshev polynomial, satisfying Tℓ(cos(x)) = cos(ℓx), x R. One can show that Tℓis in fact a polynomial of degree ℓ. For a multi-index k Ns 0, we define j=1 Tkj (xj) , x Rs. We then get the following approximation result about approximating (non-periodic) Ck-functions by linear combinations of Chebyshev polynomials. Theorem C.15. Let k, s, m N. Then there exists a constant c = c(s, k) > 0 with the following property: For any f Ck ([ 1, 1]s; R) the polynomial P defined as k Ns 0 k 2m 1 Vm k (f) Tk(x), with Vm k as in Proposition C.13, satisfies f P L ([ 1,1]s;R) c mk f Ck([ 1,1]s;R) . Here, the maps C ([ 1, 1]s; R) R, f 7 Vm k (f) are continuous and linear functionals with respect to the L -norm. 
Furthermore, there exists a constant c = c(s) > 0, such that the inequality X k Ns 0 k 2m 1 |Vm k (f)| c ms/2 f L ([ 1,1]s;R) holds for all f C ([ 1, 1]s; R). Moreover, we have c(s, k) exp(C ks) k2k and c(s) exp(C s) with an absolute constant C > 0. Proof. We choose the constant c0 = c0(s, k) according to Corollary C.10. Let f Ck ([ 1, 1]s; R) be arbitrary. Then we define the corresponding function f Ck 2π (Rs; R) as above. Let P be defined as in the statement of the theorem. Then it follows from the definition of the Chebyshev polynomials Tk, the definition of P, and the formula for vm(f ) from Proposition C.13 that P (x) = vm (f ) (x) is satisfied, where P is the corresponding function to P defined similarly to f . Overall, we get the bound f P L ([ 1,1]s;R) = f P L ([ π,π]s;R) Cor. C.10 c0 mk f Ck([ π,π]s;R) . The first claim then follows using the continuity of the map f 7 f as proven in Lemma C.11. The second part of the theorem has already been proven in Lemma C.14. C.2 Details on bounding the constant c in Theorem 3.2 In this appendix we provide details on the bound of the constant c that appears in the formulation of Theorem 3.2. Specifically, we perform a careful investigation of several results from [29] to get an upper bound for the constant appearing in Proposition C.8. Moreover, we analyze the operator norm of the operator Ck([ 1, 1]s; C) Ck 2π(Rs; C), f 7 f with f (x) := f(cos(x1), ..., cos(xs)) appearing in Lemma C.11 and show that it is bounded from above by kk. We start with the analysis of some bounds in [29, Chapter 4.3]. Here, a generalization of Jackson s kernel is defined for any m, r N as Lm,r(t) := λ 1 m,r sin(mt/2) where λm,r is chosen such that Z [ π,π] Lm,r(t) dt = 1. The first two important bounds are provided in the following proposition. Proposition C.16. Let m, r N. Then it holds λ 1 m,r exp(C r) m1 2r and Z [0,π] tk Lm,r(t) dt exp(C r) m k for any k 2r 2, with an absolute constant C > 0. Proof. Since Lm,r 0 and since sin(t/2) t/2 for t [0, π], we get 2r dt = 22r Z 2r du m2r 1 Z Here, we employed the inequality sin(u) 2 πu for u [0, π/2] in the penultimate step.1 This shows the first part of the claim. 1To see that this inequality holds, note that sin (u) = sin(u) 0 for u [0, π/2], so that sin is concave on that interval, and hence sin(u) = sin((1 2u For the second part, we again use the estimate sin(u) 2 πu for u [0, π/2] to compute Z [0,π] tk Lm,r(t) dt = λ 1 m,r Z [0,π] tk sin(mt/2) 2r dt λ 1 m,r Z [0,π] tk sin(mt/2) = λ 1 m,r π2r Z [0,π] tk 2r sin(mt/2)2r dt = λ 1 m,r π2r Z k 2r sin(u)2r du 2 λ 1 m,r π2r m2r 1 k Z [0,πm/2] uk 2r sin(u)2r du exp(C1 r) Z [0, ) uk 2r sin(u)2r du m k with an absolute constant C1 > 0. Here, we employed the first part of this proposition. It remains to bound the integral. This is done via Z [0, ) uk 2r sin(u)2r du = Z [0,1] uk 2r sin(u)2r du + Z [1, ) uk 2r sin(u)2r du [0,1] uk sin(u) [1, ) u 2 du C2 with an absolute constant C2 > 0. This proves the claim. The proof in [29] proceeds by defining Km,r(t) := Lm ,r(t), m = jm Proposition C.16 shows for k 2r 2 that Z [0,π] tk Km,r(t) dt exp(C r) (m ) k. r we infer Z [0,π] tk Km,r(t) dt exp(C r) r k exp(C r) rk m k (C.6) with an absolute constant C > 0. We can now quantify the constant appearing in [29, Theorem 4.3]. Theorem C.17 (cf. [29, Theorem 4.3]). Let k, m N and f Ck 2π(R; R). Let ω(f (k), 1/m) := max x R,|t| 1/m|f (k)(x + t) f (k)(x)|. Then it holds E1 m(f) (exp(C k) kk) m k ω(f (k), 1/m). 
Here, we recall that E1 m(f) denotes the best possible approximation error when approximating f using trigonometric polynomials of degree m; see Equation (C.1). Proof. We follow the proof of [29, Theorem 4.3]. Take r = k + 1 and define [ π,π] Km,r(t) ℓ=1 ( 1)ℓ k + 1 ℓ f(x + ℓt) dt. Then it is shown in the proof of [29, Theorem 4.3] that Im is a trigonometric polynomial of degree at most m and that |f(x) Im(x)| 2 ωk+1(f, 1/m) Z [0,π] (mt + 1)k+1Km,r(t) dt. Here, ωk+1(f, 1/m) denotes the modulus of smoothness of f as defined on [29, p. 47]. The integral can be bounded via Z [0,π] (mt + 1)k+1Km,r(t) dt [0,1/m] (mt + 1 | {z } 2 )k+1Km,r(t) dt + Z [1/m,π] (mt + 1 | {z } 2mt )k+1Km,r(t) dt [ π,π] Km,r(t) dt +2k+1mk+1 Z [0,π] tk+1Km,r(t) dt (C.6) 2k+1 + 2k+1mk+1 exp(C1 r) (k + 1)k+1 m (k+1) r 2k exp(C k) kk with absolute constants C, C1 > 0. Since ωk+1(f, 1/m) m k ω(f (k), 1/m) follows from [29, Equation 3.6(5)], the claim is shown. Therefore, we can bound the constant appearing in [29, Theorem 4.3] by exp(C k) kk. It remains to deal with the approximation of multivariate periodic functions by multivariate trigonometric polynomials which is contained in [29, Theorem 6.6]. Theorem C.18 (cf. [29, Theorem 6.6]). Let s, k N and f Ck 2π(Rs; R). Let ωj denote the modulus of continuity of kf xk j for j = 1, ..., s. Then, with Es m as introduced in Equation (C.1), it holds Es m(f) exp(C ks) kk m k s X j=1 ωj(1/m), with an absolute constant C > 0. Proof. We follow the proof of [29, Theorem 6.6] with pj = k and nj = m for every index j = 1, ..., s. For j = 1, ..., s + 1 define the set Tj consisting of all functions g C2π(Rs; R) that are a trigonometric polynomial in xℓof degree at most m for ℓ< j; in xℓfor ℓ j they should have continuous partial derivatives pg xp ℓfor 0 p k; the modulus of continuity of kg xk ℓshould not exceed 2Kjωj, where Kj = (j 1)(k + 1) 2ks for j > 1 and K1 = 1 2ks. Then it is shown that if j {1, ..., s} and fj Tj there exists a function fj+1 Tj+1 for which fj fj+1 L (Rs;R) exp(C1 k) kk m k 22ks ωj(1/m) exp(C2 ks) kk m k ωj(1/m) for absolute constants C1, C2 > 0. This is an application of Theorem C.17. Hence, defining f1 := f, we see f fs+1 L (Rs;R) j=1 fj fj+1 L (Rs;R) exp(C2 ks) kk m k j=1 ωj(1/m). Therefore, we have shown that the constant appearing in [29, Theorem 6.6] can be bounded from above by exp(C ks) kk. In the rest of this section, we discuss the operator norm of the operator defined in Lemma C.11. In Lemma C.11 the closed graph theorem is used to show that the operator is bounded. However, the closed graph theorem does not provide any bound on the norm of the operator. Therefore, in order to quantify the operator norm, we need to apply a different technique, which is Faa di Bruno s formula. This formula is a generalization of the chain rule to higher order derivatives. Theorem C.19. Let s, k N. We define the operator T : Ck([ 1, 1]s; C) Ck 2π([ π, π]s; C), (Tf)(x1, ..., xs) := f(cos(x1), ..., cos(xs)). Let α Ns 0 with |α| k. Then, for any f Ck([ 1, 1]s; C), we have α(Tf) L ([π,π]s;C) j=1 ααj j f C|α|([ 1,1]s). Proof. The proof is by induction over s. The case s = 1 is an application of Faa di Bruno s formula: We can write Tf = f g with g(x) = cos(x). We then take ℓ N0 with ℓ k and some x [ π, π]. The set partition version of Faa di Bruno s formula (see for instance [24, p. 219]) then yields (f g)(ℓ)(x) X f (|π|)(g(x)) Y Here, Πℓdenotes the set of all partitions of the set {1, ..., ℓ}. 
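The counting step that follows relies on the Bell-number bound $|\Pi_\ell| \le \ell^\ell$ from [13, Theorem 2.1]. For small $\ell$ this is easy to confirm directly via the Bell-triangle recurrence; the sketch below is our own code.

```python
def bell_numbers(n):
    # Bell triangle: returns [B_0, ..., B_n], where B_l = |Pi_l| is the number
    # of set partitions of {1, ..., l}; each row starts with the previous row's
    # last entry, and each further entry adds the entry above it.
    bells, row = [1], [1]
    for _ in range(n):
        new = [row[-1]]
        for v in row:
            new.append(new[-1] + v)
        row = new
        bells.append(row[0])
    return bells

for l, b in enumerate(bell_numbers(8)):   # B_l = 1, 1, 2, 5, 15, 52, ...
    assert b <= max(1, l ** l)            # bound used in the proof (read 0^0 = 1)
```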
Since all derivatives of g are bounded by 1 in absolute value and |π| ℓfor every partition π Πℓwe get (f g)(ℓ) L ([ π,π];C) |Πℓ| f Cℓ([ 1,1];C). The number |Πℓ| is the number of possible partitions of the set {1, ..., ℓ} and is the so-called ℓ-th Bell number. It can be bounded from above by ℓℓ(see [13, Theorem 2.1]). This proves the case s = 1. We now assume that the claim holds for an arbitrary but fixed s N. Take α Ns+1 0 with |α| k. We decompose α = (α , αs+1) with α Ns 0. For a fixed variable ys+1 [ 1, 1], we define fys+1(y1, ..., ys) := f(y1, ..., ys, ys+1) for (y1, ..., ys) [ 1, 1]s. We denote g(x1, ..., xs+1) := (cos(x1), ..., cos(xs+1)), gs(x1, ..., xs) := (cos(x1), ..., cos(xs)) and θ(xs+1) := cos(xs+1). For every (x1, ..., xs+1) [ π, π]s+1 it then holds (f g)(x1, ..., xs+1) = fθ(xs+1) gs (x1, ..., xs). We now differentiate f g with respect to the multiindex α and get [ α(f g)] (x1, ..., xs+1) = αs+1 h α fθ(xs+1) gs (x1, ..., xs) i = (hx1,...,xs θ)(αs+1)(xs+1) where we define hx1,...,xs(ys+1) := α fys+1 gs (x1, ..., xs) for (x1, ..., xs) [ π, π]s and ys+1 [ 1, 1]. Using the case s = 1, we get |[ α(f g)] (x1, ..., xs+1)| = (hx1,...,xs θ)(αs+1)(xs+1) ααs+1 s+1 hx1,...,xs Cαs+1([ 1,1];C) for any fixed (x1, ..., xs) [ π, π]s. It remains to bound hx1,...,xs Cαs+1([ 1,1];C). To this end, we fix ℓ N0 with ℓ αs+1. We further denote Fy1,...,ys(ys+1) := f(y1, ..., ys, ys+1) for (y1, ..., ys+1) [ 1, 1]s+1. For arbitrary (x1, ..., xs) [ π, π]s and ys+1 [ 1, 1] we then see h(ℓ) x1,...,xs(ys+1) = α h (x1, ..., xs) 7 F (ℓ) gs(x1,...,xs)(ys+1) i = α Hys+1 gs (x1, ..., xs) where Hys+1(y1, ..., ys) := F (ℓ) y1,...,ys(ys+1) for (y1, ..., ys) [ 1, 1]s. Hence, we see by induction that h(ℓ) x1,...,xs(ys+1) = α Hys+1 gs (x1, ..., xs) IH j=1 ααj j Hys+1 C|α |([ 1,1]s;C) j=1 ααj j f C|α|([ 1,1]s+1;C) as was to be shown. Remark C.20. For a multiindex α Ns 0 with |α| k we see j=1 ααj j k Ps j=1 αj kk. Hence, the norm of the operator introduced in Lemma C.11 can be bounded from above by kk. C.3 Proof of Theorem 3.2 For any natural number ℓ N0, we denote by Tℓthe ℓ-th Chebyshev polynomial, satisfying Tℓ(cos(x)) = cos(ℓx), x R. For a multi-index k Ns 0 we define j=1 Tkj (xj) , x [ 1, 1]s. The proof of Theorem 3.2 relies on the fact that Ck-functions can by approximated at a certain rate using linear combinations of the Tk (see Theorem C.15). We also refer to Figure 4 for an illustration of the overall proof strategy of Theorem 3.2. Proof of Theorem 3.2. Choose M N as the largest integer for which (16M 7)2n m, where we assume without loss of generality that 92n m, which can be done by choosing σj = 0 for all j {1, ..., m} for m < 92n, at the cost of possibly enlarging c. First we note that by the choice of M the inequality m (16M + 9)2n holds true. Since 16M + 9 25M, we get m 252n M 2n or equivalently 25 M. (C.7) According to Theorem C.15 we choose a constant c1 = c1(n, k) with the property that for any function f Ck [ 1, 1]2n; R there exists a polynomial 0 k 2M 1 VM k (f) Tk of coordinatewise degree at most 2M 1 satisfying f P L ([ 1,1]2n;R) c1 M k f Ck([ 1,1]2n;R) . Furthermore, according to Theorem C.15, we choose a constant c2 = c2(n), such that the inequality X VM k (f) c2 M n f L ([ 1,1]2n;R) c2 M n f Ck([ 1,1]2n;R) holds for all f Ck [ 1, 1]2n; R . The final constant is then defined to be c = c(n, k) := 2 25k (c1 + c2) . Fix k 2M 1. 
Since Tk is a polynomial of componentwise degree less or equal to 2M 1 with φn as in (A.1), we have a representation Tk φ 1 n (z) = X ℓ1,ℓ2 Nn 0 ℓ1,ℓ2 2M 1 t=1 Re (zt)ℓ1 t Im (zt)ℓ2 t with suitably chosen coefficients ak ℓ1,ℓ2 C. By using the identities Re (zt) = 1 2 (zt + zt) and also Im (zt) = 1 2i (zt zt) we can rewrite Tk φ 1 n into a complex polynomial in z and z, i.e., Tk φ 1 n (z) = X ℓ1,ℓ2 Nn 0 ℓ1,ℓ2 4M 2 bk ℓ1,ℓ2zℓ1zℓ2 with complex coefficients bk ℓ1,ℓ2 C. Using Theorem 3.1, we choose ρ1, ..., ρm Cn and b C, such that for any polynomial P Tk φ 1 n : k 2M 1 Pn 4M 2 there are coefficients σ1(P), ..., σm(P) C, such that g P P L (Ωn;C) M k n, (C.8) t=1 σt(P)ϕ ρT t z + b . Note that here we implicitly use the bound (4 (4M 2) + 1)2n m. We are now going to show that the chosen constant and the chosen vectors ρt have the desired property. Let f Ck (Ωn; C). By splitting f into real and imaginary part, we write f = f1 + i f2 with f1, f2 Ck (Ωn; R). For the following, fix j {1, 2} and note that fj φn Ck [ 1, 1]2n; R . By choice of c1, there exists a polynomial P with the property fj φn P L ([ 1,1]2n;R) c1 M k fj φn Ck([ 1,1]2n;R) or equivalently fj P φ 1 n L (Ωn;R) c1 M k fj Ck(Ωn;R) , (C.9) where P φ 1 n can be written in the form P φ 1 n (z) = X 0 k 2M 1 VM k (fj φn) Tk φ 1 n (z). We choose the function g Tk φ 1 n according to (C.8). Thus, writing 0 k 2M 1 VM k (fj φn) g Tk φ 1 n , we obtain P φ 1 n gj L (Ωn;R) X VM k (fj φn) Tk φ 1 n g Tk φ 1 n L (Ωn;R) | {z } M k n VM k (fj φn) M k fj φn Ck([ 1,1]2n;R) = c2 M k fj Ck(Ωn;R) . (C.10) Combining (C.9) and (C.10), we see fj gj L (Ωn;R) c1 + c2 M k fj Ck(Ωn;R) c1 + c2 M k f Ck(Ωn;C) . In the end, define g := g1 + i g2. Since the vectors ρt have been chosen fixed, it is clear that, after rearranging, g has the desired form, i.e., g = σT Φ where Φ(z) = (ϕ(ρtz + b))m t=1. Furthermore, one obtains the bound f g L (Ωn;C) q f1 g1 2 L (Ωn;R) + f2 g2 2 L (Ωn;R) f 2 Ck(Ωn;C) + f 2 Ck(Ωn;C) 2 (c1 + c2) M k f Ck(Ωn;C) . Using (C.7), we see f g L (Ωn;C) 2 25k (c1 + c2) mk/2n f Ck(Ωn;C) , as desired. The linearity and continuity of the maps f 7 σj(f) (with respect to the L -norm) follow easily from the fact that the map f 7 VM k (f) is a continuous linear functional for every multiindex 0 k 2M 1. Approximation of partial derivatives Approximation of polynomials Approximation of Ck-functions Divided Differences Wirtinger derivatives of w 7 ϕ(w T z + b) Chebyshev polynomials Figure 4: Schematic for the proof of the main result (Theorem 3.2). The first row shows the different steps of the proof and the second row indicates the main tools used. D Approximation of holomorphically extendable functions In this appendix we provide the proofs for the statements contained in Remark 3.3. The proofs mainly rely on results about sparse polynomial approximation [2]. Definition D.1 (cf. [2, Assumption 2.3]). Let s N and ν (1, )s. For every j {1, ..., s} let Eνj := z + z 1 2 : z C, 1 |z| νj We then define the (filled-in) Bernstein polyellipse of parameter ν as Eν := Eν1 Eνs Cs and observe that [ 1, 1]s Eν when we interpret [ 1, 1]s as a subset of Cs. We further define Vs(ν) := f : [ 1, 1]s C : open U Eν and f : U C holomorphic with f [ 1,1]s = f . Here, [ 1, 1]s is again interpreted as a subset of Cs. Moreover, note that such an extension f is, if existent, unique, as follows from the identity theorem for holomorphic functions. Hence, the expression f Vs(ν) := f L (Eν;C) is well-defined. Definition D.2 (cf. [2, pp. 25, 28 ff.]). Let s N. 
We define a probability measure on [ 1, 1]s via We define the normalized Chebyshev polynomials for k Ns 0 as f Tk(x) := 2 k 0/2 s Y j=1 cos(kj arccos(xj)), where k 0 := #{1 j s : kj = 0}. Note that this definition differs slightly from the notion used in Appendices C.1 and C.3. The following lemma is crucial for deriving the approximation rate in Theorem D.5. The proofs can be found in [2]. Lemma D.3 (cf. [2, Remark 2.15 and Theorem 3.2]). Let s N, k Ns 0, ν (1, )s and f Vs(ν). Then it holds 1. e Tk L ([ 1,1]s;R) = 2 k 0/2, where k 0 = #{1 j s : kj = 0}; 2. | e Tk, f µs| ν k 2 k 0/2 f Vs(ν). It is a well-known fact that the Chebyshev polynomials f Tk form an orthonormal basis of L2 µs([ 1, 1]s; C). The following proposition states that functions from Vs(ν) can even be approximated uniformly by linear combinations of the f Tk at a certain rate. The proof follows essentially by applying Lemma D.3. Proposition D.4. Let s, m N and ν (1, )s. Let ν := min j=1,...,sνj. Then there exists a constant c = c(s, ν) > 0 with the following property: For every f Vs(ν), defining f, e Tk µs e Tk, it holds f Pm L ([ 1,1]s;C) c ν m f Vs(ν). Proof. Let f Vs(ν). Since the f Tk form an orthonormal basis of L2 µs([ 1, 1]s; C) and since µs is a probability measure, so that f C([ 1, 1]s; C) L2 µ2([ 1, 1]s; C), it follows that k Ns 0 f, f Tk µs f Tk with unconditional convergence in L2 µs. For x [ 1, 1]s, note that X k Ns 0 | f, f Tk µs| |f Tk(x)| X k Ns 0 | f, f Tk µs| f Tk L ([ 1,1]s;C) 2s f Vs(ν) X k Ns 0 ν k. Here, we employed Lemma D.3 at the last inequality. From ν > 1 it follows that X k Ns 0 f, f Tk µs f Tk converges pointwise (even uniformly) and, since all the involved functions are continuous, this pointwise limit then has to coincide with the L2 µs-limit f. Hence, it holds f Pm L ([ 1,1]s;C) X | f, f Tk µs| f Tk L ([ 1,1]s;C) 2s f Vs(ν) X k Ns 0 |k| m where we again used Lemma D.3. To complete the proof we compute k Ns 0 |k| m k Ns 0 kj>m k=m+1 ν k j ν (m+1) j Y j=1 ν (m+1) j and define c(s, ν) := 2s s Qs ℓ=1 νj νj 1 ν 1. To formulate the result for the approximation of holomorphically extendable functions using CVNNs we need to transfer the definition of Vs(ν) to the complex setting. For ν (1, )2n and with φn as in Equation (A.1), we hence write Wn(ν) := n f : Ωn C : f φn [ 1,1]2n V2n(ν) o For f Wn(ν) we define f Wn(ν) := f φn V2n(ν). Thus, Wn(ν) consists of all complex-valued functions defined on Ωn that can be holomorphically extended onto some polyellipse in C2n, where Ωn Cn is interpreted as a subset of R2n and then as a subset of C2n. The final approximation result then reads as follows. Theorem D.5. Let n N and ν (1, )2n. Set ν := min 1 j 2nνj. Then there exists a constant c = c(n, ν) > 0 with the following property: For every function ϕ : C C that is smooth and non-polyharmonic on some open set = U C and for every m N there exists a first layer Φ Fϕ n,m with the property that for every f Wn(ν) there exist coefficients σ = σ(f) Cm such that f σT Φ L (Ωn;C) c ν m1/(2n)/17 f Wn(ν). Moreover, the map f 7 σ(f) is a continuous linear functional with respect to the L -norm. Proof. Choose M N as the largest integer satisfying (8M + 1)2n m, where we assume without loss of generality that 92n m. This can be done by choosing σ = 0 for m < 92n at the cost of possibly enlarging c. Note that the maximality of M implies (8M + 9)2n > m. By using 8M + 9 17M, this gives us 17 m1/(2n). (D.1) Choose the constant c1 = c1(2n, ν) according to Proposition D.4. Fix k N2n 0 with k M. 
Since f Tk is a polynomial of componentwise degree at most M, we have a representation f Tk φ 1 n (z) = X ℓ1,ℓ2 Nn 0 ℓ1,ℓ2 M t=1 Re (zt)ℓ1 t Im (zt)ℓ2 t with suitably chosen coefficients ak ℓ1,ℓ2 C. By using the identities Re (zt) = 1 2 (zt + zt) and also Im (zt) = 1 2i (zt zt), we can rewrite f Tk φ 1 n into a complex polynomial in z and z, i.e., f Tk φ 1 n (z) = X ℓ1,ℓ2 Nn 0 ℓ1,ℓ2 2M bk ℓ1,ℓ2zℓ1zℓ2 with complex coefficients bk ℓ1,ℓ2 C. Using Theorem 3.1, we choose ρ1, ..., ρm Cn and b C, such that for any polynomial P n f Tk φ 1 n : k M o Pn 2M there exist coefficients σ1(P), ..., σm(P) C, such that g P P L (Ωn;C) k N2n 0 ν k ν (M+1), (D.2) t=1 σt(P)ϕ ρT t z + b . Note that here we implicitly use the bound (4 (2M) + 1)2n m. We are now going to show that the first layer Φ Fϕ n,m defined using the ρt and b (i.e., Φ(z) = (ϕ(ρT t z + b))m t=1) has the desired property. To this end, take an arbitrary function f Wn(ν). Proposition D.4 tells us that f PM φ 1 n L (Ωn;C) = f φn PM L ([ 1,1]2n;C) c1 ν (M+1) f φn V2n(ν), (D.3) where PM φ 1 n = X k N2n 0 k M f φn, f Tk µ2n (f Tk φ 1 n ). We then define the approximating network k N2n 0 k M f φn, f Tk µ2n gf Tk φ 1 n . From the definition of the functions gf Tk φ 1 n it follows directly that g is a shallow CVNN with first layer Φ Fϕ n,m. Furthermore, it holds PM φ 1 n g L (Ωn;C) X k N2n 0 k M | f φn, f Tk µ2n| f Tk φ 1 n gf Tk φ 1 n L (Ωn;C) (D.2),Lem. D.3 2n f φn V2n(ν) X k N2n 0 k M 2n f Wn(ν) ν (M+1). Combining this estimate with Equation (D.3) and applying the triangle inequality gives us f g L (Ωn;C) (c1 + 2n) ν (M+1) f L (Eν;C). Hence, the claim follows taking c := c1 + 2n and using (D.1). The continuity of the linear map f 7 σ(f) with respect to the L -norm follows directly from the fact that f 7 f φn, f Tk µ2n is trivially continuous with respect to the L2 µ2n-norm and hence also continuous with respect to the L -norm since µ2n is a probability measure. E Postponed proofs for the optimality results in the case of continuous weight selection In this section we provide the proofs for the optimality results derived if a continuous weight selection is assumed. Specifically, we prove Theorems 4.1 and 5.1. The proofs of both results rely on a very general result about the approximation in normed spaces by subsets that are parametrizable with m parameters [19]. We decided to include a detailed proof for this approximation result in this paper since the nature of the continuity assumption in [19] is not completely clear. Proposition E.1 ([19, Theorem 3.1]). Let (X, X) be a normed space, = K X a subset and V X a linear, not necessarily closed subspace of X containing K. Let m N, let a : K Rm be a map which is continuous with respect to some norm V on V and M : Rm X some arbitrary map. Let bm(K)X := sup Xm+1 sup {ϱ 0 : Uϱ(Xm+1) K} , (E.1) where the first supremum is taken over all (m + 1)-dimensional linear subspaces Xm+1 of X and Uϱ(Xm+1) := {y Xm+1 : y X ϱ}. Further, we set bm(K)X := 0 if the supremum in (E.1) is not well-defined as a quantity in [0, ]. Then it holds sup x K x M(a(x)) X bm(K)X. Proof. The claim is trivial if bm(K)X = 0. Thus, assume bm(K)X > 0. Let 0 < ϱ bm(K)X be any number such that there exists an (m+1)-dimensional subspace Xm+1 of X with Uϱ(Xm+1) K. It follows Uϱ(Xm+1) V , hence Xm+1 V , so V defines a norm on Xm+1. Thus, the restriction of a to Uϱ(Xm+1) is a continuous mapping to Rm with respect to V . 
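(The next step invokes the Borsuk-Ulam theorem: a continuous map from a sphere in an $(m+1)$-dimensional space to $\mathbb{R}^m$ must identify some pair of antipodal points. A toy illustration for $m = 1$ — our own code, with an arbitrary choice of the map $a$ — finds such a pair on the unit circle by bisecting the antipodal difference $g(t) = a(x(t)) - a(-x(t))$, which satisfies $g(t + \pi) = -g(t)$.)

```python
import numpy as np

def a(p):
    # an arbitrary continuous "parameter selection" map R^2 -> R (purely illustrative)
    return p[0] ** 3 + np.sin(p[1]) + 0.3 * p[0] * p[1]

def g(t):
    # antipodal difference along the unit circle; g(t + pi) = -g(t)
    p = np.array([np.cos(t), np.sin(t)])
    return a(p) - a(-p)

lo, hi = 0.0, np.pi
assert g(lo) * g(hi) <= 0          # guaranteed by the antisymmetry of g
for _ in range(60):                # bisect for a sign change
    mid = 0.5 * (lo + hi)
    if g(lo) * g(mid) > 0:
        lo = mid
    else:
        hi = mid
t = 0.5 * (lo + hi)
assert abs(g(t)) < 1e-12           # a(x) = a(-x) at x = (cos t, sin t)
```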
Since all norms are equivalent on the finite-dimensional space $X_{m+1}$, the Borsuk-Ulam theorem [18, Corollary 4.2] yields the existence of a point $x_0 \in U_\varrho(X_{m+1})$ with $\|x_0\|_X = \varrho$ and $a(x_0) = a(-x_0)$. We then see
\[
2\varrho = 2\|x_0\|_X \le \|x_0 - M(a(x_0))\|_X + \|x_0 + M(a(-x_0))\|_X = \|x_0 - M(a(x_0))\|_X + \|{-x_0} - M(a(-x_0))\|_X,
\]
and hence at least one of the two summands on the right has to be larger than or equal to $\varrho$.

E.1 Proof of Theorem 4.1

Using Proposition E.1, we can deduce our lower bound in the context of $C^k$-spaces. The proof is in fact almost identical to the one of [19, Theorem 4.2]. However, we decided to include a detailed proof in this paper, since [19] considers Sobolev functions rather than $C^k$-functions.

Theorem E.2. Let $s, k \in \mathbb{N}$. Then there exists a constant $c = c(s,k) > 0$ with the following property: For any $m \in \mathbb{N}$, any map $a : C^k([-1,1]^s; \mathbb{R}) \to \mathbb{R}^m$ that is continuous with respect to some norm on $C^k([-1,1]^s; \mathbb{R})$, and any (possibly discontinuous) map $M : \mathbb{R}^m \to C([-1,1]^s; \mathbb{R})$, we have
\[
\sup_{\substack{f \in C^k([-1,1]^s;\mathbb{R}) \\ \|f\|_{C^k([-1,1]^s;\mathbb{R})} \le 1}} \|f - M(a(f))\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge c \cdot m^{-k/s}.
\]

Proof. The idea is to apply Proposition E.1 to $X := C([-1,1]^s; \mathbb{R})$, $V := C^k([-1,1]^s; \mathbb{R})$ and the set $K := \{f \in C^k([-1,1]^s; \mathbb{R}) : \|f\|_{C^k([-1,1]^s;\mathbb{R})} \le 1\}$.

Assume in the beginning that $m = n^s$ with an integer $n > 1$. Pick $\varphi \in C^\infty(\mathbb{R}^s)$ with $\varphi \equiv 1$ on $[-3/4, 3/4]^s$ and $\varphi \equiv 0$ outside of $[-1,1]^s$, and fix $c_0 = c_0(s,k) > 0$ with $1 \le \|\varphi\|_{C^k([-1,1]^s;\mathbb{R})} \le c_0$. Let $Q_1, \ldots, Q_m$ be the partition (disjoint up to null sets) of $[-1,1]^s$ into closed cubes of side length $2/n$. For every $j \in \{1, \ldots, m\}$ we write $Q_j = \prod_{\ell=1}^s [a^{(j)}_\ell - 1/n,\; a^{(j)}_\ell + 1/n]$ with an appropriately chosen vector $a^{(j)} = (a^{(j)}_1, \ldots, a^{(j)}_s) \in [-1,1]^s$, and let $\varphi_j(x) := \varphi(n(x - a^{(j)}))$ for $x \in \mathbb{R}^s$. By choice of $\varphi$, the maps $\varphi_j$ are supported on a proper subset of $Q_j$ for every $j \in \{1, \ldots, m\}$, and an inductive argument shows $\partial^{\alpha} \varphi_j(x) = n^{|\alpha|} (\partial^{\alpha} \varphi)(n(x - a^{(j)}))$ for every $\alpha \in \mathbb{N}_0^s$ and $x \in \mathbb{R}^s$, and hence in particular
\[
\|\varphi_j\|_{C^k([-1,1]^s;\mathbb{R})} \le n^k c_0. \tag{E.2}
\]
Let $X_m := \operatorname{span}\{\varphi_1, \ldots, \varphi_m\}$ and $S \in U_1(X_m) = \{f \in X_m : \|f\|_{L^\infty([-1,1]^s;\mathbb{R})} \le 1\}$. Then we can write $S$ in the form $S = \sum_{j=1}^m c_j \varphi_j$ with real numbers $c_1, \ldots, c_m \in \mathbb{R}$. Suppose there exists $j^* \in \{1, \ldots, m\}$ with $|c_{j^*}| > 1$. Then we have $\|S\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge |S(a^{(j^*)})| = |c_{j^*}| > 1$, since the functions $\varphi_j$ have disjoint supports and $\varphi_j(a^{(j)}) = 1$. This is a contradiction to $S \in U_1(X_m)$, and we can thus infer $\max_j |c_j| \le 1$. Furthermore, we see, again because the functions $\varphi_j$ have disjoint supports, that
\[
\|\partial^{\alpha} S\|_{L^\infty([-1,1]^s;\mathbb{R})} \le \max_j |c_j| \cdot \|\partial^{\alpha} \varphi_j\|_{L^\infty([-1,1]^s;\mathbb{R})} \overset{\text{(E.2)}}{\le} n^{|\alpha|} c_0 \le c_0\, n^k = c_0\, m^{k/s}
\]
for every $\alpha \in \mathbb{N}_0^s$ with $|\alpha| \le k$, and hence $\|S\|_{C^k([-1,1]^s;\mathbb{R})} \le c_0\, m^{k/s}$. Thus, letting $\varrho := c_0^{-1} m^{-k/s}$ yields $U_\varrho(X_m) \subseteq K$, so we see by Proposition E.1 that
\[
\sup_{f \in K} \|f - M_{m-1}(a(f))\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge \varrho = c_1\, m^{-k/s} \quad \text{with } c_1 = c_0^{-1}
\]
for every map $a : V \to \mathbb{R}^{m-1}$ which is continuous with respect to some norm on $V$ and any map $M_{m-1} : \mathbb{R}^{m-1} \to X$. Using the inequality $m \le 2(m-1)$ (note $m > 1$), we get
\[
\sup_{f \in K} \|f - M_{m-1}(a(f))\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge c_1\, m^{-k/s} \ge c_1\, (2(m-1))^{-k/s} = c_2\, (m-1)^{-k/s}
\]
with $c_2 = c_1\, 2^{-k/s}$. Hence, the claim has been shown for all numbers $m$ of the form $n^s - 1$ with an integer $n > 1$.

In the end, let $m \in \mathbb{N}$ be arbitrary and pick $n \in \mathbb{N}$ with $n^s \le m < (n+1)^s$. For given maps $a : V \to \mathbb{R}^m$ and $M : \mathbb{R}^m \to X$ with $a$ continuous with respect to some norm on $V$, let
\[
\tilde{a} : V \to \mathbb{R}^{(n+1)^s - 1}, \quad f \mapsto (a(f), 0) \qquad \text{and} \qquad M_{(n+1)^s - 1} : \mathbb{R}^{(n+1)^s - 1} \to X, \quad (x, y) \mapsto M(x),
\]
where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^{(n+1)^s - 1 - m}$. Then we get
\[
\sup_{f \in K} \|f - M(a(f))\|_{L^\infty([-1,1]^s;\mathbb{R})} = \sup_{f \in K} \|f - M_{(n+1)^s - 1}(\tilde{a}(f))\|_{L^\infty([-1,1]^s;\mathbb{R})} \ge c_2\, ((n+1)^s - 1)^{-k/s} \ge c_2\, (2^s n^s)^{-k/s} \ge c_3\, m^{-k/s}
\]
with $c_3 = c_2\, 2^{-k}$. Here we used the bound $(n+1)^s - 1 \le (2n)^s$ together with $n^s \le m$. This proves the full claim.

Using this theorem, we can now prove Theorem 4.1.

Proof of Theorem 4.1.
Let a : Ck(Ωn; C) Cm be any map that is continuous with respect to some norm V on Ck(Ωn; C), and let M : Cm C(Ωn; C) be arbitrary. With φn, φm defined as in Equation (A.1), let a : Ck([ 1, 1]2n; R) R2m, f 7 φ 1 m a f φ 1 n Ωn Clearly, a is continuous on Ck([ 1, 1]2n; R) with respect to the norm V on Ck([ 1, 1]2n; R) defined as f V := f φ 1 n Ωn V for f Ck([ 1, 1]2n; R). Let f M : R2m C([ 1, 1]2n; R), f M(x) := Re(M(φm(x))) φn [ 1,1]2n. Then it holds sup f Ck(Ωn;C) f Ck(Ωn;C) 1 f M(a(f)) L (Ωn;C) sup f Ck(Ωn;R) f Ck(Ωn;R) 1 f Re(M(a(f))) L (Ωn;R) = sup f Ck([ 1,1]2n;R) f Ck([ 1,1]2n;R) 1 f φ 1 n Re M a f φ 1 n Ωn = sup f Ck([ 1,1]2n;R) f Ck([ 1,1]2n;R) 1 f Re M φm φ 1 m a f φ 1 n Ωn φn L ([ 1,1]2n;R) = sup f Ck([ 1,1]2n;R) f Ck([ 1,1]2n;R) 1 f f M( a( f)) L ([ 1,1]2n;R) c (2m) k/(2n), with a constant c = c(n, k) provided by Theorem E.2. Hence, the claim follows by choosing c = c(n, k) := 2 k/(2n) c. As a corollary, we formulate a special case of Theorem 4.1 for the case of shallow complex-valued neural networks. Corollary E.3. Let n, k N. Then there exists a constant c = c(n, k) > 0 with the following property: For any m N, ϕ C(C; C) and any map η : Ck (Ωn; C) (Cn)m Cm Cm, g 7 (η1(g), η2(g), η3(g)) which is continuous with respect to some norm on Ck(Ωn; C), there exists f Ck (Ωn; C) satisfying f Ck(Ωn;C) 1 and f Ψ(f) L (Ωn;C) c m k/(2n), where Ψ(f) C(Ωn; C) is given by j=1 (η3(f))j ϕ [η1(f)]T j z + (η2(f))j . Proof. Using Theorem 4.1, we deduce that there exists f Ck(Ωn; C) satisfying f Ck(Ωn;C) 1 and f Ψ(f) L (Ωn;C) c (m(n + 2)) k/(2n) for a constant c = c (n, k) > 0. Hence, the claim follows by letting c := c (n + 2) k/(2n). E.2 Proof of Theorem 5.1 We can use Proposition E.1 not only to show that the rate of convergence established in this paper is optimal (which is done in Appendix E.1) but also to show that the problem of approximating Ck-functions using a set of functions that can be parametrized with finitely many parameters is intractable in the sense that it suffers from the curse of dimensionality, provided that the map which assigns to each Ck-function the parameters of the approximating function is continuous. This is the subject of this section. In [35] a certain space of polynomials was used to show the intractability in the case of linear approximation methods. We are also going to use this class of polynomials, but combine it with Proposition E.1 to infer intractability in the case of continuous approximation methods. We start with a lemma discussing an important property of this space of polynomials. This property is stated as part of a proof in [35], but no complete proof is provided. Lemma E.4. Let s N and consider a function f C ([ 1, 1]s; R) which is given via k {0,1}s akxk (E.3) with coefficients ak R for every k {0, 1}s. Then it holds f Ck([ 1,1]s;R) = f L ([ 1,1]s;R) for every k N. Proof. The proof is by induction over s. We start with the case s = 1 and note that we can write f(x) = ax + b with a, b R in that case. Switching to f if necessary, we can assume a 0. Clearly, f L ([ 1,1];R) |a| + |b|. Conversely, if b 0 then |f(1)| = |a + b| = |a| + |b|. If otherwise b < 0 then |f( 1)| = |b a| = |a b| = |a| + |b|. Thus, f L ([ 1,1];R) = |a| + |b|. For the derivatives, we have f L ([ 1,1];R) = |a| and f (k) L ([ 1,1];R) = 0 for k 2. This proves the claim in the case s = 1. We now assume that the claim holds for some arbitrary but fixed s N. We further let α Ns+1 0 and fix a point (x1, ..., xs+1) [ 1, 1]s+1. We decompose α = (α , αs+1) with α Ns 0. 
Let ef : [ 1, 1] R, ys+1 7 (α ,0)f(x1, ..., xs, ys+1) and note αf(x1, ..., xs+1) = ef (αs+1)(xs+1). Note that f is affine-linear with respect to each variable (with all other variables hold fixed). Hence, ef is an affine function and we can thus apply the case s = 1 to ef and get ef (αs+1) L ([ 1,1];R) ef L ([ 1,1];R). Putting this together, we infer | αf(x1, ..., xs+1)| ef L ([ 1,1];R) = sup ys+1 [ 1,1] | (α ,0)f(x1, ..., xs, ys+1)|. (E.4) We now fix an arbitrary point ys+1 [ 1, 1] and consider bf : [ 1, 1]s R, (y1, ..., ys) 7 f(y1, ..., ys, ys+1). Then it holds (α ,0)f(x1, ..., xs, ys+1) = α bf(x1, ..., xs). Applying the induction hypothesis to bf (which is easily seen to be of the form (E.3)) we get | α bf(x1, ..., xs)| bf L ([ 1,1]s;R) f L ([ 1,1]s+1;R). (E.5) Combining (E.4) and (E.5) yields | αf(x1, ..., xs+1)| f L ([ 1,1]s+1;R). Since α Ns+1 0 was arbitrary, we get the claim by noting that f Ck([ 1,1]s;R) f L ([ 1,1]s;R) holds trivially for every k N. Using the above lemma, we can now deduce that the approximation of smooth functions using continuous approximation methods is intractable in terms of the input dimension. Proof of Theorem 5.1. We apply Proposition E.1 to X := C([ 1, 1]s; R), V := C , ,s and to the set K := {f C , ,s : f C ([ 1,1]s;R) 1} and m := 2s 1. The space [ 1, 1]s x 7 X k {0,1}s akxk : ak R consisting of all functions considered in the previous Lemma E.4 is an (m + 1)-dimensional subspace of C([ 1, 1]s; R). For every f Xm+1 with f L ([ 1,1]s;R) 1, Lemma E.4 tells us f C ([ 1,1]s;R) 1. Hence, U1(Xm+1) K and Proposition E.1 then yields the claim. Remark E.5. The statement of Theorem 5.1 also holds if the functions satisfy a : C , ,s Rm and M : Rm C([ 1, 1]s; R) with m 2s 1. This can be seen by defining a : C , ,s R2s 1, f 7 (a(f), 0, ..., 0) and f M : R2s 1 C([ 1, 1]s; R), (a, b) 7 M(a) with a Rm and b R2s 1 m. The following Corollary E.6 transfers Theorem 5.1 to the complex-valued setting. Corollary E.6. Let n N. For any function f C (Ωn; C) we write f C (Ωn;C) := sup k N f Ck(Ωn;C) and let C , ,n C denote the space consisting of all functions for which this expression is finite. Let a : C , ,n C C22n 1 1 be continuous with respect to some norm on C , ,n C and moreover, let M : C22n 1 1 C(Ωn; C) be an arbitrary map. Then it holds sup f C , ,n C f C (Ωn;C) 1 f M(a(f)) L (Ωn;C) 1. Proof. The transfer to the complex-valued setting works in the same manner as the proof of Theorem 4.1 (see Appendix E.1). We write m := 22n 1 1 and note 2m = 22n 2 22n 1. We define a : C , ,2n R2m and f M : R2m C([ 1, 1]2n; R) in the same way as in the proof of Theorem 4.1. Using again the same technique as in the proof of Theorem 4.1, we get sup f C , ,n C f C (Ωn;C) 1 f M(a(f)) L (Ωn;C) sup f C , ,2n f C ([ 1,1]2n;R) 1 f f M( a( f)) L ([ 1,1]2n;R) 1, applying Theorem 5.1 in the last inequality, using 2m 22n 1. We conclude this appendix by adding a note on the constant appearing in our main approximation bound. Corollary E.7. Let n N with n 2 and α > 0 and let ϕ C(C; C). Let c = c(n, α) > 0 be such that for every m N there exists a mapping η : C , ,n C (Cn)m Cm Cm, g 7 (η1(g), η2(g), η3(g)) that is continuous with respect to any norm on C , ,n C and such that f Ψ(f) L (Ωn;C) ( c m) α f C (Ωn;C), for every f C , ,n C . Here, Ψ(f) C(Ωn; C) is given by j=1 (η3(f))j ϕ [η1(f)]T j z + (η2(f))j . Then it necessarily holds c 16 2 n. Proof. We first assume n 4. We take m = j 22n 1 1 n+2 k and note that then m(n + 2) 22n 1 1. 
Therefore, Corollary E.6 applies and we infer that for each ε (0, 1), there exists f = fε C , ,n C with f C (Ωn;C) 1 and such that 1 ε f Ψ(f) L (Ωn;C) ( c m) α f C (Ωn;C) ( c m) α . This then necessarily implies c m 1 or equivalently c 1/m. It therefore suffices to derive a lower bound for m. Firstly, we note 22n 1 = 2n 3 2n+2 = 2n 3 (1 + 1)n+2 2n 3(n + 3), where we applied Bernoulli s inequality. Because of n 4 3, this yields 22n 1 1 2n 3(n + 3) 2n 3 = 2n 3(n + 2). Hence, we get n + 2 1 2n 3 1 = 2n 3(1 23 n) 2n 4 = 2n Here, we used n 4 in the last inequality. An explicit computation shows that the same bounds also holds in the cases n = 2 and n = 3. This proves the claim. F Postponed proofs for the optimality results in the case of unrestricted weight selection F.1 Approximation using Ridge Functions In this section we prove for s N 2 that every function in Ck([ 1, 1]s; R) can be uniformly approximated with an error of the order m k/(s 1) using a linear combination of m so-called ridge functions. In fact, we only consider ridge polynomials, meaning functions of the form Rs R, x 7 p(a T x) for a fixed vector a Rs and a polynomial p : R R. Note that this result has already been obtained in a slightly different form in [30, Theorem 1]; namely, it is shown there that the rate of approximation m k/(s 1) can be achieved by functions of the form Pm j=1 fj(a T j x) with aj Rs and fj L1 loc(Rs). We will need the fact that the fj can actually be chosen as polynomials and that the vectors a1, ..., am can be chosen independently from the particular function f. This is shown in the proof of [30], but not stated explicitly. For this reason, and in order to clarify the proof itself and to make the paper more self-contained, we decided to present the proof in this appendix. Lemma F.1. Let m, s N. Then we denote by Rs R, x 7 X k Ns 0 |k| m akxk : ak R the set of real polynomials of degree at most m. The subset of homogeneous polynomials of degree m is defined as Rs R, x 7 X k Ns 0 |k|=m akxk : ak R Then there exists a constant c = c(s) > 0 satisfying dim(Hs m) c ms 1 m N. Proof. It is immediate that the set Rs R, x 7 xk : k Ns 0, |k| = m forms a basis of Hs m, hence dim(Hs m) = # {k Ns 0 : |k| = m} . This quantity clearly equals the number of possibilities for drawing m times from a set with s elements with replacement. Hence, we see dim(Hs m) = s + m 1 m see for instance [12, Identity 143]. A further estimation shows s + m 1 m (1 + m)s 1 2s 1 ms 1. Hence, the claim follows with c(s) = 2s 1. A combination of results from [36] together with the fact that it is possible to approximate Ckfunctions using polynomials of degree at most m with an error of the order m k, as shown in Theorem C.15, yields the desired result. Theorem F.2. Let s, k N with s 2 and r > 0. Then there exists a constant c = c(s, k) > 0 with the following property: For every m N there exist a1, ..., am Rs \ {0} with aj 2 = r, such that for every function f Ck([ 1, 1]s; R) there exist polynomials p1, ..., pm P 1 m satisfying f(x) j=1 pj(a T j x) L ([ 1,1]s;R) c m k/(s 1) f Ck([ 1,1]s;R). Proof. We first pick the constant c1 = c1(s) according to Lemma F.1. Then we define the constant c2 = c2(s) := (2s)s 1 c1(s) and let M N be the largest integer satisfying c2 M s 1 m. Here, we assume without loss of generality that m c2, which can be justified by choosing pj = 0 for every j {1, ..., m} if m < c2, at the cost of possibly enlarging c. 
A combination of results from [36], together with the fact that one can approximate $C^k$-functions using polynomials of coordinatewise degree at most $m$ with an error of the order $m^{-k}$ (as shown in Theorem C.15), yields the desired result.

Theorem F.2. Let $s, k \in \mathbb{N}$ with $s \ge 2$ and let $r > 0$. Then there exists a constant $c = c(s,k) > 0$ with the following property: For every $m \in \mathbb{N}$ there exist $a_1, \dots, a_m \in \mathbb{R}^s \setminus \{0\}$ with $\|a_j\|_2 = r$, such that for every function $f \in C^k([-1,1]^s;\mathbb{R})$ there exist polynomials $p_1, \dots, p_m \in P^1_m$ satisfying
\[ \Big\| f - \sum_{j=1}^m p_j(a_j^T \cdot) \Big\|_{L^\infty([-1,1]^s;\mathbb{R})} \le c \, m^{-k/(s-1)} \|f\|_{C^k([-1,1]^s;\mathbb{R})}. \]

Proof. We first pick the constant $c_1 = c_1(s)$ according to Lemma F.1. Then we define the constant $c_2 = c_2(s) := (2s)^{s-1} c_1(s)$ and let $M \in \mathbb{N}$ be the largest integer satisfying $c_2 M^{s-1} \le m$. Here, we assume without loss of generality that $m \ge c_2$, which can be justified by choosing $p_j = 0$ for every $j \in \{1,\dots,m\}$ if $m < c_2$, at the cost of possibly enlarging $c$.

Note that the choice of $M$ implies $c_2 (2M)^{s-1} \ge c_2 (M+1)^{s-1} > m$, and thus
\[ M \ \ge\ \tfrac{1}{2}\, c_2^{-1/(s-1)}\, m^{1/(s-1)} \ =\ c_3 \, m^{1/(s-1)} \tag{F.1} \]
with $c_3 = c_3(s) := \tfrac{1}{2}\, c_2^{-1/(s-1)}$. Using [36, Proposition 5.9] and Lemma F.1 we can pick $a_1, \dots, a_m \in \mathbb{R}^s \setminus \{0\}$ satisfying
\[ H^s_{s(2M-1)} = \mathrm{span}\big\{ x \mapsto (a_j^T x)^{s(2M-1)} \ :\ j \in \{1,\dots,m\} \big\}, \tag{F.2} \]
where we used that $c_1 (s(2M-1))^{s-1} \le c_1 (2s)^{s-1} M^{s-1} = c_2 M^{s-1} \le m$. Here we can assume $\|a_j\|_2 = r$ for every $j \in \{1,\dots,m\}$, since multiplying each $a_j$ with a positive constant does not change the span in (F.2). From [36, Corollary 5.12] we infer that
\[ P^s_{s(2M-1)} = \mathrm{span}\big\{ x \mapsto (a_j^T x)^{\ell} \ :\ j \in \{1,\dots,m\},\ 0 \le \ell \le s(2M-1) \big\}. \tag{F.3} \]

Let $f \in C^k([-1,1]^s;\mathbb{R})$. Then, according to Theorem C.15, there exists a polynomial $P : \mathbb{R}^s \to \mathbb{R}$ of coordinatewise degree at most $2M-1$ satisfying
\[ \|f - P\|_{L^\infty([-1,1]^s;\mathbb{R})} \le c_4 \, M^{-k} \|f\|_{C^k([-1,1]^s;\mathbb{R})}, \]
where $c_4 = c_4(s,k) > 0$. Note that by construction it holds that $P \in P^s_{s(2M-1)}$. Using (F.3) we deduce the existence of polynomials $p_1, \dots, p_m : \mathbb{R} \to \mathbb{R}$ such that
\[ P(x) = \sum_{j=1}^m p_j(a_j^T x) \quad \text{for all } x \in \mathbb{R}^s. \]
Combining the previously shown bounds, we get
\[ \Big\| f - \sum_{j=1}^m p_j(a_j^T \cdot) \Big\|_{L^\infty([-1,1]^s;\mathbb{R})} = \|f - P\|_{L^\infty([-1,1]^s;\mathbb{R})} \le c_4 \, M^{-k} \|f\|_{C^k([-1,1]^s;\mathbb{R})} \overset{\text{(F.1)}}{\le} c \, m^{-k/(s-1)} \|f\|_{C^k([-1,1]^s;\mathbb{R})}, \]
as desired. Here, we defined $c = c(s,k) := c_4 \, c_3^{-k}$.

F.2 Proof of Theorem 4.2

Using Theorem F.2, we can prove the following statement for complex-valued $C^k$-functions, which will play an important role in the proof of Theorem 4.2.

Proposition F.3. Let $n, k \in \mathbb{N}$. Then there exists a constant $c = c(n,k) > 0$ with the following property: For any $m \in \mathbb{N}$ there exist complex vectors $b_1, \dots, b_m \in \mathbb{C}^n$ with $\|b_j\|_2 = 1/\sqrt{2n}$ for $j = 1,\dots,m$ and with the property that for any function $f \in C^k(\Omega_n;\mathbb{C})$ there exist functions $g_1, \dots, g_m \in C(\Omega_1;\mathbb{C})$ such that
\[ \Big\| f - \sum_{j=1}^m g_j(b_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le c \, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})}. \]
Note that the vectors $b_1, \dots, b_m$ can be chosen independently of the considered function $f$, whereas $g_1, \dots, g_m$ do depend on $f$.

Proof. Theorem F.2 (applied with $s = 2n$ and $r = 1/\sqrt{2n}$) yields the existence of a constant $c_1 = c_1(n,k) > 0$ with the property that for any $m \in \mathbb{N}$ there exist real vectors $a_1, \dots, a_m \in \mathbb{R}^{2n}$ with $\|a_j\|_2 = 1/\sqrt{2n}$ such that every function in $C^k([-1,1]^{2n};\mathbb{R})$ can be approximated by $\sum_{j=1}^m \bar g_j(a_j^T \cdot)$ with continuous $\bar g_1,\dots,\bar g_m$, up to an error of $c_1\, m^{-k/(2n-1)}$ times its $C^k$-norm. We then define the vectors $b_1, \dots, b_m \in \mathbb{C}^n$ componentwise via
\[ (b_j)_\ell := (a_j)_\ell - i \,(a_j)_{n+\ell}, \quad \ell \in \{1,\dots,n\},\ j \in \{1,\dots,m\}. \]
First we see $\|b_j\|_2 = \|a_j\|_2 = 1/\sqrt{2n}$.

We first consider real-valued functions, i.e., $f \in C^k(\Omega_n;\mathbb{R})$. Let $\varphi_n$ be defined as in (A.1). By the choice of the constant $c_1$ we can find continuous functions $\bar g_1, \dots, \bar g_m \in C([-1,1];\mathbb{R})$ such that
\[ \Big\| (f \circ \varphi_n) - \sum_{j=1}^m \bar g_j(a_j^T \cdot) \Big\|_{L^\infty([-1,1]^{2n})} \le c_1 \, m^{-k/(2n-1)} \|f \circ \varphi_n\|_{C^k([-1,1]^{2n};\mathbb{R})}. \]
We then define $g_j \in C(\Omega_1;\mathbb{R})$ by $g_j(z) := \bar g_j(\mathrm{Re}(z))$ for $j \in \{1,\dots,m\}$. For $z \in \Omega_n$ we then have
\[ g_j(b_j^T z) = \bar g_j\Big( \mathrm{Re}\Big( \sum_{\ell=1}^n (b_j)_\ell z_\ell \Big) \Big) = \bar g_j\Big( \sum_{\ell=1}^n \mathrm{Re}\Big( \big[(a_j)_\ell - i (a_j)_{n+\ell}\big]\big[\varphi_n^{-1}(z)_\ell + i \varphi_n^{-1}(z)_{n+\ell}\big] \Big) \Big) = \bar g_j\Big( \sum_{\ell=1}^n \big[ (a_j)_\ell \varphi_n^{-1}(z)_\ell + (a_j)_{n+\ell} \varphi_n^{-1}(z)_{n+\ell} \big] \Big) = \bar g_j\big( a_j^T \varphi_n^{-1}(z) \big). \tag{F.4} \]
Therefore,
\[ \Big\| f - \sum_{j=1}^m g_j(b_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{R})} = \Big\| f \circ \varphi_n - \sum_{j=1}^m \bar g_j(a_j^T \cdot) \Big\|_{L^\infty([-1,1]^{2n};\mathbb{R})} \le c_1 \, m^{-k/(2n-1)} \|f \circ \varphi_n\|_{C^k([-1,1]^{2n};\mathbb{R})}. \]

By the above, for $f \in C^k(\Omega_n;\mathbb{C})$ we can pick functions $g^{\mathrm{Re}}_1, \dots, g^{\mathrm{Re}}_m, g^{\mathrm{Im}}_1, \dots, g^{\mathrm{Im}}_m \in C(\Omega_1;\mathbb{R})$ satisfying
\[ \Big\| \mathrm{Re}(f) - \sum_{j=1}^m g^{\mathrm{Re}}_j(b_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{R})} \le c_1 \, m^{-k/(2n-1)} \|\mathrm{Re}(f \circ \varphi_n)\|_{C^k([-1,1]^{2n};\mathbb{R})}, \]
\[ \Big\| \mathrm{Im}(f) - \sum_{j=1}^m g^{\mathrm{Im}}_j(b_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{R})} \le c_1 \, m^{-k/(2n-1)} \|\mathrm{Im}(f \circ \varphi_n)\|_{C^k([-1,1]^{2n};\mathbb{R})}. \]
Defining $g_j := g^{\mathrm{Re}}_j + i \, g^{\mathrm{Im}}_j$ and using $\|\mathrm{Re}(f \circ \varphi_n)\|_{C^k}, \|\mathrm{Im}(f \circ \varphi_n)\|_{C^k} \le \|f\|_{C^k(\Omega_n;\mathbb{C})}$ yields
\[ \Big\| f - \sum_{j=1}^m g_j(b_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le 2 c_1 \, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})}, \]
completing the proof with $c := 2c_1$.
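The identification used in (F.4) is easy to check numerically. The sketch below (assuming NumPy, and assuming that $\varphi_n$ from (A.1) is the identification $x \mapsto x_{1:n} + i\,x_{n+1:2n}$) verifies that $\mathrm{Re}(b_j^T z) = a_j^T \varphi_n^{-1}(z)$ and that $b_j^T z \in \Omega_1$ whenever $z \in \Omega_n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a = rng.standard_normal(2 * n)
a *= 1 / (np.sqrt(2 * n) * np.linalg.norm(a))   # enforce ||a||_2 = 1/sqrt(2n)

# complex vector from Proposition F.3: (b)_l = (a)_l - i (a)_{n+l}
b = a[:n] - 1j * a[n:]
assert np.isclose(np.linalg.norm(b), np.linalg.norm(a))  # ||b||_2 = ||a||_2

# phi_n identifies x in [-1,1]^{2n} with z = x[:n] + i x[n:] in Omega_n
x = rng.uniform(-1, 1, size=2 * n)
z = x[:n] + 1j * x[n:]

# key identity (F.4): Re(b^T z) = a^T x, so a ridge function of a^T x
# becomes a function of b^T z via g(w) = g_bar(Re(w))
assert np.isclose((b @ z).real, a @ x)
# and |b^T z| <= ||b||_2 ||z||_2 <= 1, so b^T z stays inside Omega_1
assert abs(b @ z) <= 1 + 1e-12
print("Re(b^T z) = a^T x verified; |b^T z| =", abs(b @ z))
```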
The special activation function that yields the improved approximation rate of $m^{-k/(2n-1)}$ (see Theorem 4.2) is constructed in the following lemma.

Lemma F.4. Let $\{u_\ell\}_{\ell=1}^\infty$ be an enumeration of the set of complex polynomials in $z$ and $\bar z$ with coefficients in $\mathbb{Q} + i\,\mathbb{Q}$. Then there exists a function $\phi \in C^\infty(\mathbb{C};\mathbb{C})$ with the following properties:

1. For every $\ell \in \mathbb{N}$ and $z \in \Omega_1$ one has $\phi(z + 3\ell) = u_\ell(z)$.
2. $\phi$ is non-polyharmonic.

Proof. Let $\psi \in C^\infty(\mathbb{C};\mathbb{R})$ with $0 \le \psi \le 1$, $\psi \equiv 1$ on $\Omega_1$ and $\mathrm{supp}(\psi) \subseteq \widetilde\Omega$, where $\widetilde\Omega := \{z \in \mathbb{C} : |\mathrm{Re}(z)|, |\mathrm{Im}(z)| < \tfrac{3}{2}\}$. We then define
\[ \phi := \psi \cdot f + \sum_{\ell=1}^\infty u_\ell(\cdot - 3\ell)\, \psi(\cdot - 3\ell), \quad \text{where } f(z) := e^{\mathrm{Re}(z)}. \]
Note that $\phi$ is smooth, since it is a locally finite sum of smooth functions. Furthermore, $\phi$ is non-polyharmonic on the interior of $\Omega_1$ (where $\phi \equiv f$), since the calculation in the proof of Proposition A.2 shows for $z$ in the interior of $\Omega_1$ and $\rho : \mathbb{R} \to \mathbb{R},\ t \mapsto e^t$, that
\[ \partial_{\mathrm{wirt}}^m \bar\partial_{\mathrm{wirt}}^\ell \phi(z) = \partial_{\mathrm{wirt}}^m \bar\partial_{\mathrm{wirt}}^\ell f(z) = \frac{1}{2^{m+\ell}}\, \rho^{(m+\ell)}(\mathrm{Re}(z)) > 0 \]
for arbitrary $m, \ell \in \mathbb{N}_0$. Finally, property (1) follows directly by construction of $\phi$, because $(\widetilde\Omega + 3\ell) \cap (\widetilde\Omega + 3\ell') = \emptyset$ for $\ell \ne \ell'$.

Using the properties of the special activation function constructed in Lemma F.4 and applying the approximation result from Proposition F.3, we can now prove Theorem 4.2.

Proof of Theorem 4.2. Let $\phi$ be the activation function constructed in Lemma F.4 and choose the constant $c$ according to Proposition F.3. Let $m \in \mathbb{N}$ and $f \in C^k(\Omega_n;\mathbb{C})$; without loss of generality we may assume $f \not\equiv 0$. According to Proposition F.3, we can choose $\rho_1, \dots, \rho_m \in \mathbb{C}^n$ with $\|\rho_j\|_2 = 1/\sqrt{2n}$ and $g_1, \dots, g_m \in C(\Omega_1;\mathbb{C})$ with the property
\[ \Big\| f - \sum_{j=1}^m g_j(\rho_j^T \cdot) \Big\|_{L^\infty(\Omega_n)} \le c \, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})}. \]
Recall from Lemma F.4 that $\{u_\ell\}_{\ell=1}^\infty$ is an enumeration of the set of complex polynomials in $z$ and $\bar z$ with coefficients in $\mathbb{Q}+i\,\mathbb{Q}$. Hence, using the complex version of the Stone–Weierstraß theorem (see for instance [21, Theorem 4.51]), we can pick $\ell_1, \dots, \ell_m \in \mathbb{N}$ such that
\[ \|g_j - u_{\ell_j}\|_{L^\infty(\Omega_1;\mathbb{C})} \le m^{-1-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})} \tag{F.5} \]
for every $j \in \{1,\dots,m\}$; note that the right-hand side is positive precisely because $f \not\equiv 0$. Since $\phi(\cdot + 3\ell) = u_\ell$ on $\Omega_1$ for each $\ell \in \mathbb{N}$, and since $\rho_j^T z \in \Omega_1$ for $j \in \{1,\dots,m\}$ and $z \in \Omega_n$ (because $|\rho_j^T z| \le \|\rho_j\|_2 \|z\|_2 \le \frac{1}{\sqrt{2n}} \cdot \sqrt{2n} = 1$), we estimate
\[ \Big\| f - \sum_{j=1}^m \phi(\rho_j^T \cdot + 3\ell_j) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le \Big\| f - \sum_{j=1}^m g_j(\rho_j^T \cdot) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} + \sum_{j=1}^m \big\| g_j(\rho_j^T \cdot) - \phi(\rho_j^T \cdot + 3\ell_j) \big\|_{L^\infty(\Omega_n;\mathbb{C})} \]
\[ \le c \, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})} + \sum_{j=1}^m \|g_j - u_{\ell_j}\|_{L^\infty(\Omega_1;\mathbb{C})} \overset{\text{(F.5)}}{\le} c \, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})} + m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})} = (c+1)\, m^{-k/(2n-1)} \|f\|_{C^k(\Omega_n;\mathbb{C})}. \]
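The construction of Lemma F.4 is entirely explicit and can be reproduced numerically. The following sketch (assuming NumPy; the concrete bump $\psi$ and the finite stand-in for the enumeration $\{u_\ell\}$ are our own illustrative choices) builds a truncated version of $\phi$ and checks property (1) on a random point of $\Omega_1$:

```python
import numpy as np

def _f(t):
    # C^infinity transition building block: exp(-1/t) for t > 0, else 0
    return np.where(t > 0, np.exp(-1.0 / np.maximum(t, 1e-300)), 0.0)

def _cutoff(t):
    # smooth in t, equals 1 for |t| <= 1 and 0 for |t| >= 3/2
    a, b = _f(1.5 - np.abs(t)), _f(np.abs(t) - 1.0)
    return a / (a + b)

def psi(z):
    # bump with psi = 1 on Omega_1 and supp(psi) inside |Re|, |Im| < 3/2
    return _cutoff(z.real) * _cutoff(z.imag)

# a few polynomials in z and z_bar standing in for the enumeration u_l
u = {1: lambda z: z, 2: lambda z: np.conj(z), 3: lambda z: z * np.conj(z) + 0.5}

def phi(z):
    # truncation of Lemma F.4: psi(z) e^{Re z} + sum_l u_l(z - 3l) psi(z - 3l)
    val = psi(z) * np.exp(z.real)
    for l, u_l in u.items():
        val = val + u_l(z - 3 * l) * psi(z - 3 * l)
    return val

# check property (1): phi(z + 3l) = u_l(z) for z in Omega_1, since the
# translated supports Omega_tilde + 3l are pairwise disjoint
rng = np.random.default_rng(2)
z = rng.uniform(-1, 1) + 1j * rng.uniform(-1, 1)
for l, u_l in u.items():
    assert np.isclose(phi(z + 3 * l), u_l(z))
print("phi(z + 3l) = u_l(z) on Omega_1, as in Lemma F.4")
```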
F.3 Proof of Theorem 4.3

As a preparation for the proof of Theorem 4.3, we first prove a similar result in the real-valued setting. We remark that the proof idea is inspired by the proof of [46, Theorem 4].

Theorem F.5. Let $n, k \in \mathbb{N}$ and let $\phi : \mathbb{R} \to \mathbb{R},\ \phi(x) := \frac{1}{1+e^{-x}}$ be the sigmoid function. Then there exists a constant $c = c(n,k) > 0$ with the following property: If the numbers $\varepsilon \in (0,\tfrac12)$ and $m \in \mathbb{N}$ are such that for every function $f \in C^k([-1,1]^n;\mathbb{R})$ with $\|f\|_{C^k([-1,1]^n;\mathbb{R})} \le 1$ there exist coefficients $\rho_1, \dots, \rho_m \in \mathbb{R}^n$, $\eta_1, \dots, \eta_m \in \mathbb{R}$ and $\sigma_1, \dots, \sigma_m \in \mathbb{R}$ satisfying
\[ \Big\| f - \sum_{j=1}^m \sigma_j \, \phi(\rho_j^T \cdot + \eta_j) \Big\|_{L^\infty([-1,1]^n;\mathbb{R})} \le \varepsilon, \]
then necessarily
\[ m \ \ge\ c \cdot \frac{\varepsilon^{-n/k}}{\ln(1/\varepsilon)}. \]

Proof. We first pick a function $\psi \in C^\infty(\mathbb{R}^n;\mathbb{R})$ with the property that $\psi(0) = 1$ and $\psi(x) = 0$ for every $x \in \mathbb{R}^n$ with $\|x\|_2 > \tfrac14$. We then choose $c_1 = c_1(n,k) := \big( \|\psi\|_{C^k([-1,1]^n;\mathbb{R})} \big)^{-1}$. Now, let $\varepsilon \in (0,\tfrac12)$ and $m \in \mathbb{N}$ be arbitrary with the property stated in the formulation of the theorem. If $\varepsilon > \frac{c_1}{2} \cdot 6^{-k}$, then $m \ge c\, \varepsilon^{-n/k}/\ln(1/\varepsilon)$ trivially holds (as long as $c = c(n,k) > 0$ is sufficiently small). Hence, we can assume that $\varepsilon \le \frac{c_1}{2} \cdot 6^{-k}$.

Now, let $N$ be the smallest integer with $N \ge 2$ for which $c_1 2^{-(k+1)} N^{-k} \le \varepsilon$. Note that $\varepsilon \le \frac{c_1}{2} \cdot 6^{-k} < c_1 2^{-(k+1)} \cdot 2^{-k}$, so the defining inequality fails for $N = 2$; hence $N \ge 3$ and in particular $N - 1 \ge 2$. Therefore, by minimality of $N$, and since $N \le 2(N-1)$ because of $N \ge 2$, it follows that
\[ \varepsilon \ <\ c_1 2^{-(k+1)} (N-1)^{-k} \ \le\ c_1 2^{-(k+1)} 2^k N^{-k} \ =\ \frac{c_1}{2}\, N^{-k}. \tag{F.6} \]
Now, for every $\alpha \in \{-N,\dots,N\}^n$ pick $z_\alpha \in \{0,1\}$ arbitrary and let $y_\alpha := z_\alpha c_1 N^{-k}$. Define the function
\[ f(x) := \sum_{\alpha \in \{-N,\dots,N\}^n} y_\alpha \, \psi(Nx - \alpha), \quad x \in \mathbb{R}^n. \]
Clearly, $f \in C^\infty(\mathbb{R}^n;\mathbb{R})$. Furthermore, since the supports of the functions $\psi(N\cdot - \alpha)$, $\alpha \in \mathbb{Z}^n$, are pairwise disjoint, we see for any multi-index $\beta \in \mathbb{N}_0^n$ with $|\beta| \le k$ that
\[ \|\partial^\beta f\|_{L^\infty([-1,1]^n;\mathbb{R})} \le N^{|\beta|} \max_\alpha |y_\alpha| \, \|\partial^\beta \psi\|_{L^\infty([-1,1]^n;\mathbb{R})} \le N^k \max_\alpha |y_\alpha| \, \|\psi\|_{C^k([-1,1]^n;\mathbb{R})} \le 1, \]
so we conclude that $\|f\|_{C^k([-1,1]^n;\mathbb{R})} \le 1$. Additionally, for any fixed $\beta \in \{-N,\dots,N\}^n$ we see $f(\beta/N) = y_\beta$ by the choice of $\psi$. See also Figure 5 for an illustration of the function $f$.

By assumption, we can choose suitable coefficients $\rho_1, \dots, \rho_m \in \mathbb{R}^n$, $\eta_1, \dots, \eta_m \in \mathbb{R}$ and $\sigma_1, \dots, \sigma_m \in \mathbb{R}$ such that $\|f - g\|_{L^\infty([-1,1]^n;\mathbb{R})} \le \varepsilon$, where $g := \sum_{j=1}^m \sigma_j \, \phi(\rho_j^T \cdot + \eta_j)$. Setting
\[ \hat g := g(\cdot/N) = \sum_{j=1}^m \sigma_j \, \phi\big( (\rho_j/N)^T \cdot + \eta_j \big), \tag{F.7} \]
we see for every $\alpha \in \{-N,\dots,N\}^n$ that
\[ \hat g(\alpha) \ \ge\ y_\alpha - \varepsilon = c_1 N^{-k} - \varepsilon \overset{\text{(F.6)}}{>} \frac{c_1}{2} N^{-k} \quad \text{if } z_\alpha = 1, \qquad \hat g(\alpha) \ \le\ y_\alpha + \varepsilon = \varepsilon \overset{\text{(F.6)}}{<} \frac{c_1}{2} N^{-k} \quad \text{if } z_\alpha = 0. \]
Therefore, we get $\mathbb{1}_{\hat g > (c_1/2) N^{-k}}(\alpha) = z_\alpha$ for any $\alpha \in \{-N,\dots,N\}^n$. Since the choice of the $z_\alpha$ has been arbitrary, it follows that the set system
\[ \mathcal{H} := \Big\{ \big\{ \hat g > (c_1/2) N^{-k} \big\} \cap \{-N,\dots,N\}^n \ :\ \hat g \text{ of the form (F.7)} \Big\} \]
shatters the whole set $\{-N,\dots,N\}^n$. Therefore, we conclude that $\mathrm{VC}(\mathcal{H}) \ge (2N+1)^n \ge N^n$. On the other hand, [4, Theorem 8.11] shows that
\[ \mathrm{VC}(\mathcal{H}) \ \le\ 2m(n+2)\log_2(60\,n\,N) \ \le\ c_3\, m \ln(N) \]
with a suitably chosen constant $c_3 = c_3(n)$. Here we used that $N \ge 3$, so that $\ln(N) \ge \ln(3) > 0$. Combining those two inequalities yields
\[ m \ \ge\ \frac{N^n}{c_3 \ln(N)}. \]
Using that $N \ge c_4\, \varepsilon^{-1/k}$ with $c_4 = c_4(n,k) := \big( c_1 2^{-(k+1)} \big)^{1/k}$ and $N \le c_5\, \varepsilon^{-1/k}$ with $c_5 = c_5(n,k) := (c_1/2)^{1/k}$, we see that
\[ m \ \ge\ \frac{c_4^n\, \varepsilon^{-n/k}}{c_3 \ln\big( c_5\, \varepsilon^{-1/k} \big)} \ \ge\ c_6 \cdot \frac{\varepsilon^{-n/k}}{\ln(1/\varepsilon)} \]
with $c_6 = c_6(n,k) > 0$ chosen appropriately.

Figure 5: Illustration of the function $f$ considered in the proof of Theorem F.5.

As a corollary, we get a similar result for complex-valued neural networks.

Corollary F.6. Let $n, k \in \mathbb{N}$ and $\phi : \mathbb{C} \to \mathbb{C},\ \phi(z) := \frac{1}{1+e^{-\mathrm{Re}(z)}}$. Then there exists a constant $c = c(n,k) > 0$ with the following property: If $\varepsilon \in (0,\tfrac12)$ and $m \in \mathbb{N}$ are such that for every function $f \in C^k(\Omega_n;\mathbb{C})$ with $\|f\|_{C^k(\Omega_n;\mathbb{C})} \le 1$ there exist coefficients $\rho_1, \dots, \rho_m \in \mathbb{C}^n$, $\eta_1, \dots, \eta_m \in \mathbb{C}$ and $\sigma_1, \dots, \sigma_m \in \mathbb{C}$ satisfying
\[ \Big\| f - \sum_{j=1}^m \sigma_j \, \phi(\rho_j^T \cdot + \eta_j) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le \varepsilon, \]
then necessarily
\[ m \ \ge\ c \cdot \frac{\varepsilon^{-2n/k}}{\ln(1/\varepsilon)}. \]

Proof. We choose the constant $c = c(2n,k)$ according to the previous Theorem F.5 and let $\varphi_n$ be as in (A.1). Then, let $\varepsilon \in (0,\tfrac12)$ and $m \in \mathbb{N}$ with the properties assumed in the statement of the corollary. If we then take an arbitrary function $f \in C^k([-1,1]^{2n};\mathbb{R})$ with $\|f\|_{C^k([-1,1]^{2n};\mathbb{R})} \le 1$, we deduce the existence of $\rho_1, \dots, \rho_m \in \mathbb{C}^n$, $\eta_1, \dots, \eta_m \in \mathbb{C}$ and $\sigma_1, \dots, \sigma_m \in \mathbb{C}$ such that
\[ \Big\| f - \mathrm{Re}\Big( \sum_{j=1}^m \sigma_j \, \phi\big( \rho_j^T \varphi_n(\cdot) + \eta_j \big) \Big) \Big\|_{L^\infty([-1,1]^{2n};\mathbb{R})} \le \Big\| (f \circ \varphi_n^{-1}) - \sum_{j=1}^m \sigma_j \, \phi(\rho_j^T \cdot + \eta_j) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} \le \varepsilon. \]
In the next step, we show that
\[ x \mapsto \mathrm{Re}\Big( \sum_{j=1}^m \sigma_j \, \phi\big( \rho_j^T \varphi_n(x) + \eta_j \big) \Big) \]
is a real-valued shallow neural network with $m$ neurons in the hidden layer and the real sigmoid function as activation function; the claim then follows using Theorem F.5 in dimension $2n$.

For every $j \in \{1,\dots,m\}$ we pick a matrix $\tilde\rho_j \in \mathbb{R}^{2n \times 2}$ with the property that $\tilde\rho_j^{\,T} \varphi_n^{-1}(z) = \varphi_1^{-1}(\rho_j^T z)$ for every $z \in \mathbb{C}^n$. This is possible, since this is equivalent to $\tilde\rho_j^{\,T} v = \varphi_1^{-1}(\rho_j^T \varphi_n(v))$ for all $v \in \mathbb{R}^{2n}$, where the right-hand side is an $\mathbb{R}$-linear map $\mathbb{R}^{2n} \to \mathbb{R}^2$. Denoting the first column of $\tilde\rho_j$ by $\hat\rho_j$, we get $\hat\rho_j^{\,T} \varphi_n^{-1}(z) = \mathrm{Re}(\rho_j^T z)$ for all $z \in \mathbb{C}^n$. Writing $\gamma$ for the classical real-valued sigmoid function (i.e., $\gamma(x) = \frac{1}{1+e^{-x}}$), we deduce for arbitrary $x \in \mathbb{R}^{2n}$ that
\[ \mathrm{Re}\Big( \sum_{j=1}^m \sigma_j \, \phi\big( \rho_j^T \varphi_n(x) + \eta_j \big) \Big) = \mathrm{Re}\Big( \sum_{j=1}^m \sigma_j \, \gamma\big( \mathrm{Re}(\rho_j^T \varphi_n(x)) + \mathrm{Re}(\eta_j) \big) \Big) = \sum_{j=1}^m \mathrm{Re}(\sigma_j) \, \gamma\big( \hat\rho_j^{\,T} x + \mathrm{Re}(\eta_j) \big), \]
where in the last step we used that $\gamma$ is real-valued. As noted above, this completes the proof.
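The shattering construction in the proof of Theorem F.5 can also be made concrete. The sketch below (assuming NumPy; the values of $k$, $N$ and $c_1$ are illustrative stand-ins, whereas in the proof $c_1 = \|\psi\|_{C^k}^{-1}$) builds $f = \sum_\alpha y_\alpha \psi(Nx - \alpha)$ in dimension $n = 1$ for a random bit pattern and confirms that the grid values reproduce the pattern:

```python
import numpy as np

# smooth bump with psi(0) = 1 and psi(x) = 0 for |x| > 1/4 (here n = 1)
def psi(x):
    t = 1.0 - (4.0 * x) ** 2                     # positive iff |x| < 1/4
    return np.where(t > 0, np.exp(1.0 - 1.0 / np.maximum(t, 1e-300)), 0.0)

k, N, c1 = 2, 10, 1.0                            # illustrative values
rng = np.random.default_rng(3)
alphas = np.arange(-N, N + 1)
z = rng.integers(0, 2, size=alphas.size)         # arbitrary bit pattern z_alpha
y = z * c1 * N ** (-float(k))                    # y_alpha = z_alpha c1 N^{-k}

def f(x):
    # f(x) = sum_alpha y_alpha psi(N x - alpha); supports pairwise disjoint
    return sum(ya * psi(N * x - a) for a, ya in zip(alphas, y))

# the grid values reproduce the bit pattern: f(alpha/N) = y_alpha
assert np.allclose([f(a / N) for a in alphas], y)
# hence any g with ||f - g||_sup <= eps < (c1/2) N^{-k} satisfies
# 1_{g(alpha/N) > (c1/2) N^{-k}} = z_alpha: the 2N+1 grid points are shattered
print("f encodes the pattern:", z[:10], "...")
```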
Now, we can finally prove Theorem 4.3.

Proof of Theorem 4.3. Let $\alpha := \frac{2n}{k}$ and choose $c_2 = c_2(\alpha) = c_2(n,k) > 0$ such that the inequality $\ln(x) \le c_2 x^{\alpha/2}$ holds for all $x \ge 1$. Furthermore, let $c_1 = c_1(n,k) > 0$ be as in Corollary F.6. By choosing $c = c(n,k) > 0$ sufficiently small, we can ensure that $\varepsilon_m := c\,(m \ln(m))^{-k/(2n)}$ satisfies
\[ \ln(c_1/c_2) + \frac{\alpha}{2} \ln(1/\varepsilon_m) \ \ge\ \frac{\alpha}{4} \ln(1/\varepsilon_m) \quad \text{for all } m \in \mathbb{N}_{\ge 2}. \]
By further shrinking $c = c(n,k) > 0$ if necessary, we may assume $c\,(2\ln(2))^{-k/(2n)} < \tfrac12$ and hence $\varepsilon_m = c\,(m\ln(m))^{-k/(2n)} < \tfrac12$ for all $m \in \mathbb{N}_{\ge 2}$. Finally, setting $c_3 := \frac{\alpha}{4}$ and shrinking $c$ even further, we can arrange that $c^\alpha < c_1 c_3$.

Now, assume towards a contradiction that for some $m \in \mathbb{N}_{\ge 2}$ and every $f \in C^k(\Omega_n;\mathbb{C})$ with $\|f\|_{C^k(\Omega_n;\mathbb{C})} \le 1$ there are coefficients $\rho_1, \dots, \rho_m \in \mathbb{C}^n$, $\sigma_1, \dots, \sigma_m \in \mathbb{C}$ and $\eta_1, \dots, \eta_m \in \mathbb{C}$ such that
\[ \Big\| f - \sum_{j=1}^m \sigma_j \, \phi(\rho_j^T \cdot + \eta_j) \Big\|_{L^\infty(\Omega_n;\mathbb{C})} < c\,(m \ln(m))^{-k/(2n)}. \]
Applying Corollary F.6, we then get
\[ m \ \ge\ c_1 \frac{\varepsilon^{-2n/k}}{\ln(1/\varepsilon)} \quad \text{with } \varepsilon := \varepsilon_m = c\,(m \ln m)^{-k/(2n)} \in \big(0,\tfrac12\big) \]
and $c_1 = c_1(n,k) > 0$. Recall from the beginning of the proof that $\alpha = 2n/k$ and that $\ln(x) \le c_2 x^{\alpha/2}$ for every $x \ge 1$. Applying this with $x = 1/\varepsilon$ gives $\ln(\ln(1/\varepsilon)) \le \ln(c_2) + \frac{\alpha}{2} \ln(1/\varepsilon)$, which implies
\[ \ln(m) \ \ge\ \ln(c_1) + \alpha \ln(1/\varepsilon) - \ln(\ln(1/\varepsilon)) \ \ge\ \ln(c_1/c_2) + \frac{\alpha}{2} \ln(1/\varepsilon) \ \ge\ \frac{\alpha}{4} \ln(1/\varepsilon) \ =\ c_3 \ln(1/\varepsilon). \]
Overall, using $\varepsilon^{-2n/k} = c^{-\alpha}\, m \ln(m)$, we then get
\[ m \ \ge\ c_1 \frac{\varepsilon^{-2n/k}}{\ln(1/\varepsilon)} \ =\ c_1 c^{-\alpha} \frac{m \ln(m)}{\ln(1/\varepsilon)} \ \ge\ c_1 c_3 c^{-\alpha} \frac{m \ln(m)}{\ln(m)} \ =\ c_1 c_3 c^{-\alpha} m. \]
Since we chose $c$ such that $c^\alpha < c_1 c_3$, we get the desired contradiction.
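Finally, the contradiction in the proof of Theorem 4.3 can be observed numerically: once $c^\alpha < c_1 c_3$, the error level $\varepsilon_m = c\,(m \ln m)^{-k/(2n)}$ forces, via Corollary F.6, a neuron count strictly larger than $m$ itself. In the sketch below (assuming NumPy; the value of $c_1$ is a hypothetical stand-in for the constant from Corollary F.6), the ratio of the forced lower bound to $m$ stays above $1$:

```python
import numpy as np

n, k = 2, 3
alpha = 2 * n / k
c1 = 0.5                              # hypothetical constant from Corollary F.6
c3 = alpha / 4
c = 0.9 * (c1 * c3) ** (1 / alpha)    # ensures c**alpha < c1*c3, as in the proof

for m in [10, 10**3, 10**6, 10**9]:
    eps = c * (m * np.log(m)) ** (-k / (2 * n))        # claimed achievable error
    assert 0 < eps < 0.5
    required = c1 * eps ** (-2 * n / k) / np.log(1 / eps)  # bound from Cor. F.6
    # the lower bound exceeds m itself, so no m can achieve the error eps_m
    print(f"m = {m:>10}:  required/m = {required / m:.2f}  (> 1, contradiction)")
```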